package cv
Type Members
- class BLIPForQuestionAnswering extends AnnotatorModel[BLIPForQuestionAnswering] with HasBatchedAnnotateImage[BLIPForQuestionAnswering] with HasImageFeatureProperties with WriteTensorflowModel with HasEngine
BLIPForQuestionAnswering can load BLIP models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.
Pretrained models can be loaded with pretrained of the companion object:

val visualQAClassifier = BLIPForQuestionAnswering.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

The default model is "blip_vqa_base", if no name is provided. For available pretrained models please see the Models Hub.
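To pin a specific checkpoint rather than the default, the name (and optionally the language) can be passed explicitly; this mirrors the two-argument pretrained(name, lang) call shown in the Phi3Vision example further down this page. The name below is the default listed above, and "en" is an assumed language tag.

// Load a specific pretrained checkpoint by name and language.
// "blip_vqa_base" is the default model named above; "en" is assumed here.
val visualQAClassifier = BLIPForQuestionAnswering
  .pretrained("blip_vqa_base", "en")
  .setInputCols("image_assembler")
  .setOutputCol("answer")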
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/BLIPForQuestionAnsweringTest.scala.
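As a rough sketch only (the actual export and import steps are described in the linked discussion, not on this page), a model exported from HuggingFace can usually be imported from a local folder through the companion object's loadSavedModel method and then saved as a Spark NLP model; the local paths below are hypothetical.

// Hypothetical local folder containing a model exported from HuggingFace
// (see the linked discussion for the export procedure).
val localModelPath = "/tmp/blip_vqa_exported"

val importedVQA = BLIPForQuestionAnswering
  .loadSavedModel(localModelPath, spark)
  .setInputCols("image_assembler")
  .setOutputCol("answer")

// Optionally persist it for later reuse with BLIPForQuestionAnswering.load(...)
importedVQA.write.overwrite().save("/tmp/blip_vqa_spark_nlp")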
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = ResourceHelper.spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load(imageFolder)

val testDF: DataFrame = imageDF.withColumn("text", lit("What's this picture about?"))

val imageAssembler: ImageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val visualQAClassifier = BLIPForQuestionAnswering.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(
  imageAssembler,
  visualQAClassifier
))

val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(false)

+--------------------------------------+------+
|origin                                |result|
+--------------------------------------+------+
|[file:///content/images/cat_image.jpg]|[cats]|
+--------------------------------------+------+
- See also
CLIPForZeroShotClassification for Zero Shot Image Classifier
Annotators Main Page for a list of transformer based classifiers
- class CLIPForZeroShotClassification extends AnnotatorModel[CLIPForZeroShotClassification] with HasBatchedAnnotateImage[CLIPForZeroShotClassification] with HasImageFeatureProperties with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine with HasRescaleFactor
Zero Shot Image Classifier based on CLIP.
CLIP (Contrastive Language-Image Pre-Training) is a neural network that was trained on image and text pairs. It has the ability to predict images without training on any hard-coded labels. This makes it very flexible, as labels can be provided during inference. This is similar to the zero-shot capabilities of the GPT-2 and 3 models.
Pretrained models can be loaded with pretrained of the companion object:

val imageClassifier = CLIPForZeroShotClassification.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("label")

The default model is "zero_shot_classifier_clip_vit_base_patch32", if no name is provided. For available pretrained models please see the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see CLIPForZeroShotClassificationTestSpec.
Example
import com.johnsnowlabs.nlp.ImageAssembler
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF = ResourceHelper.spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load("src/test/resources/image/")

val imageAssembler: ImageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val candidateLabels = Array(
  "a photo of a bird",
  "a photo of a cat",
  "a photo of a dog",
  "a photo of a hen",
  "a photo of a hippo",
  "a photo of a room",
  "a photo of a tractor",
  "a photo of an ostrich",
  "a photo of an ox")

val imageClassifier = CLIPForZeroShotClassification
  .pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("label")
  .setCandidateLabels(candidateLabels)

val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)).fit(imageDF).transform(imageDF)

pipeline
  .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "label.result")
  .show(truncate = false)

+-----------------+-----------------------+
|image_name |result |
+-----------------+-----------------------+
|palace.JPEG |[a photo of a room] |
|egyptian_cat.jpeg|[a photo of a cat] |
|hippopotamus.JPEG|[a photo of a hippo] |
|hen.JPEG |[a photo of a hen] |
|ostrich.JPEG |[a photo of an ostrich]|
|junco.JPEG |[a photo of a bird] |
|bluetick.jpg |[a photo of a dog] |
|chihuahua.jpg |[a photo of a dog] |
|tractor.JPEG |[a photo of a tractor] |
|ox.JPEG |[a photo of an ox] |
+-----------------+-----------------------+
- class ConvNextForImageClassification extends SwinForImageClassification
ConvNextForImageClassification is an image classifier based on ConvNet models.
The ConvNeXT model was proposed in A ConvNet for the 2020s by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. ConvNeXT is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, that claims to outperform them.
Pretrained models can be loaded with pretrained of the companion object:

val imageClassifier = ConvNextForImageClassification.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("class")

The default model is "image_classifier_convnext_tiny_224_local", if no name is provided. For available pretrained models please see the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see ConvNextForImageClassificationTestSpec.
References:
A ConvNet for the 2020s
Paper Abstract:
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
Example
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.ImageAssembler
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load("src/test/resources/image/")

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val imageClassifier = ConvNextForImageClassification
  .pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineDF = pipeline.fit(imageDF).transform(imageDF)

pipelineDF
  .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result")
  .show(truncate = false)

+-----------------+----------------------------------------------------------+
|image_name |result |
+-----------------+----------------------------------------------------------+
|palace.JPEG |[palace] |
|egyptian_cat.jpeg|[tabby, tabby cat] |
|hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
|hen.JPEG |[hen] |
|ostrich.JPEG |[ostrich, Struthio camelus] |
|junco.JPEG |[junco, snowbird] |
|bluetick.jpg |[bluetick] |
|chihuahua.jpg |[Chihuahua] |
|tractor.JPEG |[tractor] |
|ox.JPEG |[ox] |
+-----------------+----------------------------------------------------------+
- class Florence2Transformer extends AnnotatorModel[Florence2Transformer] with HasBatchedAnnotateImage[Florence2Transformer] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
Florence2: Advancing a Unified Representation for a Variety of Vision Tasks
Florence-2 is an advanced vision foundation model from Microsoft that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. It can interpret simple text prompts to perform tasks like captioning, object detection, segmentation, OCR, and more. The model leverages the FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. Its sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings.
Pretrained and finetuned models can be loaded with pretrained of the companion object:

val florence2 = Florence2Transformer.pretrained()
  .setInputCols("image")
  .setOutputCol("generation")

The default model is "florence2_base_ft_int4", if no name is provided. For available pretrained models please see the Models Hub.
Supported Tasks
Florence-2 supports a variety of tasks through prompt engineering. The following prompt tokens can be used (a short usage sketch follows the list):
- <CAPTION>: Image captioning
- <DETAILED_CAPTION>: Detailed image captioning
- <MORE_DETAILED_CAPTION>: Paragraph-level captioning
- <CAPTION_TO_PHRASE_GROUNDING>: Phrase grounding from caption (requires additional text input)
- <OD>: Object detection
- <DENSE_REGION_CAPTION>: Dense region captioning
- <REGION_PROPOSAL>: Region proposal
- <OCR>: Optical Character Recognition (plain text extraction)
- <OCR_WITH_REGION>: OCR with region information
- <REFERRING_EXPRESSION_SEGMENTATION>: Segmentation for a referred phrase (requires additional text input)
- <REGION_TO_SEGMENTATION>: Polygon mask for a region (requires additional text input)
- <OPEN_VOCABULARY_DETECTION>: Open vocabulary detection for a phrase (requires additional text input)
- <REGION_TO_CATEGORY>: Category of a region (requires additional text input)
- <REGION_TO_DESCRIPTION>: Description of a region (requires additional text input)
- <REGION_TO_OCR>: OCR for a region (requires additional text input)
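As an illustrative sketch only (this page does not show it directly), and assuming the task token travels in the same "text" column used by the other multimodal annotators on this page, preparing an object-detection prompt could look like this:

import org.apache.spark.sql.functions.lit

// Read images the same way as the other examples on this page.
val imageDF = spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load("src/test/resources/image/")

// Assumption: the task prompt is supplied via a "text" column alongside the image.
// "<OD>" requests plain object detection; grounding-style tasks append their extra
// input after the token, e.g. "<CAPTION_TO_PHRASE_GROUNDING>a cat on a couch".
val detectionDF = imageDF.withColumn("text", lit("<OD>"))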
Example Usage
import com.johnsnowlabs.nlp.base.ImageAssembler
import com.johnsnowlabs.nlp.annotators.cv.Florence2Transformer
import org.apache.spark.ml.Pipeline

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val florence2 = Florence2Transformer.pretrained("florence2_base_ft_int4")
  .setInputCols("image_assembler")
  .setOutputCol("answer")
  .setMaxOutputLength(50)

val pipeline = new Pipeline().setStages(Array(imageAssembler, florence2))

val data = Seq("/path/to/image.jpg").toDF("image")
val result = pipeline.fit(data).transform(data)
result.select("answer.result").show(truncate = false)
References
- Florence-2 technical report: https://arxiv.org/abs/2311.06242
- Hugging Face model card: https://huggingface.co/microsoft/Florence-2-base-ft
- Official sample notebook: https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb
For more details and advanced usage, see the official documentation and sample notebooks.
- class Gemma3ForMultiModal extends AnnotatorModel[Gemma3ForMultiModal] with HasBatchedAnnotateImage[Gemma3ForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
Gemma3ForMultiModal can load Gemma3 Vision models for visual question answering. The model consists of a vision encoder, a text encoder, a text decoder and a model merger. The vision encoder will encode the input image, the text encoder will encode the input text, the model merger will merge the image and text embeddings, and the text decoder will output the answer.
Gemma 3 is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Key features include:
- Large 128K context window
- Multilingual support in over 140 languages
- Multimodal capabilities handling both text and image inputs
- Optimized for deployment on limited resources (laptops, desktops, cloud)
Pretrained models can be loaded with pretrained of the companion object:

val visualQA = Gemma3ForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

The default model is "gemma3_4b_it_int4", if no name is provided. For available pretrained models please see the Models Hub.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF = spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load(imageFolder)

val testDF = imageDF.withColumn("text", lit("<bos><start_of_turn>user\nYou are a helpful assistant.\n\n<start_of_image>Describe this image in detail.<end_of_turn>\n<start_of_turn>model\n"))

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val visualQA = Gemma3ForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(
  imageAssembler,
  visualQA
))

val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(truncate = false)
- trait HasRescaleFactor extends AnyRef
Enables parameters to handle rescaling for image pre-processors.
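A minimal sketch of tuning these parameters on an annotator that mixes in the trait; the setDoRescale and setRescaleFactor setter names are assumed from the usual Spark NLP parameter conventions and should be verified against the concrete annotator:

// Assumed setters from HasRescaleFactor: setDoRescale and setRescaleFactor.
// 1 / 255 maps 8-bit pixel values into the [0, 1] range most vision models expect.
val imageClassifier = ConvNextForImageClassification.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("class")
  .setDoRescale(true)
  .setRescaleFactor(1 / 255.0d)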
- class InternVLForMultiModal extends AnnotatorModel[InternVLForMultiModal] with HasBatchedAnnotateImage[InternVLForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
InternVLForMultiModal can load InternVL Vision models for visual question answering. The model consists of a vision encoder, a text encoder, a text decoder and a model merger. The vision encoder will encode the input image, the text encoder will encode the input text, the model merger will merge the image and text embeddings, and the text decoder will output the answer.
InternVL 2.5 is an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. Key features include:
- Large context window support
- Multilingual support
- Multimodal capabilities handling both text and image inputs
- Optimized for deployment with int4 quantization
Pretrained models can be loaded with pretrained of the companion object:

val visualQA = InternVLForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

The default model is "internvl2_5_1b_int4", if no name is provided. For available pretrained models please see the Models Hub.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF = spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load(imageFolder)

val testDF = imageDF.withColumn("text", lit("<|im_start|><image>\nDescribe this image in detail.<|im_end|><|im_start|>assistant\n"))

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val visualQA = InternVLForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(
  imageAssembler,
  visualQA
))

val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(truncate = false)
- class JanusForMultiModal extends AnnotatorModel[JanusForMultiModal] with HasBatchedAnnotateImage[JanusForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
JanusForMultiModal can load Janus models for unified multimodal understanding and generation. The model consists of a vision encoder, a text encoder, and a text decoder. Janus decouples visual encoding for enhanced flexibility, leveraging a unified transformer architecture for both understanding and generation tasks.
Janus uses SigLIP-L as the vision encoder, supporting 384 x 384 image inputs. For image generation, it utilizes a tokenizer with a downsample rate of 16. The framework is based on DeepSeek-LLM-1.3b-base, trained on approximately 500B text tokens.
Pretrained models can be loaded with pretrained from the companion object:

val visualQA = JanusForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

The default model is "janus_1_3b_int4", if no name is provided. For available pretrained models, please refer to the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. For compatibility details and import instructions, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For extended examples, refer to https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/JanusForMultiModalTest.scala.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = ResourceHelper.spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load(imageFolder)

val testDF: DataFrame = imageDF.withColumn("text", lit("User: <image_placeholder>Describe image in details Assistant:"))

val imageAssembler: ImageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val visualQAClassifier = JanusForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(
  imageAssembler,
  visualQAClassifier
))

val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(false)

origin: [file:///content/images/cat_image.jpg]
result: [The unusual aspect of this picture is the presence of two cats lying on a pink couch.]
- See also
CLIPForZeroShotClassification for Zero Shot Image Classification
Annotators Main Page for a list of transformer-based classifiers
- class LLAVAForMultiModal extends AnnotatorModel[LLAVAForMultiModal] with HasBatchedAnnotateImage[LLAVAForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
LLAVAForMultiModal can load LLAVA Vision models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.
Pretrained models can be loaded with pretrained of the companion object:

val visualQA = LLAVAForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

The default model is "llava_1_5_7b_hf", if no name is provided. For available pretrained models please see the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/LLAVAForMultiModalTest.scala.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = ResourceHelper.spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load(imageFolder)

val testDF: DataFrame = imageDF.withColumn("text", lit("USER: \n <|image|> \nWhat is unusual on this picture? \n ASSISTANT:\n"))

val imageAssembler: ImageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val visualQAClassifier = LLAVAForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(
  imageAssembler,
  visualQAClassifier
))

val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(false)

+--------------------------------------+------+
|origin |result|
+--------------------------------------+------+
|[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
+--------------------------------------+------+
- See also
CLIPForZeroShotClassification for Zero Shot Image Classifier
Annotators Main Page for a list of transformer based classifiers
- class MLLamaForMultimodal extends AnnotatorModel[MLLamaForMultimodal] with HasBatchedAnnotateImage[MLLamaForMultimodal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
MLLamaForMultimodal can load LLAMA 3.2 Vision models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
Pretrained models can be loaded with pretrained of the companion object:

val visualQA = MLLamaForMultimodal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

The default model is "llama_3_2_11b_vision_instruct_int4", if no name is provided. For available pretrained models please see the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/MLLamaForMultimodalTest.scala.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = ResourceHelper.spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load(imageFolder)

val testDF: DataFrame = imageDF.withColumn("text", lit("<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>What is unusual on this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"))

val imageAssembler: ImageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val visualQAClassifier = MLLamaForMultimodal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(
  imageAssembler,
  visualQAClassifier
))

val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(false)

+--------------------------------------+------+
|origin |result|
+--------------------------------------+------+
|[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
+--------------------------------------+------+
- See also
CLIPForZeroShotClassification for Zero Shot Image Classifier
Annotators Main Page for a list of transformer based classifiers
- class PaliGemmaForMultiModal extends AnnotatorModel[PaliGemmaForMultiModal] with HasBatchedAnnotateImage[PaliGemmaForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
PaliGemmaForMultiModal can load PaliGemma Vision models for visual question answering. The model consists of a vision encoder, a text encoder, a text decoder and a model merger. The vision encoder will encode the input image, the text encoder will encode the input text, the model merger will merge the image and text embeddings, and the text decoder will output the answer.
Pretrained models can be loaded with pretrained of the companion object:

val visualQA = PaliGemmaForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

The default model is "paligemma_3b_pt_224_int4", if no name is provided. For available pretrained models please see the Models Hub.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = ResourceHelper.spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load(imageFolder)

val testDF: DataFrame = imageDF.withColumn("text", lit("USER: \n <|image|> \nWhat is unusual on this picture? \n ASSISTANT:\n"))

val imageAssembler: ImageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val visualQAClassifier = PaliGemmaForMultiModal.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(
  imageAssembler,
  visualQAClassifier
))

val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(false)
- class Phi3Vision extends AnnotatorModel[Phi3Vision] with HasBatchedAnnotateImage[Phi3Vision] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
Phi3Vision can load Phi3 Vision models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.
Pretrained models can be loaded with pretrained of the companion object:

val visualQA = Phi3Vision.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

The default model is "phi_3_vision_128k_instruct", if no name is provided. For available pretrained models please see the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/Phi3VisionTest.scala.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = ResourceHelper.spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load(imageFolder)

val testDF: DataFrame = imageDF.withColumn("text", lit("<|user|> \n <|image_1|> \nWhat is unusual on this picture? <|end|>\n <|assistant|>\n"))

val imageAssembler: ImageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val visualQAClassifier = Phi3Vision.pretrained("phi_3_vision_128k_instruct", "en")
  .setInputCols("image_assembler")
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(
  imageAssembler,
  visualQAClassifier
))

val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(false)

+--------------------------------------+------+
|origin |result|
+--------------------------------------+------+
|[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
+--------------------------------------+------+
- See also
CLIPForZeroShotClassification for Zero Shot Image Classifier
Annotators Main Page for a list of transformer based classifiers
- class Qwen2VLTransformer extends AnnotatorModel[Qwen2VLTransformer] with HasBatchedAnnotateImage[Qwen2VLTransformer] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
Qwen2VLTransformer can load Qwen2 Vision-Language models for visual question answering and multimodal instruction following. The model consists of a vision encoder, a text encoder, and a text decoder. The vision encoder processes the input image, the text encoder integrates the encoding of the image with the input text, and the text decoder outputs the response to the query or instruction.
Pretrained models can be loaded with pretrained of the companion object:

val visualQA = Qwen2VLTransformer.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

The default model is "qwen2_vl_2b_instruct_int4", if no name is provided. For available pretrained models, please see the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. To explore more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/Qwen2VLTransformerTest.scala.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = ResourceHelper.spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load(imageFolder)

val testDF: DataFrame = imageDF.withColumn("text", lit("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n"))

val imageAssembler: ImageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val visualQAClassifier = Qwen2VLTransformer.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(
  imageAssembler,
  visualQAClassifier
))

val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(false)

+--------------------------------------+------+
|origin |result|
+--------------------------------------+------+
|[file:///content/images/cat_image.jpg]|[This image is unusual because it features two cats lying on a pink couch.]|
+--------------------------------------+------+
- See also
Annotators Main Page for a list of transformer-based classifiers
- trait ReadBLIPForQuestionAnsweringDLModel extends ReadTensorflowModel
- trait ReadCLIPForZeroShotClassificationModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
- trait ReadConvNextForImageDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
- trait ReadFlorence2TransformerDLModel extends ReadOpenvinoModel
- trait ReadGemma3ForMultiModalDLModel extends ReadOpenvinoModel
- trait ReadInternVLForMultiModalDLModel extends ReadOpenvinoModel
- trait ReadJanusForMultiModalDLModel extends ReadOpenvinoModel
- trait ReadLLAVAForMultiModalDLModel extends ReadOpenvinoModel
- trait ReadMLLamaForMultimodalDLModel extends ReadOpenvinoModel
- trait ReadPaliGemmaForMultiModalDLModel extends ReadOpenvinoModel
- trait ReadPhi3VisionDLModel extends ReadOpenvinoModel
- trait ReadQwen2VLTransformerDLModel extends ReadOpenvinoModel
- trait ReadSmolVLMTransformerDLModel extends ReadOpenvinoModel
- trait ReadSwinForImageDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
- trait ReadViTForImageDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
- trait ReadVisionEncoderDecoderDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
- trait ReadablePretrainedBLIPForQuestionAnswering extends ParamsAndFeaturesReadable[BLIPForQuestionAnswering] with HasPretrained[BLIPForQuestionAnswering]
- trait ReadablePretrainedCLIPForZeroShotClassificationModel extends ParamsAndFeaturesReadable[CLIPForZeroShotClassification] with HasPretrained[CLIPForZeroShotClassification]
- trait ReadablePretrainedConvNextForImageModel extends ParamsAndFeaturesReadable[ConvNextForImageClassification] with HasPretrained[ConvNextForImageClassification]
- trait ReadablePretrainedFlorence2TransformerModel extends ParamsAndFeaturesReadable[Florence2Transformer] with HasPretrained[Florence2Transformer]
- trait ReadablePretrainedGemma3ForMultiModal extends ParamsAndFeaturesReadable[Gemma3ForMultiModal] with HasPretrained[Gemma3ForMultiModal]
- trait ReadablePretrainedInternVLForMultiModal extends ParamsAndFeaturesReadable[InternVLForMultiModal] with HasPretrained[InternVLForMultiModal]
- trait ReadablePretrainedJanusForMultiModal extends ParamsAndFeaturesReadable[JanusForMultiModal] with HasPretrained[JanusForMultiModal]
- trait ReadablePretrainedLLAVAForMultiModal extends ParamsAndFeaturesReadable[LLAVAForMultiModal] with HasPretrained[LLAVAForMultiModal]
- trait ReadablePretrainedMLLamaForMultimodal extends ParamsAndFeaturesReadable[MLLamaForMultimodal] with HasPretrained[MLLamaForMultimodal]
- trait ReadablePretrainedPaliGemmaForMultiModal extends ParamsAndFeaturesReadable[PaliGemmaForMultiModal] with HasPretrained[PaliGemmaForMultiModal]
- trait ReadablePretrainedPhi3Vision extends ParamsAndFeaturesReadable[Phi3Vision] with HasPretrained[Phi3Vision]
- trait ReadablePretrainedQwen2VLTransformer extends ParamsAndFeaturesReadable[Qwen2VLTransformer] with HasPretrained[Qwen2VLTransformer]
- trait ReadablePretrainedSmolVLMTransformer extends ParamsAndFeaturesReadable[SmolVLMTransformer] with HasPretrained[SmolVLMTransformer]
- trait ReadablePretrainedSwinForImageModel extends ParamsAndFeaturesReadable[SwinForImageClassification] with HasPretrained[SwinForImageClassification]
- trait ReadablePretrainedViTForImageModel extends ParamsAndFeaturesReadable[ViTForImageClassification] with HasPretrained[ViTForImageClassification]
- trait ReadablePretrainedVisionEncoderDecoderModel extends ParamsAndFeaturesReadable[VisionEncoderDecoderForImageCaptioning] with HasPretrained[VisionEncoderDecoderForImageCaptioning]
- class SmolVLMTransformer extends AnnotatorModel[SmolVLMTransformer] with HasBatchedAnnotateImage[SmolVLMTransformer] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
SmolVLMTransformer can load SmolVLM models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.
SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs to produce text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded on multiple images, or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks.
Pretrained models can be loaded with pretrained of the companion object:

val visualQA = SmolVLMTransformer.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

The default model is "smolvlm_instruct_int4", if no name is provided. For available pretrained models please see the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/SmolVLMTransformerTest.scala.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = ResourceHelper.spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load(imageFolder)

val testDF: DataFrame = imageDF.withColumn("text", lit("<|im_start|>User:<image>Can you describe the image?<end_of_utterance>\nAssistant:"))

val imageAssembler: ImageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val visualQAClassifier = SmolVLMTransformer.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(
  imageAssembler,
  visualQAClassifier
))

val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(false)

+--------------------------------------+------+
|origin |result|
+--------------------------------------+------+
|[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
+--------------------------------------+------+
- See also
CLIPForZeroShotClassification for Zero Shot Image Classifier
Annotators Main Page for a list of transformer based classifiers
- class SwinForImageClassification extends ViTForImageClassification with HasRescaleFactor
SwinForImageClassification is an image classifier based on Swin.
The Swin Transformer was proposed in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
It is basically a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
Pretrained models can be loaded with pretrained of the companion object:

val imageClassifier = SwinForImageClassification.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("class")

The default model is "image_classifier_swin_base_patch4_window7_224", if no name is provided. For available pretrained models please see the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see SwinForImageClassificationTest.
References:
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Paper Abstract:
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.
Example
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.ImageAssembler
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load("src/test/resources/image/")

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val imageClassifier = SwinForImageClassification
  .pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineDF = pipeline.fit(imageDF).transform(imageDF)

pipelineDF
  .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result")
  .show(truncate = false)

+-----------------+----------------------------------------------------------+
|image_name |result |
+-----------------+----------------------------------------------------------+
|palace.JPEG |[palace] |
|egyptian_cat.jpeg|[tabby, tabby cat] |
|hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
|hen.JPEG |[hen] |
|ostrich.JPEG |[ostrich, Struthio camelus] |
|junco.JPEG |[junco, snowbird] |
|bluetick.jpg |[bluetick] |
|chihuahua.jpg |[Chihuahua] |
|tractor.JPEG |[tractor] |
|ox.JPEG |[ox] |
+-----------------+----------------------------------------------------------+
- class ViTForImageClassification extends AnnotatorModel[ViTForImageClassification] with HasBatchedAnnotateImage[ViTForImageClassification] with HasImageFeatureProperties with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine
Vision Transformer (ViT) for image classification.
ViT is a transformer-based alternative to the convolutional neural networks usually used for image recognition tasks.
Pretrained models can be loaded with pretrained of the companion object:

val imageClassifier = ViTForImageClassification.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("class")

The default model is "image_classifier_vit_base_patch16_224", if no name is provided. For available pretrained models please see the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see ViTImageClassificationTestSpec.
References:
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Paper Abstract:
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Example
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.ImageAssembler
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load("src/test/resources/image/")

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val imageClassifier = ViTForImageClassification
  .pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineDF = pipeline.fit(imageDF).transform(imageDF)

pipelineDF
  .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result")
  .show(truncate = false)

+-----------------+----------------------------------------------------------+
|image_name |result |
+-----------------+----------------------------------------------------------+
|palace.JPEG |[palace] |
|egyptian_cat.jpeg|[Egyptian cat] |
|hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
|hen.JPEG |[hen] |
|ostrich.JPEG |[ostrich, Struthio camelus] |
|junco.JPEG |[junco, snowbird] |
|bluetick.jpg |[bluetick] |
|chihuahua.jpg |[Chihuahua] |
|tractor.JPEG |[tractor] |
|ox.JPEG |[ox] |
+-----------------+----------------------------------------------------------+
- class VisionEncoderDecoderForImageCaptioning extends AnnotatorModel[VisionEncoderDecoderForImageCaptioning] with HasBatchedAnnotateImage[VisionEncoderDecoderForImageCaptioning] with HasImageFeatureProperties with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine with HasRescaleFactor with HasGeneratorProperties
VisionEncoderDecoder model that converts images into text captions. It allows for the use of pretrained vision auto-encoding models, such as ViT, BEiT, or DeiT as the encoder, in combination with pretrained language models, like RoBERTa, GPT2, or BERT as the decoder.
Pretrained models can be loaded with pretrained of the companion object:

val imageClassifier = VisionEncoderDecoderForImageCaptioning.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("caption")

The default model is "image_captioning_vit_gpt2", if no name is provided. For available pretrained models please see the Models Hub.
Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see VisionEncoderDecoderTestSpec.
Note:
This is a very computationally expensive module, especially with larger batch sizes. The use of an accelerator such as a GPU is recommended.
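Since the annotator also mixes in HasBatchedAnnotateImage, one practical lever is the batch size; a small sketch, assuming the standard setBatchSize setter from the batched-annotate trait:

// Smaller batches reduce per-batch memory pressure; larger batches improve
// accelerator utilisation. setBatchSize is assumed to come from the batched-annotate trait.
val imageCaptioning = VisionEncoderDecoderForImageCaptioning.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("caption")
  .setBatchSize(4)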
Example
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.ImageAssembler
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load("src/test/resources/image/")

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val imageCaptioning = VisionEncoderDecoderForImageCaptioning
  .pretrained()
  .setBeamSize(2)
  .setDoSample(false)
  .setInputCols("image_assembler")
  .setOutputCol("caption")

val pipeline = new Pipeline().setStages(Array(imageAssembler, imageCaptioning))
val pipelineDF = pipeline.fit(imageDF).transform(imageDF)

pipelineDF
  .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "caption.result")
  .show(truncate = false)

+-----------------+---------------------------------------------------------+
|image_name |result |
+-----------------+---------------------------------------------------------+
|palace.JPEG |[a large room filled with furniture and a large window] |
|egyptian_cat.jpeg|[a cat laying on a couch next to another cat] |
|hippopotamus.JPEG|[a brown bear in a body of water] |
|hen.JPEG |[a flock of chickens standing next to each other] |
|ostrich.JPEG |[a large bird standing on top of a lush green field] |
|junco.JPEG |[a small bird standing on a wet ground] |
|bluetick.jpg |[a small dog standing on a wooden floor] |
|chihuahua.jpg |[a small brown dog wearing a blue sweater] |
|tractor.JPEG |[a man is standing in a field with a tractor] |
|ox.JPEG |[a large brown cow standing on top of a lush green field]|
+-----------------+---------------------------------------------------------+
Value Members
- object BLIPForQuestionAnswering extends ReadablePretrainedBLIPForQuestionAnswering with ReadBLIPForQuestionAnsweringDLModel with Serializable
- object CLIPForZeroShotClassification extends ReadablePretrainedCLIPForZeroShotClassificationModel with ReadCLIPForZeroShotClassificationModel with Serializable
This is the companion object of CLIPForZeroShotClassification. Please refer to that class for the documentation.
- object ConvNextForImageClassification extends ReadablePretrainedConvNextForImageModel with ReadConvNextForImageDLModel with Serializable
This is the companion object of ConvNextForImageClassification. Please refer to that class for the documentation.
- object Florence2Transformer extends ReadablePretrainedFlorence2TransformerModel with ReadFlorence2TransformerDLModel with Serializable
- object Gemma3ForMultiModal extends ReadablePretrainedGemma3ForMultiModal with ReadGemma3ForMultiModalDLModel with Serializable
- object InternVLForMultiModal extends ReadablePretrainedInternVLForMultiModal with ReadInternVLForMultiModalDLModel with Serializable
- object JanusForMultiModal extends ReadablePretrainedJanusForMultiModal with ReadJanusForMultiModalDLModel with Serializable
- object LLAVAForMultiModal extends ReadablePretrainedLLAVAForMultiModal with ReadLLAVAForMultiModalDLModel with Serializable
- object MLLamaForMultimodal extends ReadablePretrainedMLLamaForMultimodal with ReadMLLamaForMultimodalDLModel with Serializable
- object PaliGemmaForMultiModal extends ReadablePretrainedPaliGemmaForMultiModal with ReadPaliGemmaForMultiModalDLModel with Serializable
- object Phi3Vision extends ReadablePretrainedPhi3Vision with ReadPhi3VisionDLModel with Serializable
- object Qwen2VLTransformer extends ReadablePretrainedQwen2VLTransformer with ReadQwen2VLTransformerDLModel with Serializable
- object SmolVLMTransformer extends ReadablePretrainedSmolVLMTransformer with ReadSmolVLMTransformerDLModel with Serializable
- object SwinForImageClassification extends ReadablePretrainedSwinForImageModel with ReadSwinForImageDLModel with Serializable
This is the companion object of SwinForImageClassification. Please refer to that class for the documentation.
- object ViTForImageClassification extends ReadablePretrainedViTForImageModel with ReadViTForImageDLModel with Serializable
This is the companion object of ViTForImageClassification. Please refer to that class for the documentation.
- object VisionEncoderDecoderForImageCaptioning extends ReadablePretrainedVisionEncoderDecoderModel with ReadVisionEncoderDecoderDLModel with Serializable
This is the companion object of VisionEncoderDecoderForImageCaptioning. Please refer to that class for the documentation.