Packages

package cv


Type Members

  1. class BLIPForQuestionAnswering extends AnnotatorModel[BLIPForQuestionAnswering] with HasBatchedAnnotateImage[BLIPForQuestionAnswering] with HasImageFeatureProperties with WriteTensorflowModel with HasEngine

    BLIPForQuestionAnswering can load BLIP models for visual question answering.

    BLIPForQuestionAnswering can load BLIP models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQAClassifier = BLIPForQuestionAnswering.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "blip_vqa_base", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/BLIPForQuestionAnsweringTest.scala.
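
    For example, a model exported from the Transformers library can be imported with the companion object's loadSavedModel helper and then saved in Spark NLP format, roughly as in the sketch below (the export path is hypothetical and the exact export steps are described in the linked discussion):

    import com.johnsnowlabs.nlp.annotator._

    // Hypothetical folder containing a BLIP model exported for Spark NLP
    val exportedModelPath = "/tmp/blip_vqa_base_exported"

    val importedVQA = BLIPForQuestionAnswering
      .loadSavedModel(exportedModelPath, spark)
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    // Persist in Spark NLP format so it can later be restored with BLIPForQuestionAnswering.load(...)
    importedVQA.write.overwrite().save("/tmp/blip_vqa_base_spark_nlp")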

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("What's this picture about?"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = BLIPForQuestionAnswering.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+------+
    |origin                                |result|
    +--------------------------------------+------+
    |[file:///content/images/cat_image.jpg]|[cats]|
    +--------------------------------------+------+
    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  2. class CLIPForZeroShotClassification extends AnnotatorModel[CLIPForZeroShotClassification] with HasBatchedAnnotateImage[CLIPForZeroShotClassification] with HasImageFeatureProperties with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine with HasRescaleFactor

    Zero Shot Image Classifier based on CLIP.

    Zero Shot Image Classifier based on CLIP.

    CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on image and text pairs. It can classify images without being trained on any hard-coded labels, which makes it very flexible: candidate labels can be provided during inference. This is similar to the zero-shot capabilities of GPT-2 and GPT-3.

    Pretrained models can be loaded with pretrained of the companion object:

    val imageClassifier = CLIPForZeroShotClassification.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("label")

    The default model is "zero_shot_classifier_clip_vit_base_patch32", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see CLIPForZeroShotClassificationTestSpec.

    Example

    import com.johnsnowlabs.nlp.ImageAssembler
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF = ResourceHelper.spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("src/test/resources/image/")
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val candidateLabels = Array(
      "a photo of a bird",
      "a photo of a cat",
      "a photo of a dog",
      "a photo of a hen",
      "a photo of a hippo",
      "a photo of a room",
      "a photo of a tractor",
      "a photo of an ostrich",
      "a photo of an ox")
    
    val imageClassifier = CLIPForZeroShotClassification
      .pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("label")
      .setCandidateLabels(candidateLabels)
    
    val pipeline =
      new Pipeline().setStages(Array(imageAssembler, imageClassifier)).fit(imageDF).transform(imageDF)
    
    pipeline
      .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "label.result")
      .show(truncate = false)
    +-----------------+-----------------------+
    |image_name       |result                 |
    +-----------------+-----------------------+
    |palace.JPEG      |[a photo of a room]    |
    |egyptian_cat.jpeg|[a photo of a cat]     |
    |hippopotamus.JPEG|[a photo of a hippo]   |
    |hen.JPEG         |[a photo of a hen]     |
    |ostrich.JPEG     |[a photo of an ostrich]|
    |junco.JPEG       |[a photo of a bird]    |
    |bluetick.jpg     |[a photo of a dog]     |
    |chihuahua.jpg    |[a photo of a dog]     |
    |tractor.JPEG     |[a photo of a tractor] |
    |ox.JPEG          |[a photo of an ox]     |
    +-----------------+-----------------------+
  3. class ConvNextForImageClassification extends SwinForImageClassification

    ConvNextForImageClassification is an image classifier based on ConvNeXT models.

    ConvNextForImageClassification is an image classifier based on ConvNeXT models.

    The ConvNeXT model was proposed in A ConvNet for the 2020s by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. ConvNeXT is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, that claims to outperform them.

    Pretrained models can be loaded with pretrained of the companion object:

    val imageClassifier = ConvNextForImageClassification.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")

    The default model is "image_classifier_convnext_tiny_224_local", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see ConvNextForImageClassificationTestSpec.

    References:

    A ConvNet for the 2020s

    Paper Abstract:

    The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.ImageAssembler
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("src/test/resources/image/")
    
    val imageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val imageClassifier = ConvNextForImageClassification
      .pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")
    
    val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
    val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
    
    pipelineDF
      .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result")
      .show(truncate = false)
    +-----------------+----------------------------------------------------------+
    |image_name       |result                                                    |
    +-----------------+----------------------------------------------------------+
    |palace.JPEG      |[palace]                                                  |
    |egyptian_cat.jpeg|[tabby, tabby cat]                                        |
    |hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
    |hen.JPEG         |[hen]                                                     |
    |ostrich.JPEG     |[ostrich, Struthio camelus]                               |
    |junco.JPEG       |[junco, snowbird]                                         |
    |bluetick.jpg     |[bluetick]                                                |
    |chihuahua.jpg    |[Chihuahua]                                               |
    |tractor.JPEG     |[tractor]                                                 |
    |ox.JPEG          |[ox]                                                      |
    +-----------------+----------------------------------------------------------+
  4. class Gemma3ForMultiModal extends AnnotatorModel[Gemma3ForMultiModal] with HasBatchedAnnotateImage[Gemma3ForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    Gemma3ForMultiModal can load Gemma3 Vision models for visual question answering.

    Gemma3ForMultiModal can load Gemma3 Vision models for visual question answering. The model consists of a vision encoder, a text encoder, a text decoder and a model merger. The vision encoder will encode the input image, the text encoder will encode the input text, the model merger will merge the image and text embeddings, and the text decoder will output the answer.

    Gemma 3 is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Key features include:

    • Large 128K context window
    • Multilingual support in over 140 languages
    • Multimodal capabilities handling both text and image inputs
    • Optimized for deployment on limited resources (laptops, desktops, cloud)

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = Gemma3ForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "gemma3_4b_it_int4", if no name is provided.

    For available pretrained models please see the Models Hub.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load(imageFolder)
    
    val testDF = imageDF.withColumn("text", lit("<bos><start_of_turn>user\nYou are a helpful assistant.\n\n<start_of_image>Describe this image in detail.<end_of_turn>\n<start_of_turn>model\n"))
    
    val imageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQA = Gemma3ForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQA
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(truncate = false)
  5. trait HasRescaleFactor extends AnyRef

    Enables parameters to handle rescaling for image pre-processors.
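
    A minimal sketch of how these parameters appear on an annotator that mixes in the trait, for example SwinForImageClassification (the values shown are illustrative, not the model's guaranteed defaults):

    import com.johnsnowlabs.nlp.annotator._

    val imageClassifier = SwinForImageClassification
      .pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")
      .setDoRescale(true)          // enable pixel rescaling before inference
      .setRescaleFactor(1 / 255.0) // scale 0-255 pixel values into the [0, 1] range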

  6. class JanusForMultiModal extends AnnotatorModel[JanusForMultiModal] with HasBatchedAnnotateImage[JanusForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    JanusForMultiModal can load Janus models for unified multimodal understanding and generation.

    JanusForMultiModal can load Janus models for unified multimodal understanding and generation. The model consists of a vision encoder, a text encoder, and a text decoder. Janus decouples visual encoding for enhanced flexibility, leveraging a unified transformer architecture for both understanding and generation tasks.

    Janus uses SigLIP-L as the vision encoder, supporting 384 x 384 image inputs. For image generation, it utilizes a tokenizer with a downsample rate of 16. The framework is based on DeepSeek-LLM-1.3b-base, trained on approximately 500B text tokens.

    Pretrained models can be loaded with pretrained from the companion object:

    val visualQA = JanusForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "janus_1_3b_int4" if no name is provided.

    For available pretrained models, please refer to the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. For compatibility details and import instructions, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For extended examples, refer to https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/JanusForMultiModalTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline

    val imageDF: DataFrame = ResourceHelper.spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load(imageFolder)

    val testDF: DataFrame = imageDF.withColumn("text", lit("User: <image_placeholder>Describe image in details Assistant:"))

    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")

    val visualQAClassifier = JanusForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))

    val result = pipeline.fit(testDF).transform(testDF)

    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+----------------------------------------------------------------------------------------+
    |origin                                |result                                                                                  |
    +--------------------------------------+----------------------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch.]|
    +--------------------------------------+----------------------------------------------------------------------------------------+

    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  7. class LLAVAForMultiModal extends AnnotatorModel[LLAVAForMultiModal] with HasBatchedAnnotateImage[LLAVAForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    LLAVAForMultiModal can load LLAVA Vision models for visual question answering.

    LLAVAForMultiModal can load LLAVA Vision models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = LLAVAForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "llava_1_5_7b_hf", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/LLAVAForMultiModalTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("USER: \n <|image|> \nWhat is unusual on this picture? \n ASSISTANT:\n"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = LLAVAForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |origin                                |result                                                                                 |
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
    +--------------------------------------+---------------------------------------------------------------------------------------+
    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  8. class MLLamaForMultimodal extends AnnotatorModel[MLLamaForMultimodal] with HasBatchedAnnotateImage[MLLamaForMultimodal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    MLLamaForMultimodal can load LLAMA 3.2 Vision models for visual question answering.

    MLLamaForMultimodal can load LLAMA 3.2 Vision models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.

    The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = MLLamaForMultimodal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "llama_3_2_11b_vision_instruct_int4", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/MLLamaForMultimodalTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>What is unusual on this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = MLLamaForMultimodal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |origin                                |result                                                                                 |
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
    +--------------------------------------+---------------------------------------------------------------------------------------+
    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  9. class PaliGemmaForMultiModal extends AnnotatorModel[PaliGemmaForMultiModal] with HasBatchedAnnotateImage[PaliGemmaForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    PaliGemmaForMultiModal can load PaliGemma Vision models for visual question answering.

    PaliGemmaForMultiModal can load PaliGemma Vision models for visual question answering. The model consists of a vision encoder, a text encoder, a text decoder and a model merger. The vision encoder will encode the input image, the text encoder will encode the input text, the model merger will merge the image and text embeddings, and the text decoder will output the answer.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = PaliGemmaForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "paligemma_3b_pt_224_int4", if no name is provided.

    For available pretrained models please see the Models Hub.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("USER: \n <|image|> \nWhat is unusual on this picture? \n ASSISTANT:\n"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = PaliGemmaForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
  10. class Phi3Vision extends AnnotatorModel[Phi3Vision] with HasBatchedAnnotateImage[Phi3Vision] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    Phi3Vision can load Phi3 Vision models for visual question answering.

    Phi3Vision can load Phi3 Vision models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = Phi3Vision.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "phi_3_vision_128k_instruct", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/Phi3VisionTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("<|user|> \n <|image_1|> \nWhat is unusual on this picture? <|end|>\n <|assistant|>\n"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = Phi3Vision.pretrained("phi_3_vision_128k_instruct","en")
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |origin                                |result                                                                                 |
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
    +--------------------------------------+---------------------------------------------------------------------------------------+
    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  11. class Qwen2VLTransformer extends AnnotatorModel[Qwen2VLTransformer] with HasBatchedAnnotateImage[Qwen2VLTransformer] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    Qwen2VLTransformer can load Qwen2 Vision-Language models for visual question answering and multimodal instruction following.

    Qwen2VLTransformer can load Qwen2 Vision-Language models for visual question answering and multimodal instruction following. The model consists of a vision encoder, a text encoder, and a text decoder. The vision encoder processes the input image, the text encoder integrates the encoding of the image with the input text, and the text decoder outputs the response to the query or instruction.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = Qwen2VLTransformer.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "qwen2_vl_2b_instruct_int4", if no name is provided.

    For available pretrained models, please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. To explore more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/Qwen2VLTransformerTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = Qwen2VLTransformer.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+----------------------------------------------------------------------------+
    |origin                                |result                                                                      |
    +--------------------------------------+----------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[This image is unusual because it features two cats lying on a pink couch.]|
    +--------------------------------------+----------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer-based classifiers

  12. trait ReadBLIPForQuestionAnsweringDLModel extends ReadTensorflowModel
  13. trait ReadCLIPForZeroShotClassificationModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
  14. trait ReadConvNextForImageDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
  15. trait ReadGemma3ForMultiModalDLModel extends ReadOpenvinoModel
  16. trait ReadJanusForMultiModalDLModel extends ReadOpenvinoModel
  17. trait ReadLLAVAForMultiModalDLModel extends ReadOpenvinoModel
  18. trait ReadMLLamaForMultimodalDLModel extends ReadOpenvinoModel
  19. trait ReadPaliGemmaForMultiModalDLModel extends ReadOpenvinoModel
  20. trait ReadPhi3VisionDLModel extends ReadOpenvinoModel
  21. trait ReadQwen2VLTransformerDLModel extends ReadOpenvinoModel
  22. trait ReadSmolVLMTransformerDLModel extends ReadOpenvinoModel
  23. trait ReadSwinForImageDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
  24. trait ReadViTForImageDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
  25. trait ReadVisionEncoderDecoderDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
  26. trait ReadablePretrainedBLIPForQuestionAnswering extends ParamsAndFeaturesReadable[BLIPForQuestionAnswering] with HasPretrained[BLIPForQuestionAnswering]
  27. trait ReadablePretrainedCLIPForZeroShotClassificationModel extends ParamsAndFeaturesReadable[CLIPForZeroShotClassification] with HasPretrained[CLIPForZeroShotClassification]
  28. trait ReadablePretrainedConvNextForImageModel extends ParamsAndFeaturesReadable[ConvNextForImageClassification] with HasPretrained[ConvNextForImageClassification]
  29. trait ReadablePretrainedGemma3ForMultiModal extends ParamsAndFeaturesReadable[Gemma3ForMultiModal] with HasPretrained[Gemma3ForMultiModal]
  30. trait ReadablePretrainedJanusForMultiModal extends ParamsAndFeaturesReadable[JanusForMultiModal] with HasPretrained[JanusForMultiModal]
  31. trait ReadablePretrainedLLAVAForMultiModal extends ParamsAndFeaturesReadable[LLAVAForMultiModal] with HasPretrained[LLAVAForMultiModal]
  32. trait ReadablePretrainedMLLamaForMultimodal extends ParamsAndFeaturesReadable[MLLamaForMultimodal] with HasPretrained[MLLamaForMultimodal]
  33. trait ReadablePretrainedPaliGemmaForMultiModal extends ParamsAndFeaturesReadable[PaliGemmaForMultiModal] with HasPretrained[PaliGemmaForMultiModal]
  34. trait ReadablePretrainedPhi3Vision extends ParamsAndFeaturesReadable[Phi3Vision] with HasPretrained[Phi3Vision]
  35. trait ReadablePretrainedQwen2VLTransformer extends ParamsAndFeaturesReadable[Qwen2VLTransformer] with HasPretrained[Qwen2VLTransformer]
  36. trait ReadablePretrainedSmolVLMTransformer extends ParamsAndFeaturesReadable[SmolVLMTransformer] with HasPretrained[SmolVLMTransformer]
  37. trait ReadablePretrainedSwinForImageModel extends ParamsAndFeaturesReadable[SwinForImageClassification] with HasPretrained[SwinForImageClassification]
  38. trait ReadablePretrainedViTForImageModel extends ParamsAndFeaturesReadable[ViTForImageClassification] with HasPretrained[ViTForImageClassification]
  39. trait ReadablePretrainedVisionEncoderDecoderModel extends ParamsAndFeaturesReadable[VisionEncoderDecoderForImageCaptioning] with HasPretrained[VisionEncoderDecoderForImageCaptioning]
  40. class SmolVLMTransformer extends AnnotatorModel[SmolVLMTransformer] with HasBatchedAnnotateImage[SmolVLMTransformer] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    SmolVLMTransformer can load SmolVLM models for visual question answering.

    SmolVLMTransformer can load SmolVLM models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.

    SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs to produce text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded on multiple images, or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = SmolVLMTransformer.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "smolvlm_instruct_int4", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/SmolVLMTransformerTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("<|im_start|>User:<image>Can you describe the image?<end_of_utterance>\nAssistant:"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = SmolVLMTransformer.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |origin                                |result                                                                                 |
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
    +--------------------------------------+---------------------------------------------------------------------------------------+
    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  41. class SwinForImageClassification extends ViTForImageClassification with HasRescaleFactor

    SwinForImageClassification is an image classifier based on the Swin Transformer.

    SwinForImageClassification is an image classifier based on the Swin Transformer.

    The Swin Transformer was proposed in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

    It is essentially a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.

    Pretrained models can be loaded with pretrained of the companion object:

    val imageClassifier = SwinForImageClassification.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")

    The default model is "image_classifier_swin_base_patch4_window7_224", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see SwinForImageClassificationTest.

    References:

    Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

    Paper Abstract:

    This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.ImageAssembler
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("src/test/resources/image/")
    
    val imageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val imageClassifier = SwinForImageClassification
      .pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")
    
    val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
    val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
    
    pipelineDF
      .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result")
      .show(truncate = false)
    +-----------------+----------------------------------------------------------+
    |image_name       |result                                                    |
    +-----------------+----------------------------------------------------------+
    |palace.JPEG      |[palace]                                                  |
    |egyptian_cat.jpeg|[tabby, tabby cat]                                        |
    |hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
    |hen.JPEG         |[hen]                                                     |
    |ostrich.JPEG     |[ostrich, Struthio camelus]                               |
    |junco.JPEG       |[junco, snowbird]                                         |
    |bluetick.jpg     |[bluetick]                                                |
    |chihuahua.jpg    |[Chihuahua]                                               |
    |tractor.JPEG     |[tractor]                                                 |
    |ox.JPEG          |[ox]                                                      |
    +-----------------+----------------------------------------------------------+
  42. class ViTForImageClassification extends AnnotatorModel[ViTForImageClassification] with HasBatchedAnnotateImage[ViTForImageClassification] with HasImageFeatureProperties with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine

    Vision Transformer (ViT) for image classification.

    Vision Transformer (ViT) for image classification.

    ViT is a transformer-based alternative to the convolutional neural networks usually used for image recognition tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val imageClassifier = ViTForImageClassification.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")

    The default model is "image_classifier_vit_base_patch16_224", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see ViTImageClassificationTestSpec.

    References:

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Paper Abstract:

    While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.ImageAssembler
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("src/test/resources/image/")
    
    val imageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val imageClassifier = ViTForImageClassification
      .pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")
    
    val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
    val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
    
    pipelineDF
      .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result")
      .show(truncate = false)
    +-----------------+----------------------------------------------------------+
    |image_name       |result                                                    |
    +-----------------+----------------------------------------------------------+
    |palace.JPEG      |[palace]                                                  |
    |egyptian_cat.jpeg|[Egyptian cat]                                            |
    |hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
    |hen.JPEG         |[hen]                                                     |
    |ostrich.JPEG     |[ostrich, Struthio camelus]                               |
    |junco.JPEG       |[junco, snowbird]                                         |
    |bluetick.jpg     |[bluetick]                                                |
    |chihuahua.jpg    |[Chihuahua]                                               |
    |tractor.JPEG     |[tractor]                                                 |
    |ox.JPEG          |[ox]                                                      |
    +-----------------+----------------------------------------------------------+
  43. class VisionEncoderDecoderForImageCaptioning extends AnnotatorModel[VisionEncoderDecoderForImageCaptioning] with HasBatchedAnnotateImage[VisionEncoderDecoderForImageCaptioning] with HasImageFeatureProperties with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine with HasRescaleFactor with HasGeneratorProperties

    VisionEncoderDecoder model that converts images into text captions.

    VisionEncoderDecoder model that converts images into text captions. It allows for the use of pretrained vision auto-encoding models, such as ViT, BEiT, or DeiT as the encoder, in combination with pretrained language models, like RoBERTa, GPT2, or BERT as the decoder.

    Pretrained models can be loaded with pretrained of the companion object:

    val imageClassifier = VisionEncoderDecoderForImageCaptioning.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("caption")

    The default model is "image_captioning_vit_gpt2", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see VisionEncoderDecoderTestSpec.

    Note:

    This is a very computationally expensive module, especially with larger batch sizes. The use of an accelerator such as a GPU is recommended.
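
    A batch size suited to the available hardware can be set directly on the annotator, as in the minimal sketch below (the value shown is illustrative only; setBatchSize comes from the batched-annotate interface):

    import com.johnsnowlabs.nlp.annotator._

    // Sketch: lower the per-partition batch size to reduce memory pressure on CPU-only clusters.
    // The optimal value depends on hardware and image resolution; 2 is only an example.
    val captioningOnCpu = VisionEncoderDecoderForImageCaptioning
      .pretrained()
      .setBatchSize(2)
      .setInputCols("image_assembler")
      .setOutputCol("caption")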

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.ImageAssembler
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("src/test/resources/image/")
    
    val imageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val imageCaptioning = VisionEncoderDecoderForImageCaptioning
      .pretrained()
      .setBeamSize(2)
      .setDoSample(false)
      .setInputCols("image_assembler")
      .setOutputCol("caption")
    
    val pipeline = new Pipeline().setStages(Array(imageAssembler, imageCaptioning))
    val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
    
    pipelineDF
      .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "caption.result")
      .show(truncate = false)
    
    +-----------------+---------------------------------------------------------+
    |image_name       |result                                                   |
    +-----------------+---------------------------------------------------------+
    |palace.JPEG      |[a large room filled with furniture and a large window]  |
    |egyptian_cat.jpeg|[a cat laying on a couch next to another cat]            |
    |hippopotamus.JPEG|[a brown bear in a body of water]                        |
    |hen.JPEG         |[a flock of chickens standing next to each other]        |
    |ostrich.JPEG     |[a large bird standing on top of a lush green field]     |
    |junco.JPEG       |[a small bird standing on a wet ground]                  |
    |bluetick.jpg     |[a small dog standing on a wooden floor]                 |
    |chihuahua.jpg    |[a small brown dog wearing a blue sweater]               |
    |tractor.JPEG     |[a man is standing in a field with a tractor]            |
    |ox.JPEG          |[a large brown cow standing on top of a lush green field]|
    +-----------------+---------------------------------------------------------+

Value Members

  1. object BLIPForQuestionAnswering extends ReadablePretrainedBLIPForQuestionAnswering with ReadBLIPForQuestionAnsweringDLModel with Serializable
  2. object CLIPForZeroShotClassification extends ReadablePretrainedCLIPForZeroShotClassificationModel with ReadCLIPForZeroShotClassificationModel with Serializable

    This is the companion object of CLIPForZeroShotClassification.

    This is the companion object of CLIPForZeroShotClassification. Please refer to that class for the documentation.
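
    As a brief, illustrative sketch, the companion object is the entry point for fetching a pretrained model by name and language and for reloading a model saved in Spark NLP format (the paths and language code below are shown for illustration only):

    import com.johnsnowlabs.nlp.annotator._

    // Fetch a specific pretrained model from the Models Hub
    val clipClassifier = CLIPForZeroShotClassification
      .pretrained("zero_shot_classifier_clip_vit_base_patch32", "en")
      .setInputCols("image_assembler")
      .setOutputCol("label")

    // Save in Spark NLP format and restore it later through the companion object
    clipClassifier.write.overwrite().save("/tmp/clip_zero_shot_model")
    val restored = CLIPForZeroShotClassification.load("/tmp/clip_zero_shot_model")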

  3. object ConvNextForImageClassification extends ReadablePretrainedConvNextForImageModel with ReadConvNextForImageDLModel with Serializable

    This is the companion object of ConvNextForImageClassification.

    This is the companion object of ConvNextForImageClassification. Please refer to that class for the documentation.

  4. object Gemma3ForMultiModal extends ReadablePretrainedGemma3ForMultiModal with ReadGemma3ForMultiModalDLModel with Serializable
  5. object JanusForMultiModal extends ReadablePretrainedJanusForMultiModal with ReadJanusForMultiModalDLModel with Serializable
  6. object LLAVAForMultiModal extends ReadablePretrainedLLAVAForMultiModal with ReadLLAVAForMultiModalDLModel with Serializable
  7. object MLLamaForMultimodal extends ReadablePretrainedMLLamaForMultimodal with ReadMLLamaForMultimodalDLModel with Serializable
  8. object PaliGemmaForMultiModal extends ReadablePretrainedPaliGemmaForMultiModal with ReadPaliGemmaForMultiModalDLModel with Serializable
  9. object Phi3Vision extends ReadablePretrainedPhi3Vision with ReadPhi3VisionDLModel with Serializable
  10. object Qwen2VLTransformer extends ReadablePretrainedQwen2VLTransformer with ReadQwen2VLTransformerDLModel with Serializable
  11. object SmolVLMTransformer extends ReadablePretrainedSmolVLMTransformer with ReadSmolVLMTransformerDLModel with Serializable
  12. object SwinForImageClassification extends ReadablePretrainedSwinForImageModel with ReadSwinForImageDLModel with Serializable

    This is the companion object of SwinForImageClassification.

    This is the companion object of SwinForImageClassification. Please refer to that class for the documentation.

  13. object ViTForImageClassification extends ReadablePretrainedViTForImageModel with ReadViTForImageDLModel with Serializable

    This is the companion object of ViTForImageClassification.

    This is the companion object of ViTForImageClassification. Please refer to that class for the documentation.

  14. object VisionEncoderDecoderForImageCaptioning extends ReadablePretrainedVisionEncoderDecoderModel with ReadVisionEncoderDecoderDLModel with Serializable

    This is the companion object of VisionEncoderDecoderForImageCaptioning.

    This is the companion object of VisionEncoderDecoderForImageCaptioning. Please refer to that class for the documentation.
