Packages

package cv


Type Members

  1. class BLIPForQuestionAnswering extends AnnotatorModel[BLIPForQuestionAnswering] with HasBatchedAnnotateImage[BLIPForQuestionAnswering] with HasImageFeatureProperties with WriteTensorflowModel with HasEngine

    BLIPForQuestionAnswering can load BLIP models for visual question answering.

    BLIPForQuestionAnswering can load BLIP models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQAClassifier = BLIPForQuestionAnswering.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "blip_vqa_base", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/BLIPForQuestionAnsweringTest.scala.
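
    For example, a model exported from the Transformers library can be imported with the companion object's loadSavedModel helper and then saved in Spark NLP format, roughly as in the sketch below (the export path is hypothetical and the exact export steps are described in the linked discussion):

    import com.johnsnowlabs.nlp.annotator._

    // Hypothetical folder containing a BLIP model exported for Spark NLP
    val exportedModelPath = "/tmp/blip_vqa_base_exported"

    val importedVQA = BLIPForQuestionAnswering
      .loadSavedModel(exportedModelPath, spark)
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    // Persist in Spark NLP format so it can later be restored with BLIPForQuestionAnswering.load(...)
    importedVQA.write.overwrite().save("/tmp/blip_vqa_base_spark_nlp")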

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("What's this picture about?"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = BLIPForQuestionAnswering.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+------+
    |origin                                |result|
    +--------------------------------------+------+
    |[file:///content/images/cat_image.jpg]|[cats]|
    +--------------------------------------+------+
    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  2. class CLIPForZeroShotClassification extends AnnotatorModel[CLIPForZeroShotClassification] with HasBatchedAnnotateImage[CLIPForZeroShotClassification] with HasImageFeatureProperties with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine with HasRescaleFactor

    Zero Shot Image Classifier based on CLIP.

    Zero Shot Image Classifier based on CLIP.

    CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on image and text pairs. It can classify images without being trained on any hard-coded labels, which makes it very flexible: candidate labels can be provided during inference. This is similar to the zero-shot capabilities of GPT-2 and GPT-3.

    Pretrained models can be loaded with pretrained of the companion object:

    val imageClassifier = CLIPForZeroShotClassification.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("label")

    The default model is "zero_shot_classifier_clip_vit_base_patch32", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see CLIPForZeroShotClassificationTestSpec.

    Example

    import com.johnsnowlabs.nlp.ImageAssembler
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF = ResourceHelper.spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("src/test/resources/image/")
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val candidateLabels = Array(
      "a photo of a bird",
      "a photo of a cat",
      "a photo of a dog",
      "a photo of a hen",
      "a photo of a hippo",
      "a photo of a room",
      "a photo of a tractor",
      "a photo of an ostrich",
      "a photo of an ox")
    
    val imageClassifier = CLIPForZeroShotClassification
      .pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("label")
      .setCandidateLabels(candidateLabels)
    
    val pipeline =
      new Pipeline().setStages(Array(imageAssembler, imageClassifier)).fit(imageDF).transform(imageDF)
    
    pipeline
      .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "label.result")
      .show(truncate = false)
    +-----------------+-----------------------+
    |image_name       |result                 |
    +-----------------+-----------------------+
    |palace.JPEG      |[a photo of a room]    |
    |egyptian_cat.jpeg|[a photo of a cat]     |
    |hippopotamus.JPEG|[a photo of a hippo]   |
    |hen.JPEG         |[a photo of a hen]     |
    |ostrich.JPEG     |[a photo of an ostrich]|
    |junco.JPEG       |[a photo of a bird]    |
    |bluetick.jpg     |[a photo of a dog]     |
    |chihuahua.jpg    |[a photo of a dog]     |
    |tractor.JPEG     |[a photo of a tractor] |
    |ox.JPEG          |[a photo of an ox]     |
    +-----------------+-----------------------+
  3. class ConvNextForImageClassification extends SwinForImageClassification

    ConvNextForImageClassification is an image classifier based on ConvNeXT models.

    ConvNextForImageClassification is an image classifier based on ConvNeXT models.

    The ConvNeXT model was proposed in A ConvNet for the 2020s by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. ConvNeXT is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, that claims to outperform them.

    Pretrained models can be loaded with pretrained of the companion object:

    val imageClassifier = ConvNextForImageClassification.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")

    The default model is "image_classifier_convnext_tiny_224_local", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see ConvNextForImageClassificationTestSpec.

    References:

    A ConvNet for the 2020s

    Paper Abstract:

    The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.ImageAssembler
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("src/test/resources/image/")
    
    val imageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val imageClassifier = ConvNextForImageClassification
      .pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")
    
    val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
    val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
    
    pipelineDF
      .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result")
      .show(truncate = false)
    +-----------------+----------------------------------------------------------+
    |image_name       |result                                                    |
    +-----------------+----------------------------------------------------------+
    |palace.JPEG      |[palace]                                                  |
    |egyptian_cat.jpeg|[tabby, tabby cat]                                        |
    |hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
    |hen.JPEG         |[hen]                                                     |
    |ostrich.JPEG     |[ostrich, Struthio camelus]                               |
    |junco.JPEG       |[junco, snowbird]                                         |
    |bluetick.jpg     |[bluetick]                                                |
    |chihuahua.jpg    |[Chihuahua]                                               |
    |tractor.JPEG     |[tractor]                                                 |
    |ox.JPEG          |[ox]                                                      |
    +-----------------+----------------------------------------------------------+
  4. class Gemma3ForMultiModal extends AnnotatorModel[Gemma3ForMultiModal] with HasBatchedAnnotateImage[Gemma3ForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    Gemma3ForMultiModal can load Gemma3 Vision models for visual question answering.

    Gemma3ForMultiModal can load Gemma3 Vision models for visual question answering. The model consists of a vision encoder, a text encoder, a text decoder and a model merger. The vision encoder will encode the input image, the text encoder will encode the input text, the model merger will merge the image and text embeddings, and the text decoder will output the answer.

    Gemma 3 is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Key features include:

    • Large 128K context window
    • Multilingual support in over 140 languages
    • Multimodal capabilities handling both text and image inputs
    • Optimized for deployment on limited resources (laptops, desktops, cloud)

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = Gemma3ForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "gemma3_4b_it_int4", if no name is provided.

    For available pretrained models please see the Models Hub.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load(imageFolder)
    
    val testDF = imageDF.withColumn("text", lit("<bos><start_of_turn>user\nYou are a helpful assistant.\n\n<start_of_image>Describe this image in detail.<end_of_turn>\n<start_of_turn>model\n"))
    
    val imageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQA = Gemma3ForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQA
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(truncate = false)
  5. trait HasRescaleFactor extends AnyRef

    Enables parameters to handle rescaling for image pre-processors.
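
    A minimal sketch of how these parameters appear on an annotator that mixes in the trait, for example SwinForImageClassification (the values shown are illustrative, not the model's guaranteed defaults):

    import com.johnsnowlabs.nlp.annotator._

    val imageClassifier = SwinForImageClassification
      .pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")
      .setDoRescale(true)          // enable pixel rescaling before inference
      .setRescaleFactor(1 / 255.0) // scale 0-255 pixel values into the [0, 1] range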

  6. class JanusForMultiModal extends AnnotatorModel[JanusForMultiModal] with HasBatchedAnnotateImage[JanusForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    JanusForMultiModal can load Janus models for unified multimodal understanding and generation.

    JanusForMultiModal can load Janus models for unified multimodal understanding and generation. The model consists of a vision encoder, a text encoder, and a text decoder. Janus decouples visual encoding for enhanced flexibility, leveraging a unified transformer architecture for both understanding and generation tasks.

    Janus uses SigLIP-L as the vision encoder, supporting 384 x 384 image inputs. For image generation, it utilizes a tokenizer with a downsample rate of 16. The framework is based on DeepSeek-LLM-1.3b-base, trained on approximately 500B text tokens.

    Pretrained models can be loaded with pretrained from the companion object:

    val visualQA = JanusForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "janus_1_3b_int4" if no name is provided.

    For available pretrained models, please refer to the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. For compatibility details and import instructions, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For extended examples, refer to https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/JanusForMultiModalTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline

    val imageDF: DataFrame = ResourceHelper.spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load(imageFolder)

    val testDF: DataFrame = imageDF.withColumn("text", lit("User: <image_placeholder>Describe image in details Assistant:"))

    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")

    val visualQAClassifier = JanusForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))

    val result = pipeline.fit(testDF).transform(testDF)

    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+----------------------------------------------------------------------------------------+
    |origin                                |result                                                                                  |
    +--------------------------------------+----------------------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch.]|
    +--------------------------------------+----------------------------------------------------------------------------------------+

    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  7. class LLAVAForMultiModal extends AnnotatorModel[LLAVAForMultiModal] with HasBatchedAnnotateImage[LLAVAForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    LLAVAForMultiModal can load LLAVA Vision models for visual question answering.

    LLAVAForMultiModal can load LLAVA Vision models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = LLAVAForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "llava_1_5_7b_hf", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/LLAVAForMultiModalTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("USER: \n <|image|> \nWhat is unusual on this picture? \n ASSISTANT:\n"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = LLAVAForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |origin                                |result                                                                                 |
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
    +--------------------------------------+---------------------------------------------------------------------------------------+
    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  8. class MLLamaForMultimodal extends AnnotatorModel[MLLamaForMultimodal] with HasBatchedAnnotateImage[MLLamaForMultimodal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    MLLamaForMultimodal can load LLAMA 3.2 Vision models for visual question answering.

    MLLamaForMultimodal can load LLAMA 3.2 Vision models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.

    The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = MLLamaForMultimodal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "llama_3_2_11b_vision_instruct_int4", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/MLLamaForMultimodalTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>What is unusual on this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = MLLamaForMultimodal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |origin                                |result                                                                                 |
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
    +--------------------------------------+---------------------------------------------------------------------------------------+
    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  9. class PaliGemmaForMultiModal extends AnnotatorModel[PaliGemmaForMultiModal] with HasBatchedAnnotateImage[PaliGemmaForMultiModal] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    PaliGemmaForMultiModal can load PaliGemma Vision models for visual question answering.

    PaliGemmaForMultiModal can load PaliGemma Vision models for visual question answering. The model consists of a vision encoder, a text encoder, a text decoder and a model merger. The vision encoder will encode the input image, the text encoder will encode the input text, the model merger will merge the image and text embeddings, and the text decoder will output the answer.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = PaliGemmaForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "paligemma_3b_pt_224_int4", if no name is provided.

    For available pretrained models please see the Models Hub.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("USER: \n <|image|> \nWhat is unusual on this picture? \n ASSISTANT:\n"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = PaliGemmaForMultiModal.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
  10. class Phi3Vision extends AnnotatorModel[Phi3Vision] with HasBatchedAnnotateImage[Phi3Vision] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    Phi3Vision can load Phi3 Vision models for visual question answering.

    Phi3Vision can load Phi3 Vision models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = Phi3Vision.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "phi_3_vision_128k_instruct", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/Phi3VisionTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("<|user|> \n <|image_1|> \nWhat is unusual on this picture? <|end|>\n <|assistant|>\n"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = Phi3Vision.pretrained("phi_3_vision_128k_instruct","en")
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |origin                                |result                                                                                 |
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
    +--------------------------------------+---------------------------------------------------------------------------------------+
    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  11. class Qwen2VLTransformer extends AnnotatorModel[Qwen2VLTransformer] with HasBatchedAnnotateImage[Qwen2VLTransformer] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    Qwen2VLTransformer can load Qwen2 Vision-Language models for visual question answering and multimodal instruction following.

    Qwen2VLTransformer can load Qwen2 Vision-Language models for visual question answering and multimodal instruction following. The model consists of a vision encoder, a text encoder, and a text decoder. The vision encoder processes the input image, the text encoder integrates the encoding of the image with the input text, and the text decoder outputs the response to the query or instruction.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = Qwen2VLTransformer.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "qwen2_vl_2b_instruct_int4", if no name is provided.

    For available pretrained models, please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. To explore more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/Qwen2VLTransformerTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = Qwen2VLTransformer.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+----------------------------------------------------------------------------+
    |origin                                |result                                                                      |
    +--------------------------------------+----------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[This image is unusual because it features two cats lying on a pink couch.]|
    +--------------------------------------+----------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer-based classifiers

  12. trait ReadBLIPForQuestionAnsweringDLModel extends ReadTensorflowModel
  13. trait ReadCLIPForZeroShotClassificationModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
  14. trait ReadConvNextForImageDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
  15. trait ReadGemma3ForMultiModalDLModel extends ReadOpenvinoModel
  16. trait ReadJanusForMultiModalDLModel extends ReadOpenvinoModel
  17. trait ReadLLAVAForMultiModalDLModel extends ReadOpenvinoModel
  18. trait ReadMLLamaForMultimodalDLModel extends ReadOpenvinoModel
  19. trait ReadPaliGemmaForMultiModalDLModel extends ReadOpenvinoModel
  20. trait ReadPhi3VisionDLModel extends ReadOpenvinoModel
  21. trait ReadQwen2VLTransformerDLModel extends ReadOpenvinoModel
  22. trait ReadSmolVLMTransformerDLModel extends ReadOpenvinoModel
  23. trait ReadSwinForImageDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
  24. trait ReadViTForImageDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
  25. trait ReadVisionEncoderDecoderDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
  26. trait ReadablePretrainedBLIPForQuestionAnswering extends ParamsAndFeaturesReadable[BLIPForQuestionAnswering] with HasPretrained[BLIPForQuestionAnswering]
  27. trait ReadablePretrainedCLIPForZeroShotClassificationModel extends ParamsAndFeaturesReadable[CLIPForZeroShotClassification] with HasPretrained[CLIPForZeroShotClassification]
  28. trait ReadablePretrainedConvNextForImageModel extends ParamsAndFeaturesReadable[ConvNextForImageClassification] with HasPretrained[ConvNextForImageClassification]
  29. trait ReadablePretrainedGemma3ForMultiModal extends ParamsAndFeaturesReadable[Gemma3ForMultiModal] with HasPretrained[Gemma3ForMultiModal]
  30. trait ReadablePretrainedJanusForMultiModal extends ParamsAndFeaturesReadable[JanusForMultiModal] with HasPretrained[JanusForMultiModal]
  31. trait ReadablePretrainedLLAVAForMultiModal extends ParamsAndFeaturesReadable[LLAVAForMultiModal] with HasPretrained[LLAVAForMultiModal]
  32. trait ReadablePretrainedMLLamaForMultimodal extends ParamsAndFeaturesReadable[MLLamaForMultimodal] with HasPretrained[MLLamaForMultimodal]
  33. trait ReadablePretrainedPaliGemmaForMultiModal extends ParamsAndFeaturesReadable[PaliGemmaForMultiModal] with HasPretrained[PaliGemmaForMultiModal]
  34. trait ReadablePretrainedPhi3Vision extends ParamsAndFeaturesReadable[Phi3Vision] with HasPretrained[Phi3Vision]
  35. trait ReadablePretrainedQwen2VLTransformer extends ParamsAndFeaturesReadable[Qwen2VLTransformer] with HasPretrained[Qwen2VLTransformer]
  36. trait ReadablePretrainedSmolVLMTransformer extends ParamsAndFeaturesReadable[SmolVLMTransformer] with HasPretrained[SmolVLMTransformer]
  37. trait ReadablePretrainedSwinForImageModel extends ParamsAndFeaturesReadable[SwinForImageClassification] with HasPretrained[SwinForImageClassification]
  38. trait ReadablePretrainedViTForImageModel extends ParamsAndFeaturesReadable[ViTForImageClassification] with HasPretrained[ViTForImageClassification]
  39. trait ReadablePretrainedVisionEncoderDecoderModel extends ParamsAndFeaturesReadable[VisionEncoderDecoderForImageCaptioning] with HasPretrained[VisionEncoderDecoderForImageCaptioning]
  40. class SmolVLMTransformer extends AnnotatorModel[SmolVLMTransformer] with HasBatchedAnnotateImage[SmolVLMTransformer] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    SmolVLMTransformer can load SmolVLM models for visual question answering.

    SmolVLMTransformer can load SmolVLM models for visual question answering. The model consists of a vision encoder, a text encoder as well as a text decoder. The vision encoder will encode the input image, the text encoder will encode the input question together with the encoding of the image, and the text decoder will output the answer to the question.

    SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs to produce text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded on multiple images, or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val visualQA = SmolVLMTransformer.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")

    The default model is "smolvlm_instruct_int4", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/SmolVLMTransformerTest.scala.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = ResourceHelper.spark.read
     .format("image")
     .option("dropInvalid", value = true)
     .load(imageFolder)
    
    val testDF: DataFrame = imageDF.withColumn("text", lit("<|im_start|>User:<image>Can you describe the image?<end_of_utterance>\nAssistant:"))
    
    val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val visualQAClassifier = SmolVLMTransformer.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("answer")
    
    val pipeline = new Pipeline().setStages(Array(
      imageAssembler,
      visualQAClassifier
    ))
    
    val result = pipeline.fit(testDF).transform(testDF)
    
    result.select("image_assembler.origin", "answer.result").show(false)
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |origin                                |result                                                                                 |
    +--------------------------------------+---------------------------------------------------------------------------------------+
    |[file:///content/images/cat_image.jpg]|[The unusual aspect of this picture is the presence of two cats lying on a pink couch]|
    +--------------------------------------+---------------------------------------------------------------------------------------+
    See also

    CLIPForZeroShotClassification for zero-shot image classification

    Annotators Main Page for a list of transformer-based classifiers

  41. class SwinForImageClassification extends ViTForImageClassification with HasRescaleFactor

    SwinForImageClassification is an image classifier based on the Swin Transformer.

    SwinForImageClassification is an image classifier based on the Swin Transformer.

    The Swin Transformer was proposed in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

    It is essentially a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.

    Pretrained models can be loaded with pretrained of the companion object:

    val imageClassifier = SwinForImageClassification.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")

    The default model is "image_classifier_swin_base_patch4_window7_224", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see SwinForImageClassificationTest.

    References:

    Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

    Paper Abstract:

    This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.ImageAssembler
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("src/test/resources/image/")
    
    val imageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val imageClassifier = SwinForImageClassification
      .pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")
    
    val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
    val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
    
    pipelineDF
      .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result")
      .show(truncate = false)
    +-----------------+----------------------------------------------------------+
    |image_name       |result                                                    |
    +-----------------+----------------------------------------------------------+
    |palace.JPEG      |[palace]                                                  |
    |egyptian_cat.jpeg|[tabby, tabby cat]                                        |
    |hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
    |hen.JPEG         |[hen]                                                     |
    |ostrich.JPEG     |[ostrich, Struthio camelus]                               |
    |junco.JPEG       |[junco, snowbird]                                         |
    |bluetick.jpg     |[bluetick]                                                |
    |chihuahua.jpg    |[Chihuahua]                                               |
    |tractor.JPEG     |[tractor]                                                 |
    |ox.JPEG          |[ox]                                                      |
    +-----------------+----------------------------------------------------------+
  42. class ViTForImageClassification extends AnnotatorModel[ViTForImageClassification] with HasBatchedAnnotateImage[ViTForImageClassification] with HasImageFeatureProperties with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine

    Vision Transformer (ViT) for image classification.

    Vision Transformer (ViT) for image classification.

    ViT is a transformer-based alternative to the convolutional neural networks usually used for image recognition tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val imageClassifier = ViTForImageClassification.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")

    The default model is "image_classifier_vit_base_patch16_224", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see ViTImageClassificationTestSpec.

    References:

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Paper Abstract:

    While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.ImageAssembler
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("src/test/resources/image/")
    
    val imageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val imageClassifier = ViTForImageClassification
      .pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")
    
    val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
    val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
    
    pipelineDF
      .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result")
      .show(truncate = false)
    +-----------------+----------------------------------------------------------+
    |image_name       |result                                                    |
    +-----------------+----------------------------------------------------------+
    |palace.JPEG      |[palace]                                                  |
    |egyptian_cat.jpeg|[Egyptian cat]                                            |
    |hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
    |hen.JPEG         |[hen]                                                     |
    |ostrich.JPEG     |[ostrich, Struthio camelus]                               |
    |junco.JPEG       |[junco, snowbird]                                         |
    |bluetick.jpg     |[bluetick]                                                |
    |chihuahua.jpg    |[Chihuahua]                                               |
    |tractor.JPEG     |[tractor]                                                 |
    |ox.JPEG          |[ox]                                                      |
    +-----------------+----------------------------------------------------------+
  43. class VisionEncoderDecoderForImageCaptioning extends AnnotatorModel[VisionEncoderDecoderForImageCaptioning] with HasBatchedAnnotateImage[VisionEncoderDecoderForImageCaptioning] with HasImageFeatureProperties with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine with HasRescaleFactor with HasGeneratorProperties

    VisionEncoderDecoder model that converts images into text captions.

    VisionEncoderDecoder model that converts images into text captions. It allows for the use of pretrained vision auto-encoding models, such as ViT, BEiT, or DeiT as the encoder, in combination with pretrained language models, like RoBERTa, GPT2, or BERT as the decoder.

    Pretrained models can be loaded with pretrained of the companion object:

    val imageClassifier = VisionEncoderDecoderForImageCaptioning.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("caption")

    The default model is "image_captioning_vit_gpt2", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see VisionEncoderDecoderTestSpec.

    Note:

    This is a very computationally expensive module, especially with larger batch sizes. The use of an accelerator such as a GPU is recommended.
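
    A batch size suited to the available hardware can be set directly on the annotator, as in the minimal sketch below (the value shown is illustrative only; setBatchSize comes from the batched-annotate interface):

    import com.johnsnowlabs.nlp.annotator._

    // Sketch: lower the per-partition batch size to reduce memory pressure on CPU-only clusters.
    // The optimal value depends on hardware and image resolution; 2 is only an example.
    val captioningOnCpu = VisionEncoderDecoderForImageCaptioning
      .pretrained()
      .setBatchSize(2)
      .setInputCols("image_assembler")
      .setOutputCol("caption")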

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.ImageAssembler
    import org.apache.spark.ml.Pipeline
    
    val imageDF: DataFrame = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("src/test/resources/image/")
    
    val imageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")
    
    val imageCaptioning = VisionEncoderDecoderForImageCaptioning
      .pretrained()
      .setBeamSize(2)
      .setDoSample(false)
      .setInputCols("image_assembler")
      .setOutputCol("caption")
    
    val pipeline = new Pipeline().setStages(Array(imageAssembler, imageCaptioning))
    val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
    
    pipelineDF
      .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "caption.result")
      .show(truncate = false)
    
    +-----------------+---------------------------------------------------------+
    |image_name       |result                                                   |
    +-----------------+---------------------------------------------------------+
    |palace.JPEG      |[a large room filled with furniture and a large window]  |
    |egyptian_cat.jpeg|[a cat laying on a couch next to another cat]            |
    |hippopotamus.JPEG|[a brown bear in a body of water]                        |
    |hen.JPEG         |[a flock of chickens standing next to each other]        |
    |ostrich.JPEG     |[a large bird standing on top of a lush green field]     |
    |junco.JPEG       |[a small bird standing on a wet ground]                  |
    |bluetick.jpg     |[a small dog standing on a wooden floor]                 |
    |chihuahua.jpg    |[a small brown dog wearing a blue sweater]               |
    |tractor.JPEG     |[a man is standing in a field with a tractor]            |
    |ox.JPEG          |[a large brown cow standing on top of a lush green field]|
    +-----------------+---------------------------------------------------------+

Value Members

  1. object BLIPForQuestionAnswering extends ReadablePretrainedBLIPForQuestionAnswering with ReadBLIPForQuestionAnsweringDLModel with Serializable
  2. object CLIPForZeroShotClassification extends ReadablePretrainedCLIPForZeroShotClassificationModel with ReadCLIPForZeroShotClassificationModel with Serializable

    This is the companion object of CLIPForZeroShotClassification.

    This is the companion object of CLIPForZeroShotClassification. Please refer to that class for the documentation.
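
    As a brief, illustrative sketch, the companion object is the entry point for fetching a pretrained model by name and language and for reloading a model saved in Spark NLP format (the paths and language code below are shown for illustration only):

    import com.johnsnowlabs.nlp.annotator._

    // Fetch a specific pretrained model from the Models Hub
    val clipClassifier = CLIPForZeroShotClassification
      .pretrained("zero_shot_classifier_clip_vit_base_patch32", "en")
      .setInputCols("image_assembler")
      .setOutputCol("label")

    // Save in Spark NLP format and restore it later through the companion object
    clipClassifier.write.overwrite().save("/tmp/clip_zero_shot_model")
    val restored = CLIPForZeroShotClassification.load("/tmp/clip_zero_shot_model")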

  3. object ConvNextForImageClassification extends ReadablePretrainedConvNextForImageModel with ReadConvNextForImageDLModel with Serializable

    This is the companion object of ConvNextForImageClassification.

    This is the companion object of ConvNextForImageClassification. Please refer to that class for the documentation.

  4. object Gemma3ForMultiModal extends ReadablePretrainedGemma3ForMultiModal with ReadGemma3ForMultiModalDLModel with Serializable
  5. object JanusForMultiModal extends ReadablePretrainedJanusForMultiModal with ReadJanusForMultiModalDLModel with Serializable
  6. object LLAVAForMultiModal extends ReadablePretrainedLLAVAForMultiModal with ReadLLAVAForMultiModalDLModel with Serializable
  7. object MLLamaForMultimodal extends ReadablePretrainedMLLamaForMultimodal with ReadMLLamaForMultimodalDLModel with Serializable
  8. object PaliGemmaForMultiModal extends ReadablePretrainedPaliGemmaForMultiModal with ReadPaliGemmaForMultiModalDLModel with Serializable
  9. object Phi3Vision extends ReadablePretrainedPhi3Vision with ReadPhi3VisionDLModel with Serializable
  10. object Qwen2VLTransformer extends ReadablePretrainedQwen2VLTransformer with ReadQwen2VLTransformerDLModel with Serializable
  11. object SmolVLMTransformer extends ReadablePretrainedSmolVLMTransformer with ReadSmolVLMTransformerDLModel with Serializable
  12. object SwinForImageClassification extends ReadablePretrainedSwinForImageModel with ReadSwinForImageDLModel with Serializable

    This is the companion object of SwinForImageClassification.

    This is the companion object of SwinForImageClassification. Please refer to that class for the documentation.

  13. object ViTForImageClassification extends ReadablePretrainedViTForImageModel with ReadViTForImageDLModel with Serializable

    This is the companion object of ViTForImageClassification.

    This is the companion object of ViTForImageClassification. Please refer to that class for the documentation.

  14. object VisionEncoderDecoderForImageCaptioning extends ReadablePretrainedVisionEncoderDecoderModel with ReadVisionEncoderDecoderDLModel with Serializable

    This is the companion object of VisionEncoderDecoderForImageCaptioning.

    This is the companion object of VisionEncoderDecoderForImageCaptioning. Please refer to that class for the documentation.
