Description
Visual Question Answering using InternVL.
InternVLForMultiModal can load InternVL Vision models for visual question answering. The model consists of a vision encoder, a text encoder, a text decoder and a model merger. The vision encoder will encode the input image, the text encoder will encode the input text, the model merger will merge the image and text embeddings, and the text decoder will output the answer.
InternVL 2.5 is an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. Key features include:
- Large context window support
- Multilingual support
- Multimodal capabilities handling both text and image inputs
- Optimized for deployment with int4 quantization
Predicted Entities
How to use
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import lit
image_df = spark.read.format("image").load(path=images_path) # Replace with your image path
test_df = image_df.withColumn("text", lit("<|im_start|><image>\nDescribe this image in detail.<|im_end|><|im_start|>assistant\n"))
imageAssembler = ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
visualQAClassifier = InternVLForMultiModal.pretrained()
.setInputCols("image_assembler")
.setOutputCol("answer")
pipeline = Pipeline().setStages([
imageAssembler,
visualQAClassifier
])
result = pipeline.fit(test_df).transform(test_df)
result.select("image_assembler.origin", "answer.result").show(False)
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
val imageFolder = "path/to/your/images" // Replace with your image path
val imageDF: DataFrame = spark.read
.format("image")
.option("dropInvalid", value = true)
.load(imageFolder)
val testDF: DataFrame = imageDF.withColumn("text", lit("<|im_start|><image>\nDescribe this image in detail.<|im_end|><|im_start|>assistant\n"))
val imageAssembler: ImageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val visualQAClassifier = InternVLForMultiModal.pretrained()
.setInputCols("image_assembler")
.setOutputCol("answer")
val pipeline = new Pipeline().setStages(Array(
imageAssembler,
visualQAClassifier
))
val result = pipeline.fit(testDF).transform(testDF)
result.select("image_assembler.origin", "answer.result").show(false)
Model Information
Model Name: | internvl2_5_1b_int4 |
Compatibility: | Spark NLP 5.5.1+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [image_assembler] |
Output Labels: | [answer] |
Language: | en |
Size: | 697.9 MB |
PREVIOUSInternVL 2 1B int4