Phi-3-vision-128k-instruct

Description

The Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model built upon datasets which include - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data both on text and vision. The model belongs to the Phi-3 model family, and the multimodal version comes with 128K context length (in tokens) it can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.

Predicted Entities

Download Copy S3 URI

How to use

url1 = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
url2 = "http://images.cocodataset.org/val2017/000000039769.jpg"

Path("images").mkdir(exist_ok=True)

!wget -q -O images/image1.jpg {url1}
!wget -q -O images/image2.jpg {url2}



images_path = "file://" + os.getcwd() + "/images/"
image_df = spark.read.format("image").load(
    path=images_path
)

test_df = image_df.withColumn("text", lit("<|user|> \n <|image_1|> \n What's this picture about? <|end|>\n <|assistant|>\n"))

image_assembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

imageClassifier = Phi3Vision.pretrained("phi_3_vision_128k_instruct","en")\
            .setMaxOutputLength(50) \
            .setInputCols("image_assembler") \
            .setOutputCol("answer")

pipeline = Pipeline(
            stages=[
                image_assembler,
                imageClassifier,
            ]
        )

results = pipeline.fit(test_df).transform(test_df)
val imageFolder = "src/test/resources/image/"
    val imageDF: DataFrame = ResourceHelper.spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load(imageFolder)

    val testDF: DataFrame = imageDF.withColumn(
      "text",
      lit("<|user|> \n <|image_1|> \n What's this picture about? <|end|>\n <|assistant|>\n"))

val imageAssembler: ImageAssembler = new ImageAssembler()
      .setInputCol("image")
      .setOutputCol("image_assembler")

val loadModel = Phi3Vision
      .pretrained("phi_3_vision_128k_instruct","en")
      .setInputCols("image_assembler")
      .setOutputCol("answer")
      .setMaxOutputLength(50)

    val newPipeline: Pipeline =
      new Pipeline().setStages(Array(imageAssembler, loadModel))

    newPipeline.fit(testDF).transform(testDF).show()

Model Information

Model Name: phi_3_vision_128k_instruct
Compatibility: Spark NLP 5.5.1+
License: Open Source
Edition: Official
Input Labels: [image_assembler]
Output Labels: [answer]
Language: en
Size: 3.3 GB

References

https://huggingface.co/microsoft/Phi-3-vision-128k-instruct