sparknlp.reader.reader2image

Module Contents

Classes

Reader2Image

The Reader2Image annotator lets you read files that contain images within existing Spark NLP workflows.

class Reader2Image

The Reader2Image annotator lets you read files that contain images directly within existing Spark NLP workflows, so existing pipelines can be reused. It extracts structured image content from various document types using Spark NLP readers. It reads from several file types and returns the parsed output as a structured Spark DataFrame.

Supported formats include HTML and Markdown.

Example

This example demonstrates how to load HTML files with images and process them into a structured Spark DataFrame using Reader2Image.
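A minimal sketch of such a pipeline is shown below. The setters setContentType, setContentPath, and setOutputCol are assumptions carried over from other Spark NLP readers and are not documented in this section; the helper names and the default file path are hypothetical. Imports sit inside the function so the sketch can be loaded without Spark installed.

```python
def build_reader2image_pipeline(content_path, content_type="text/html", output_col="image"):
    """Build a one-stage pipeline that reads images from the files at content_path."""
    # Local imports: the sketch can be loaded even where Spark is not installed.
    from pyspark.ml import Pipeline
    from sparknlp.reader.reader2image import Reader2Image

    reader2image = (
        Reader2Image()
        .setContentType(content_type)  # assumed setter, not documented in this section
        .setContentPath(content_path)  # assumed setter, not documented in this section
        .setOutputCol(output_col)      # assumed setter, not documented in this section
    )
    return Pipeline(stages=[reader2image])


def run_example(spark, content_path="./example-images.html"):
    """Fit and apply the pipeline on an empty DataFrame; the reader loads the files itself."""
    pipeline = build_reader2image_pipeline(content_path)
    empty_df = spark.createDataFrame([], "text string")
    return pipeline.fit(empty_df).transform(empty_df)
```

With an active Spark NLP session, `run_example(spark).show()` would be the call that prints a table like the one under "Expected output".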

Expected output:

+-------------------+--------------------+
|           fileName|               image|
+-------------------+--------------------+
|example-images.html|[{image, example-...|
|example-images.html|[{image, example-...|
+-------------------+--------------------+

Schema:

root
 |-- fileName: string (nullable = true)
 |-- image: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- origin: string (nullable = true)
 |    |    |-- height: integer (nullable = false)
 |    |    |-- width: integer (nullable = false)
 |    |    |-- nChannels: integer (nullable = false)
 |    |    |-- mode: integer (nullable = false)
 |    |    |-- result: binary (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- text: string (nullable = true)
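Because the image column is an array of structs, per-image fields are easiest to inspect after exploding it. The helper below is a sketch, not part of the Reader2Image API: `summarize_images` is a hypothetical name, and it assumes a DataFrame with the schema listed above.

```python
def summarize_images(result_df):
    """Return one row per extracted image with its origin and dimensions.

    Assumes result_df has the schema shown above: a string column `fileName`
    and an array-of-struct column `image`.
    """
    return (
        result_df
        # explode() turns each array element into its own row.
        .selectExpr("fileName", "explode(image) AS img")
        # Dotted names select fields from the exploded struct.
        .select("fileName", "img.origin", "img.height", "img.width", "img.nChannels")
    )
```

The binary `result` field is left out on purpose, since printing raw image bytes is rarely useful.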

name = 'Reader2Image'
outputAnnotatorType = 'image'
userMessage
promptTemplate
customPromptTemplate
useEncodedImageBytes
outputPromptColumn
setParams()
setUserMessage(value: str)

Sets a custom user message.

Parameters:
value : str

Custom user message to include.

setPromptTemplate(value: str)

Sets the format of the output prompt.

Parameters:
value : str

Prompt template format.

setCustomPromptTemplate(value: str)

Sets a custom prompt template for image models.

Parameters:
value : str

Custom prompt template string.

setUseEncodedImageBytes(value: bool)

Sets whether to use encoded image bytes or decoded pixels.

Parameters:
value : bool

If True, keeps the image bytes in their encoded (compressed) form. If False, decodes the image into a pixel matrix representation.
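To make the difference between the two representations concrete, the standard-library sketch below builds a tiny PNG, decodes it, and compares sizes. It only illustrates encoded versus decoded data and is not how Reader2Image handles images internally; `make_png` and `decode_pixels` are hypothetical helpers that handle only unfiltered 8-bit RGB images.

```python
import struct
import zlib


def make_png(width, height, rgb=(255, 0, 0)):
    """Build a minimal 8-bit RGB PNG filled with a single color."""
    def chunk(tag, data):
        body = tag + data
        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))

    # Each scanline is a filter byte (0 = none) followed by `width` RGB triples.
    row = b"\x00" + bytes(rgb) * width
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", ihdr)
        + chunk(b"IDAT", zlib.compress(row * height))
        + chunk(b"IEND", b"")
    )


def decode_pixels(png):
    """Decode a PNG made by make_png into (width, height, raw RGB bytes)."""
    pos, width, height, idat = 8, 0, 0, b""
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        tag = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if tag == b"IHDR":
            width, height = struct.unpack(">II", data[:8])
        elif tag == b"IDAT":
            idat += data
        pos += 12 + length  # 4 length + 4 tag + data + 4 CRC
    raw = zlib.decompress(idat)
    stride = 1 + 3 * width
    # Drop the leading filter byte of every scanline.
    pixels = b"".join(raw[r * stride + 1:(r + 1) * stride] for r in range(height))
    return width, height, pixels


encoded = make_png(64, 64)
width, height, pixels = decode_pixels(encoded)
n_channels = 3
# The decoded form always occupies height * width * nChannels bytes.
print(len(encoded), len(pixels), height * width * n_channels)
```

The decoded size is fixed by the height, width, and nChannels fields of the output schema, while the encoded size depends on how well the image compresses.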

setOutputPromptColumn(value: bool)

Enables or disables creation of a prompt column.

Parameters:
value : bool

If True, adds an additional ‘prompt’ column to the output DataFrame containing the text prompt as a Spark NLP Annotation.
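As a sketch of how the prompt-related setters might be combined, the helper below takes an already constructed Reader2Image and enables the prompt column. `enable_prompt_column` is a hypothetical name and the default message is only an example; the sketch assumes each setter returns the annotator itself, as Spark ML setters usually do.

```python
def enable_prompt_column(reader, user_message="Describe this image."):
    """Set a user message and turn on the extra 'prompt' output column.

    `reader` is expected to be a Reader2Image instance. Chaining assumes
    each setter returns the annotator (an assumption, not confirmed here).
    """
    return reader.setUserMessage(user_message).setOutputPromptColumn(True)
```

Taking the reader as an argument keeps the helper independent of how the reader's input path and content type were configured.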