`sparknlp.reader.reader_assembler`

Module Contents

Classes

ReaderAssembler

The ReaderAssembler annotator provides a unified interface for combining multiple Spark NLP readers into a single, configurable component.

class ReaderAssembler

The ReaderAssembler annotator provides a unified interface for combining multiple Spark NLP readers (such as Reader2Doc, Reader2Table, and Reader2Image) into a single, configurable component. It automatically orchestrates the execution of different readers based on input type, configured priorities, and fallback strategies, allowing you to handle diverse content formats without manually chaining multiple readers in your pipeline.

ReaderAssembler simplifies the process of building flexible pipelines capable of ingesting and processing documents, tables, and images in a consistent way. It handles reader selection, ordering, and fault-tolerance internally, ensuring that pipelines remain concise, robust, and easy to maintain.
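The idea of ordering readers by priority and falling back when one fails can be illustrated with a small, self-contained sketch. This is a conceptual toy in plain Python, not Spark NLP's implementation: the functions `read_table`, `read_text`, and `run_with_fallback`, as well as the numeric priority scheme, are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Tuple


# Hypothetical toy readers (not Spark NLP classes): each returns a label or raises.
def read_table(content: str) -> str:
    if "<table" not in content:
        raise ValueError("no table markup found")
    return "parsed as table"


def read_text(content: str) -> str:
    return "parsed as text"


Reader = Tuple[int, str, Callable[[str], str]]  # (priority, name, function)


def run_with_fallback(content: str, readers: List[Reader]) -> Tuple[str, str]:
    """Try readers from lowest to highest priority number; return the first success."""
    errors = []
    for _, name, reader in sorted(readers, key=lambda r: r[0]):
        try:
            return name, reader(content)
        except ValueError as exc:
            # Remember the failure and fall back to the next reader.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all readers failed: " + "; ".join(errors))


# Assumed convention for this sketch: a lower number means the reader is tried first.
readers = [(2, "text", read_text), (1, "table", read_table)]
print(run_with_fallback("<table><tr><td>1</td></tr></table>", readers))
print(run_with_fallback("<p>hello</p>", readers))
```

In this sketch the table reader is tried first; when the content has no table markup, the failure is recorded and the text reader handles the input instead.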

Examples

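>>> import sparknlp
>>> from sparknlp.reader.reader_assembler import ReaderAssembler
>>> from pyspark.ml import Pipeline
>>>
>>> spark = sparknlp.start()
>>> # Placeholder input DataFrame (assumed here); the content itself is read
>>> # from the path configured with setContentPath.
>>> empty_data_set = spark.createDataFrame([[""]]).toDF("text")
>>>
>>> reader_assembler = ReaderAssembler() \
...     .setContentType("text/html") \
...     .setContentPath("/table-image.html") \
...     .setOutputCol("document")
>>>
>>> pipeline = Pipeline(stages=[reader_assembler])
>>> pipeline_model = pipeline.fit(empty_data_set)
>>> result_df = pipeline_model.transform(empty_data_set)
>>>
>>> result_df.show()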
+--------+--------------------+--------------------+--------------------+---------+
|fileName|       document_text|      document_table|      document_image|exception|
+--------+--------------------+--------------------+--------------------+---------+
|    null|[{'document', 0, 26...|[{'document', 0, 50...|[{'image', , 5, 5, ...|     null|
+--------+--------------------+--------------------+--------------------+---------+

This annotator is especially useful when working with heterogeneous input data — for example, when a dataset includes PDFs, spreadsheets, and images — allowing Spark NLP to automatically invoke the appropriate reader for each file type while preserving a unified schema in the output.
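To make the "unified schema" idea concrete, the following plain-Python sketch routes files to an output column by extension and always returns a row with the same keys. It is only an illustration under stated assumptions: the extension-to-column mapping and the helper `to_unified_row` are hypothetical, and the column names simply mirror the example output above. It does not describe how Spark NLP selects readers internally.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional

# Column names mirror the example output above.
COLUMNS = ("fileName", "document_text", "document_table", "document_image", "exception")

# Hypothetical mapping from file extension to the column a matching reader would fill.
EXTENSION_TO_COLUMN = {
    ".pdf": "document_text",
    ".txt": "document_text",
    ".csv": "document_table",
    ".xlsx": "document_table",
    ".png": "document_image",
    ".jpg": "document_image",
}


def to_unified_row(path: str) -> Dict[str, Optional[str]]:
    """Return a row with the same keys for every file, whatever its type."""
    file_path = PurePosixPath(path)
    row: Dict[str, Optional[str]] = dict.fromkeys(COLUMNS)  # every column starts as None
    row["fileName"] = file_path.name
    column = EXTENSION_TO_COLUMN.get(file_path.suffix.lower())
    if column is None:
        # Unknown types are reported in the exception column instead of failing.
        row["exception"] = f"unsupported file type: {file_path.suffix or '(none)'}"
    else:
        row[column] = f"<content of {file_path.name}>"  # stand-in for real parsed output
    return row


for path in ["/data/report.pdf", "/data/sales.xlsx", "/data/scan.PNG", "/data/notes.xyz"]:
    print(to_unified_row(path))
```

Because each row carries the same keys, downstream code can treat a mixed batch of files uniformly and inspect the `exception` column for inputs that could not be handled.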

name = 'ReaderAssembler'

outputAnnotatorType = 'document'

excludeNonText

userMessage

promptTemplate

customPromptTemplate

setParams()

setExcludeNonText(value)

Sets whether to exclude non-text content from the output.

Parameters:

value (bool): Whether to exclude non-text content from the output. Default is False.

setUserMessage(value: str)

Sets a custom user message.

Parameters:

value (str): Custom user message to include.

setPromptTemplate(value: str)

Sets the format of the output prompt.

Parameters:

value (str): Prompt template format.

setCustomPromptTemplate(value: str)

Sets a custom prompt template for image models.

Parameters:

value (str): Custom prompt template string.
