`sparknlp.base.multi_column_assembler`#

Contains classes for the MultiColumnAssembler.

Module Contents#

Classes#

MultiColumnAssembler

Merges multiple annotation columns into a single annotation column.

class MultiColumnAssembler[source]#

Merges multiple annotation columns into a single annotation column.

This is useful when multiple annotators produce separate annotation columns (e.g., document_text, document_table from ReaderAssembler) and a downstream annotator (e.g., AutoGGUFVisionModel) expects a single input column containing all annotations.

Annotations from all input columns are collected and concatenated into the output column. The output annotator type defaults to DOCUMENT but can be configured. Each annotation’s metadata is preserved, and a source_column key is added to track the original column name.

Note: All input columns must use the Annotation schema. Columns using AnnotationImage schema (e.g., IMAGE-typed columns from ReaderAssembler) are not supported.
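The merge behavior described above can be illustrated without Spark. The following is a minimal pure-Python sketch, not the library's implementation: the `Annotation` dataclass and the `merge_columns` helper are hypothetical stand-ins introduced only for this illustration. It shows annotations being concatenated in input-column order, with existing metadata kept and a `source_column` key added.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """Simplified, hypothetical stand-in for a Spark NLP annotation."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def merge_columns(row: Dict[str, List[Annotation]],
                  input_cols: List[str]) -> List[Annotation]:
    """Concatenate annotations from input_cols, tagging each with its source column."""
    merged = []
    for col in input_cols:
        for ann in row.get(col, []):
            # Copy the metadata so the input annotation is left untouched.
            meta = dict(ann.metadata)
            meta["source_column"] = col
            merged.append(
                Annotation(ann.annotator_type, ann.begin, ann.end, ann.result, meta)
            )
    return merged


row = {
    "document_text": [Annotation("document", 0, 10, "Hello world", {"sentence": "0"})],
    "document_table": [Annotation("document", 0, 9, "Name | Age", {"sentence": "0"})],
}
merged = merge_columns(row, ["document_text", "document_table"])
print([a.result for a in merged])
print([a.metadata["source_column"] for a in merged])
```

In this sketch the `source_column` value is what lets a downstream consumer tell which input column each merged annotation came from.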

Input Annotation types	Output Annotation type
`DOCUMENT`	`DOCUMENT`

Parameters:

inputCols: Input annotation columns to merge
outputCol: Output annotation column name
outputAsAnnotatorType: The annotator type to use for the output annotations (Default: document)
sortByBegin: Whether to sort merged annotations by their begin position (Default: False)
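To make the effect of `outputAsAnnotatorType` and `sortByBegin` concrete, here is a small pure-Python sketch. It is an assumption-laden illustration rather than the actual implementation: the `finalize` helper and the dict-based annotations are hypothetical, and the tie-breaking shown (annotations with equal `begin` keep their original order) is a property of Python's stable `sorted`, not something this page documents for the annotator itself.

```python
from typing import Dict, List


def finalize(annotations: List[Dict],
             output_type: str = "document",
             sort_by_begin: bool = False) -> List[Dict]:
    """Relabel merged annotations and optionally order them by begin position."""
    # Relabel every annotation with the configured output type.
    relabeled = [{**a, "annotatorType": output_type} for a in annotations]
    if sort_by_begin:
        # sorted() is stable: equal begin values keep their incoming order.
        relabeled = sorted(relabeled, key=lambda a: a["begin"])
    return relabeled


merged = [
    {"annotatorType": "document", "begin": 12, "result": "second sentence"},
    {"annotatorType": "document", "begin": 0, "result": "Name | Age"},
    {"annotatorType": "document", "begin": 0, "result": "Hello world"},
]

unsorted_docs = finalize(merged)
sorted_chunks = finalize(merged, output_type="chunk", sort_by_begin=True)
print([a["result"] for a in unsorted_docs])
print([(a["annotatorType"], a["begin"], a["result"]) for a in sorted_chunks])
```

With the defaults the incoming order is kept; enabling the sort only reorders by `begin`, which is mainly meaningful when the input columns share one coordinate space (for example, offsets into the same source text).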

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler1 = DocumentAssembler() \
...     .setInputCol("text1") \
...     .setOutputCol("document_text")
>>> documentAssembler2 = DocumentAssembler() \
...     .setInputCol("text2") \
...     .setOutputCol("document_table")
>>> multiColumnAssembler = MultiColumnAssembler() \
...     .setInputCols(["document_text", "document_table"]) \
...     .setOutputCol("merged_document")
>>> data = spark.createDataFrame([("Hello world", "Name | Age")]).toDF("text1", "text2")
>>> pipeline = Pipeline().setStages([documentAssembler1, documentAssembler2, multiColumnAssembler]).fit(data)
>>> result = pipeline.transform(data)
>>> result.selectExpr("merged_document.result").show(truncate=False)
+---------------------------+
|result                     |
+---------------------------+
|[Hello world, Name | Age]  |
+---------------------------+

inputAnnotatorTypes[source]#

outputAnnotatorType = 'document'[source]#

outputAsAnnotatorType[source]#

sortByBegin[source]#

name = 'MultiColumnAssembler'[source]#

setParams()[source]#

setInputCols(*value)[source]#

Sets input annotation columns to merge.

Parameters:

*value (str): Input column names

setOutputAsAnnotatorType(value)[source]#

Sets the annotator type for the output annotations.

Parameters:

value (str): The annotator type (e.g., "document", "chunk", "table")

setSortByBegin(value)[source]#

Sets whether to sort merged annotations by begin position.

Parameters:

value (bool): Whether to sort by begin position
