sparknlp.base.multi_column_assembler

Contains classes for the MultiColumnAssembler.

Module Contents

Classes

MultiColumnAssembler

Merges multiple annotation columns into a single annotation column.

class MultiColumnAssembler

Merges multiple annotation columns into a single annotation column.

This is useful when several annotators produce separate annotation columns (e.g., document_text and document_table from ReaderAssembler) but a downstream annotator (e.g., AutoGGUFVisionModel) expects a single input column containing all annotations.

Annotations from all input columns are collected and concatenated into the output column. The output annotator type defaults to DOCUMENT but can be configured. Each annotation’s metadata is preserved, and a source_column key is added to track the original column name.

Note: All input columns must use the Annotation schema. Columns using AnnotationImage schema (e.g., IMAGE-typed columns from ReaderAssembler) are not supported.
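The merge behaviour described above can be sketched in plain Python without a Spark session. This is only a conceptual model, not the library's implementation: the `Annotation` dataclass and the `merge_columns` helper are hypothetical stand-ins, and the sketch assumes that every merged annotation is relabelled with the configured output type.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """Simplified, hypothetical stand-in for a Spark NLP annotation."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def merge_columns(row: Dict[str, List[Annotation]],
                  input_cols: List[str],
                  output_type: str = "document") -> List[Annotation]:
    """Concatenate annotations from input_cols, in column order."""
    merged = []
    for col in input_cols:
        for ann in row[col]:
            merged.append(Annotation(
                annotator_type=output_type,  # assumption: type is rewritten
                begin=ann.begin,
                end=ann.end,
                result=ann.result,
                # Copy the original metadata and record where it came from.
                metadata={**ann.metadata, "source_column": col},
            ))
    return merged


row = {
    "document_text": [Annotation("document", 0, 10, "Hello world", {"sentence": "0"})],
    "document_table": [Annotation("document", 0, 9, "Name | Age", {"sentence": "0"})],
}
merged = merge_columns(row, ["document_text", "document_table"])
for ann in merged:
    print(ann.result, ann.metadata["source_column"])
# Hello world document_text
# Name | Age document_table
```

The existing metadata keys survive the merge, so a downstream consumer can still read them and use source_column to tell the inputs apart.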

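====================== ======================
Input Annotation types Output Annotation type
====================== ======================
DOCUMENT               DOCUMENT
====================== ======================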

Parameters:
inputCols

Input annotation columns to merge

outputCol

Output annotation column name

outputAsAnnotatorType

The annotator type to use for the output annotations (Default: document)

sortByBegin

Whether to sort merged annotations by their begin position (Default: False)
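The effect of sortByBegin can be illustrated with a small pure-Python sketch. It is a conceptual model under stated assumptions, not the library's code: `merged_order` is a hypothetical helper, annotations are reduced to (begin, result) pairs, and ties on begin are assumed to keep input-column order (a stable sort).

```python
from typing import Dict, List, Tuple

Span = Tuple[int, str]  # (begin, result); a simplified annotation


def merged_order(columns: Dict[str, List[Span]],
                 input_cols: List[str],
                 sort_by_begin: bool = False) -> List[str]:
    """Return merged results, optionally ordered by begin position."""
    # Default behaviour: concatenate in the order the columns are listed.
    merged = [span for col in input_cols for span in columns[col]]
    if sort_by_begin:
        # sorted() is stable, so equal begin values keep column order
        # (an assumption about tie-breaking, not confirmed by the docs).
        merged = sorted(merged, key=lambda span: span[0])
    return [result for _, result in merged]


columns = {
    "document_text": [(0, "Hello"), (6, "world")],
    "document_table": [(0, "Name"), (7, "Age")],
}
input_cols = ["document_text", "document_table"]

print(merged_order(columns, input_cols))
# ['Hello', 'world', 'Name', 'Age']
print(merged_order(columns, input_cols, sort_by_begin=True))
# ['Hello', 'Name', 'world', 'Age']
```

Because begin offsets are relative to each column's own source text, sorting interleaves annotations from different columns. That is usually only meaningful when all input columns were derived from the same text.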

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler1 = DocumentAssembler() \
...     .setInputCol("text1") \
...     .setOutputCol("document_text")
>>> documentAssembler2 = DocumentAssembler() \
...     .setInputCol("text2") \
...     .setOutputCol("document_table")
>>> multiColumnAssembler = MultiColumnAssembler() \
...     .setInputCols(["document_text", "document_table"]) \
...     .setOutputCol("merged_document")
>>> data = spark.createDataFrame([("Hello world", "Name | Age")]).toDF("text1", "text2")
>>> pipeline = Pipeline().setStages([documentAssembler1, documentAssembler2, multiColumnAssembler]).fit(data)
>>> result = pipeline.transform(data)
>>> result.selectExpr("merged_document.result").show(truncate=False)
+-------------------------+
|result                   |
+-------------------------+
|[Hello world, Name | Age]|
+-------------------------+
inputAnnotatorTypes
outputAnnotatorType = 'document'
outputAsAnnotatorType
sortByBegin
name = 'MultiColumnAssembler'
setParams()
setInputCols(*value)

Sets input annotation columns to merge.

Parameters:
*value : str

Input column names

setOutputAsAnnotatorType(value)

Sets the annotator type for the output annotations.

Parameters:
value : str

The annotator type (e.g., "document", "chunk", "table")

setSortByBegin(value)

Sets whether to sort merged annotations by begin position.

Parameters:
value : bool

Whether to sort by begin position