sparknlp.base.multi_column_assembler
Contains classes for the MultiColumnAssembler.
Module Contents
Classes
MultiColumnAssembler | Merges multiple annotation columns into a single annotation column.
- class MultiColumnAssembler
Merges multiple annotation columns into a single annotation column.
This is useful when multiple annotators produce separate annotation columns (e.g., document_text, document_table from ReaderAssembler) and a downstream annotator (e.g., AutoGGUFVisionModel) expects a single input column containing all annotations.
Annotations from all input columns are collected and concatenated into the output column. The output annotator type defaults to DOCUMENT but can be configured. Each annotation's metadata is preserved, and a source_column key is added to track the original column name.
Note: All input columns must use the Annotation schema. Columns using the AnnotationImage schema (e.g., IMAGE-typed columns from ReaderAssembler) are not supported.
Input Annotation types | Output Annotation type
DOCUMENT | DOCUMENT
- Parameters:
- inputCols
Input annotation columns to merge
- outputCol
Output annotation column name
- outputAsAnnotatorType
The annotator type to use for the output annotations (Default: document)
- sortByBegin
Whether to sort merged annotations by their begin position (Default: False)
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler1 = DocumentAssembler() \
...     .setInputCol("text1") \
...     .setOutputCol("document_text")
>>> documentAssembler2 = DocumentAssembler() \
...     .setInputCol("text2") \
...     .setOutputCol("document_table")
>>> multiColumnAssembler = MultiColumnAssembler() \
...     .setInputCols(["document_text", "document_table"]) \
...     .setOutputCol("merged_document")
>>> data = spark.createDataFrame([("Hello world", "Name | Age")]).toDF("text1", "text2")
>>> pipeline = Pipeline().setStages([documentAssembler1, documentAssembler2, multiColumnAssembler]).fit(data)
>>> result = pipeline.transform(data)
>>> result.selectExpr("merged_document.result").show(truncate=False)
+---------------------------+
|result                     |
+---------------------------+
|[Hello world, Name | Age]  |
+---------------------------+
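The merge behaviour described for this class (collect annotations from every input column, keep their metadata, add a source_column key, optionally sort by begin) can be modelled in plain Python. The following is only a conceptual sketch and not the Spark NLP implementation: merge_annotation_columns is a hypothetical helper, annotations are represented as plain dicts, and it assumes that columns are concatenated in the order given and that the output annotator type overwrites the type of each merged annotation.

```python
def merge_annotation_columns(row, input_cols, output_type="document", sort_by_begin=False):
    """Hypothetical model of the merge; not part of the Spark NLP API."""
    merged = []
    for col in input_cols:  # assumption: input column order is kept
        for ann in row[col]:
            merged.append({
                # assumption: every merged annotation takes the output type
                "annotatorType": output_type,
                "begin": ann["begin"],
                "end": ann["end"],
                "result": ann["result"],
                # keep the existing metadata and record the originating column
                "metadata": {**ann.get("metadata", {}), "source_column": col},
            })
    if sort_by_begin:
        # list.sort is stable, so ties keep their column order
        merged.sort(key=lambda a: a["begin"])
    return merged


row = {
    "document_text": [
        {"begin": 0, "end": 10, "result": "Hello world", "metadata": {"sentence": "0"}}
    ],
    "document_table": [
        {"begin": 0, "end": 9, "result": "Name | Age", "metadata": {"sentence": "0"}}
    ],
}
merged = merge_annotation_columns(row, ["document_text", "document_table"])
print([a["result"] for a in merged])
print(merged[1]["metadata"])
```

With these inputs the merged results mirror the merged_document.result column of the example above, and the second annotation's metadata carries source_column set to document_table.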
- setInputCols(*value)
Sets input annotation columns to merge.
- Parameters:
- *value : str
Input column names
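The signature is variadic, while the class example passes a single list to setInputCols. A minimal sketch of how a variadic setter can accept both call styles is shown below. normalize_input_cols is a hypothetical helper written for illustration; it is not taken from the Spark NLP source and the library's actual validation may differ.

```python
def normalize_input_cols(*value):
    """Hypothetical helper: turn varargs or a single list into a list of names."""
    if len(value) == 1 and isinstance(value[0], (list, tuple)):
        cols = list(value[0])  # called as f(["a", "b"])
    else:
        cols = list(value)     # called as f("a", "b")
    if not cols or not all(isinstance(c, str) for c in cols):
        raise TypeError("input columns must be given as strings")
    return cols


# Both call styles yield the same list of column names.
as_varargs = normalize_input_cols("document_text", "document_table")
as_list = normalize_input_cols(["document_text", "document_table"])
print(as_varargs == as_list)
```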