sparknlp.base.multi_document_assembler
Module Contents
Classes
MultiDocumentAssembler | Prepares data into a format that is processable by Spark NLP.
- class MultiDocumentAssembler[source]
Prepares data into a format that is processable by Spark NLP.
This is the entry point for every Spark NLP pipeline. The MultiDocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode() can be used to pre-process the text (Default: disabled). For the possible options please refer to the parameters section.
For more extended examples on document pre-processing see the Examples.
Input Annotation types: NONE
Output Annotation type: DOCUMENT
- Parameters:
- inputCols: str or List[str]
Input column names.
- outputCols: str or List[str]
Output column names.
- idCol: str
Name of String type column for row id.
- metadataCol: str
Name of Map type column with metadata information.
- cleanupMode: str
How to clean up the document, by default disabled; a short usage sketch follows this list. Possible values:
disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full
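A minimal sketch of enabling text clean-up is shown here. The column names match the example below; "shrink" is picked only for illustration, and the exact normalization each mode performs should be checked against your Spark NLP version.

>>> # "shrink" is used only for illustration; see the possible values listed above
>>> cleaned_assembler = MultiDocumentAssembler() \
...     .setInputCols(["text", "text2"]) \
...     .setOutputCols(["document1", "document2"]) \
...     .setCleanupMode("shrink")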
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([[
...     "Spark NLP is an open-source text processing library.",
...     "Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark"
... ]]).toDF("text", "text2")
>>> documentAssembler = MultiDocumentAssembler() \
...     .setInputCols(["text", "text2"]) \
...     .setOutputCols(["document1", "document2"])
>>> result = documentAssembler.transform(data)
>>> result.select("document1").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document1                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document1").printSchema()
root
 |-- document1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
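The example above imports Pipeline without using it. As a minimal sketch, the assembler is typically the first stage of a pyspark.ml Pipeline, with downstream annotators (omitted here) reading the document columns:

>>> # Downstream annotators reading "document1"/"document2" would be appended to stages
>>> pipeline = Pipeline(stages=[documentAssembler])
>>> model = pipeline.fit(data)
>>> model.transform(data).select("document1", "document2").show(truncate=False)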
- setInputCols(*value)[source]
Sets column names of input annotations.
- Parameters:
- *value: List[str]
Input columns for the annotator
- setOutputCols(*value)[source]
Sets column names of output annotations.
- Parameters:
- *value: List[str]
List of output columns
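Because both setters take *value, they appear to accept either a list or plain varargs; both styles are sketched here under that assumption. Output columns are paired positionally with input columns, one DOCUMENT column per input column, as in the class-level example above.

>>> # Varargs form (assumed equivalent to passing a list, given the *value signature)
>>> assembler = MultiDocumentAssembler() \
...     .setInputCols("text", "text2") \
...     .setOutputCols("document1", "document2")
>>> # List form, as used in the class-level example
>>> assembler = MultiDocumentAssembler() \
...     .setInputCols(["text", "text2"]) \
...     .setOutputCols(["document1", "document2"])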
- setIdCol(value)[source]
Sets name of string type column for row id.
- Parameters:
- value: str
Name of the Id Column
- setMetadataCol(value)[source]
Sets name for Map type column with metadata information.
- Parameters:
- value: str
Name of the metadata column
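A minimal sketch wiring setIdCol and setMetadataCol together is shown below. The "uid" and "meta" columns are hypothetical and are built here only to satisfy the expected String and Map column types; how their values surface in the annotation metadata should be verified against your Spark NLP version. The sketch reuses the data DataFrame from the class-level example.

>>> # Hypothetical id/metadata columns, built to match the expected String and Map types
>>> from pyspark.sql import functions as F
>>> data_with_meta = data \
...     .withColumn("uid", F.monotonically_increasing_id().cast("string")) \
...     .withColumn("meta", F.create_map(F.lit("source"), F.lit("docs-example")))
>>> assembler = MultiDocumentAssembler() \
...     .setInputCols(["text", "text2"]) \
...     .setOutputCols(["document1", "document2"]) \
...     .setIdCol("uid") \
...     .setMetadataCol("meta")
>>> assembler.transform(data_with_meta).select("document1.metadata").show(truncate=False)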