sparknlp.base.multi_document_assembler#

Module Contents#

Classes#

MultiDocumentAssembler

Prepares data into a format that is processable by Spark NLP.

class MultiDocumentAssembler[source]#

Prepares data into a format that is processable by Spark NLP.

This is the entry point for every Spark NLP pipeline. The MultiDocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode() can be used to pre-process the text (default: disabled). For the possible options, please refer to the Parameters section.

For more extended examples on document pre-processing see the Examples.

Input Annotation types: NONE

Output Annotation type: DOCUMENT

Parameters:
inputCols: str or List[str]

Input column names.

outputCols: str or List[str]

Output column names.

idCol: str

Name of String type column for row id.

metadataCol: str

Name of Map type column with metadata information.

cleanupMode: str

How to clean up the document; disabled by default. Possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([["Spark NLP is an open-source text processing library.", "Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark"]]).toDF("text", "text2")
>>> documentAssembler = MultiDocumentAssembler().setInputCols(["text", "text2"]).setOutputCols(["document1", "document2"])
>>> result = documentAssembler.transform(data)
>>> result.select("document1").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document1                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document1").printSchema()
root
|-- document1: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- annotatorType: string (nullable = true)
|    |    |-- begin: integer (nullable = false)
|    |    |-- end: integer (nullable = false)
|    |    |-- result: string (nullable = true)
|    |    |-- metadata: map (nullable = true)
|    |    |    |-- key: string
|    |    |    |-- value: string (valueContainsNull = true)
|    |    |-- embeddings: array (nullable = true)
|    |    |    |-- element: float (containsNull = false)
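
Because MultiDocumentAssembler is a regular Spark ML Transformer, it can also sit as the first stage of a pyspark.ml Pipeline. The following is a minimal sketch continuing from the example above; the stage list contains only the assembler and is purely illustrative:

>>> pipeline = Pipeline().setStages([documentAssembler])
>>> pipeline_model = pipeline.fit(data)
>>> annotated = pipeline_model.transform(data)  # adds the "document1" and "document2" columns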
setInputCols(*value)[source]#

Sets column names of input annotations.

Parameters:
*value: List[str]

Input columns for the annotator

setOutputCols(*value)[source]#

Sets column names of output annotations.

Parameters:
*value: List[str]

List of output columns

setIdCol(value)[source]#

Sets name of string type column for row id.

Parameters:
value: str

Name of the Id Column

setMetadataCol(value)[source]#

Sets name for Map type column with metadata information.

Parameters:
value: str

Name of the metadata column
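
A minimal sketch of setIdCol() and setMetadataCol() used together; the column names id and meta below are only illustrative, and the Python dict column is inferred by Spark as a Map type:

>>> data = spark.createDataFrame(
...     [("row-1", {"source": "example"}, "First text.", "Second text.")],
...     ["id", "meta", "text", "text2"])
>>> assembler = MultiDocumentAssembler() \
...     .setInputCols(["text", "text2"]) \
...     .setOutputCols(["document1", "document2"]) \
...     .setIdCol("id") \
...     .setMetadataCol("meta")
>>> result = assembler.transform(data)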

setCleanupMode(value)[source]#

Sets how to clean up the document; disabled by default. Possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full.

Parameters:
value: str

Cleanup mode
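
A minimal sketch of setCleanupMode(), assuming the shrink mode collapses newlines, tabs, and repeated spaces before the text is wrapped into the document annotation:

>>> messy = spark.createDataFrame(
...     [["Spark NLP\n\n  is   an   NLP  library.", "Second\tdocument   text."]],
...     ["text", "text2"])
>>> assembler = MultiDocumentAssembler() \
...     .setInputCols(["text", "text2"]) \
...     .setOutputCols(["document1", "document2"]) \
...     .setCleanupMode("shrink")
>>> cleaned = assembler.transform(messy)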

getOutputCols()[source]#

Gets output column names of annotations.