sparknlp.base.multi_document_assembler
Module Contents
Classes
MultiDocumentAssembler | Prepares data into a format that is processable by Spark NLP.
- class MultiDocumentAssembler[source]
Prepares data into a format that is processable by Spark NLP.
This is the entry point for every Spark NLP pipeline. The MultiDocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode() can be used to pre-process the text (Default: disabled). For the possible options please refer to the parameters section.
For more extended examples on document pre-processing see the Examples.
Input Annotation types: NONE
Output Annotation type: DOCUMENT
- Parameters:
- inputCols: str or List[str]
Input column names.
- outputCols: str or List[str]
Output column names.
- idCol: str
Name of String type column for row id.
- metadataCol: str
Name of Map type column with metadata information.
- cleanupMode: str
How to clean up the document, by default disabled; a short usage sketch follows this list. Possible values:
disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full
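A minimal sketch of enabling text clean-up is shown here. The column names match the example below; "shrink" is picked only for illustration, and the exact normalization each mode performs should be checked against your Spark NLP version.

>>> # "shrink" is used only for illustration; see the possible values listed above
>>> cleaned_assembler = MultiDocumentAssembler() \
...     .setInputCols(["text", "text2"]) \
...     .setOutputCols(["document1", "document2"]) \
...     .setCleanupMode("shrink")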
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([[
...     "Spark NLP is an open-source text processing library.",
...     "Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark"
... ]]).toDF("text", "text2")
>>> documentAssembler = MultiDocumentAssembler() \
...     .setInputCols(["text", "text2"]) \
...     .setOutputCols(["document1", "document2"])
>>> result = documentAssembler.transform(data)
>>> result.select("document1").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document1                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document1").printSchema()
root
 |-- document1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
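The example above imports Pipeline without using it. As a minimal sketch, the assembler is typically the first stage of a pyspark.ml Pipeline, with downstream annotators (omitted here) reading the document columns:

>>> # Downstream annotators reading "document1"/"document2" would be appended to stages
>>> pipeline = Pipeline(stages=[documentAssembler])
>>> model = pipeline.fit(data)
>>> model.transform(data).select("document1", "document2").show(truncate=False)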
- setInputCols(*value)[source]
Sets column names of input annotations.
- Parameters:
- *value: List[str]
Input columns for the annotator
- setOutputCols(*value)[source]
Sets column names of output annotations.
- Parameters:
- *value: List[str]
List of output columns
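Because both setters take *value, they appear to accept either a list or plain varargs; both styles are sketched here under that assumption. Output columns are paired positionally with input columns, one DOCUMENT column per input column, as in the class-level example above.

>>> # Varargs form (assumed equivalent to passing a list, given the *value signature)
>>> assembler = MultiDocumentAssembler() \
...     .setInputCols("text", "text2") \
...     .setOutputCols("document1", "document2")
>>> # List form, as used in the class-level example
>>> assembler = MultiDocumentAssembler() \
...     .setInputCols(["text", "text2"]) \
...     .setOutputCols(["document1", "document2"])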
- setIdCol(value)[source]
Sets name of string type column for row id.
- Parameters:
- value: str
Name of the Id Column
- setMetadataCol(value)[source]
Sets name for Map type column with metadata information.
- Parameters:
- value: str
Name of the metadata column
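A minimal sketch wiring setIdCol and setMetadataCol together is shown below. The "uid" and "meta" columns are hypothetical and are built here only to satisfy the expected String and Map column types; how their values surface in the annotation metadata should be verified against your Spark NLP version. The sketch reuses the data DataFrame from the class-level example.

>>> # Hypothetical id/metadata columns, built to match the expected String and Map types
>>> from pyspark.sql import functions as F
>>> data_with_meta = data \
...     .withColumn("uid", F.monotonically_increasing_id().cast("string")) \
...     .withColumn("meta", F.create_map(F.lit("source"), F.lit("docs-example")))
>>> assembler = MultiDocumentAssembler() \
...     .setInputCols(["text", "text2"]) \
...     .setOutputCols(["document1", "document2"]) \
...     .setIdCol("uid") \
...     .setMetadataCol("meta")
>>> assembler.transform(data_with_meta).select("document1.metadata").show(truncate=False)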