`sparknlp.base.document_assembler`#

Contains classes for the DocumentAssembler.

Module Contents#

Classes#

DocumentAssembler

Prepares data into a format that is processable by Spark NLP.

class DocumentAssembler[source]#

Prepares data into a format that is processable by Spark NLP.

This is the entry point for every Spark NLP pipeline. The DocumentAssembler reads String columns. Additionally, setCleanupMode() can be used to pre-process the text (default: disabled). For the possible options, refer to the Parameters section.

For more extended examples on document pre-processing see the Examples.

Input Annotation types	Output Annotation type
`NONE`	`DOCUMENT`

Parameters:

inputCol: Input column name
outputCol: Output column name
idCol: Name of String type column for row id.
metadataCol: Name of Map type column with metadata information
cleanupMode: How to clean up the document; disabled by default. Possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full
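
As a rough sketch of how the parameters above fit together, the snippet below collects them in a small settings object and applies them through the setters documented on this page. `AssemblerSettings`, `setter_calls` and `build_document_assembler` are hypothetical helper names, not part of Spark NLP, and the `"document"` default for the output column is an assumption made for illustration. The Spark NLP import is deferred so the settings can be inspected without a Spark session.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AssemblerSettings:
    """Hypothetical container mirroring the DocumentAssembler parameters."""

    input_col: str
    output_col: str = "document"  # assumed default, chosen for this sketch
    id_col: Optional[str] = None
    metadata_col: Optional[str] = None
    cleanup_mode: str = "disabled"

    def setter_calls(self) -> List[Tuple[str, str]]:
        """Return (setter name, value) pairs for every parameter that is set."""
        calls = [
            ("setInputCol", self.input_col),
            ("setOutputCol", self.output_col),
            ("setIdCol", self.id_col),
            ("setMetadataCol", self.metadata_col),
            ("setCleanupMode", self.cleanup_mode),
        ]
        return [(name, value) for name, value in calls if value is not None]


def build_document_assembler(settings: AssemblerSettings):
    """Apply the settings to a new DocumentAssembler (requires Spark NLP)."""
    from sparknlp.base import DocumentAssembler  # deferred import

    assembler = DocumentAssembler()
    for name, value in settings.setter_calls():
        getattr(assembler, name)(value)
    return assembler


settings = AssemblerSettings(input_col="text", id_col="id", cleanup_mode="shrink")
print(settings.setter_calls())
```

With Spark NLP installed and a session started, `build_document_assembler(settings)` would return an assembler configured with these values; optional columns left as `None` are simply not set.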

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> result = documentAssembler.transform(data)
>>> result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document").printSchema()
root
|-- document: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- annotatorType: string (nullable = true)
|    |    |-- begin: integer (nullable = false)
|    |    |-- end: integer (nullable = false)
|    |    |-- result: string (nullable = true)
|    |    |-- metadata: map (nullable = true)
|    |    |    |-- key: string
|    |    |    |-- value: string (valueContainsNull = true)
|    |    |-- embeddings: array (nullable = true)
|    |    |    |-- element: float (containsNull = false)
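
To make the fields in the output above easier to read, here is a plain-Python stand-in for one element of the `document` column. It is only a sketch: `DocumentAnnotation` and `as_document` are hypothetical names, not Spark NLP classes, and the inclusive `end` offset is inferred from the example output, where a 52-character string yields `end = 51`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DocumentAnnotation:
    """Plain-Python stand-in for one element of the `document` column."""

    annotatorType: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)


def as_document(text: str) -> DocumentAnnotation:
    """Wrap a non-empty string the way the example output shows it."""
    return DocumentAnnotation(
        annotatorType="document",
        begin=0,
        end=len(text) - 1,  # inclusive index of the last character
        result=text,
        metadata={"sentence": "0"},
    )


text = "Spark NLP is an open-source text processing library."
doc = as_document(text)
print(doc.begin, doc.end)
```

Because `end` is inclusive in this model, `doc.result[doc.begin:doc.end + 1]` recovers the full text.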

outputAnnotatorType = 'document'[source]#

inputCol[source]#

outputCol[source]#

idCol[source]#

metadataCol[source]#

cleanupMode[source]#

name = 'DocumentAssembler'[source]#

setParams()[source]#

setInputCol(value)[source]#

Sets input column name.

Parameters:

value (str): Name of the input column

setOutputCol(value)[source]#

Sets output column name.

Parameters:

value (str): Name of the output column

setIdCol(value)[source]#

Sets name of string type column for row id.

Parameters:

value (str): Name of the id column

setMetadataCol(value)[source]#

Sets name for Map type column with metadata information.

Parameters:

value (str): Name of the metadata column

setCleanupMode(value)[source]#

Sets how to clean up the document; disabled by default. Possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full

Parameters:

value (str): Cleanup mode
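
Since the mode is passed as a free-form string, it can help to check it against the documented values before calling the setter. The helper below is a minimal sketch; `CLEANUP_MODES` and `check_cleanup_mode` are hypothetical names, and whether the library performs its own validation is not stated on this page.

```python
# Values copied from the setCleanupMode() description on this page.
CLEANUP_MODES = (
    "disabled",
    "inplace",
    "inplace_full",
    "shrink",
    "shrink_full",
    "each",
    "each_full",
    "delete_full",
)


def check_cleanup_mode(mode: str) -> str:
    """Return the mode unchanged if it is documented, else raise ValueError."""
    if mode not in CLEANUP_MODES:
        raise ValueError(
            f"Unknown cleanup mode {mode!r}; expected one of: "
            + ", ".join(CLEANUP_MODES)
        )
    return mode


print(check_cleanup_mode("shrink"))
```

A call such as `DocumentAssembler().setCleanupMode(check_cleanup_mode("shrink"))` would then fail early on a typo like `"shrnk"`.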

getOutputCol()[source]#: Gets output column name of annotations.
