sparknlp.base.document_assembler

Contains classes for the DocumentAssembler.

Module Contents

Classes

DocumentAssembler

Prepares data into a format that is processable by Spark NLP.

class DocumentAssembler

Prepares data into a format that is processable by Spark NLP.

This is the entry point for every Spark NLP pipeline. The DocumentAssembler reads String columns and converts them into DOCUMENT annotations. Additionally, setCleanupMode() can be used to pre-process the text (default: disabled). For the possible options, refer to the Parameters section.

For more extended examples on document pre-processing see the Examples.

Input Annotation types: NONE

Output Annotation type: DOCUMENT

Parameters:
inputCol

Input column name

outputCol

Output column name

idCol

Name of String type column for row id.

metadataCol

Name of Map type column with metadata information

cleanupMode

How to clean up the document (default: disabled). Possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full

Examples

>>> import sparknlp
>>> spark = sparknlp.start()
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> result = documentAssembler.transform(data)
>>> result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document").printSchema()
root
|-- document: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- annotatorType: string (nullable = true)
|    |    |-- begin: integer (nullable = false)
|    |    |-- end: integer (nullable = false)
|    |    |-- result: string (nullable = true)
|    |    |-- metadata: map (nullable = true)
|    |    |    |-- key: string
|    |    |    |-- value: string (valueContainsNull = true)
|    |    |-- embeddings: array (nullable = true)
|    |    |    |-- element: float (containsNull = false)
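
To make the fields of the annotation in the example output easier to read, the following is a plain-Python sketch of the same structure. It is only an illustration and not Spark NLP's implementation: the helper name make_document_annotation is hypothetical, and the field values are copied from the example row above. It shows that begin and end are character offsets, with end being the inclusive index of the last character.

```python
from typing import Any, Dict, List


def make_document_annotation(text: str) -> List[Dict[str, Any]]:
    """Hypothetical stand-in that mirrors the DOCUMENT annotation schema."""
    return [
        {
            "annotatorType": "document",
            "begin": 0,
            # Inclusive index of the last character, as in the example row.
            "end": len(text) - 1,
            "result": text,
            # Metadata values are strings in the schema (map of string to string).
            "metadata": {"sentence": "0"},
            "embeddings": [],
        }
    ]


annotation = make_document_annotation(
    "Spark NLP is an open-source text processing library."
)[0]
print(annotation["begin"], annotation["end"])  # prints: 0 51
```

The sketch covers a single non-empty string only; it does not model cleanup modes, id columns, or metadata columns.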
setInputCol(value)

Sets input column name.

Parameters:
value : str

Name of the input column

setOutputCol(value)

Sets output column name.

Parameters:
value : str

Name of the output column

setIdCol(value)

Sets the name of the String type column for the row id.

Parameters:
value : str

Name of the id column

setMetadataCol(value)

Sets the name of the Map type column with metadata information.

Parameters:
value : str

Name of the metadata column

setCleanupMode(value)

Sets how to clean up the document (default: disabled). Possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full

Parameters:
value : str

Cleanup mode
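
Because the cleanup mode is passed as a plain string, a typo can be caught early by checking it against the values listed above. The following is a minimal sketch, assuming only that list of values; CLEANUP_MODES and validate_cleanup_mode are hypothetical helpers and not part of Spark NLP.

```python
# Hypothetical helper, not part of Spark NLP: the allowed values are
# copied from the list of possible cleanup modes documented above.
CLEANUP_MODES = (
    "disabled",
    "inplace",
    "inplace_full",
    "shrink",
    "shrink_full",
    "each",
    "each_full",
    "delete_full",
)


def validate_cleanup_mode(mode: str) -> str:
    """Return the mode unchanged if it is a documented value, else raise."""
    if mode not in CLEANUP_MODES:
        allowed = ", ".join(CLEANUP_MODES)
        raise ValueError(f"Unknown cleanup mode {mode!r}; expected one of: {allowed}")
    return mode


mode = validate_cleanup_mode("shrink")
print(mode)  # prints: shrink
```

The validated string could then be passed on, for example as documentAssembler.setCleanupMode(mode), using the documentAssembler instance from the Examples section.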

getOutputCol()

Gets output column name of annotations.