sparknlp.base.document_assembler
Contains classes for the DocumentAssembler.
Module Contents
Classes
DocumentAssembler | Prepares data into a format that is processable by Spark NLP.
- class DocumentAssembler
Prepares data into a format that is processable by Spark NLP.
This is the entry point for every Spark NLP pipeline. The DocumentAssembler reads String columns.
Additionally, setCleanupMode() can be used to pre-process the text (default: disabled).
For the possible options, refer to the Parameters section. For more extended examples of document pre-processing, see the Examples.
Input Annotation types | Output Annotation type
NONE | DOCUMENT
- Parameters:
- inputCol
Input column name
- outputCol
Output column name
- idCol
Name of String type column for row id.
- metadataCol
Name of Map type column with metadata information
- cleanupMode
How to clean up the document; disabled by default. Possible values:
disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> result = documentAssembler.transform(data)
>>> result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document").printSchema()
root
 |-- document: array (nullable = True)
 |    |-- element: struct (containsNull = True)
 |    |    |-- annotatorType: string (nullable = True)
 |    |    |-- begin: integer (nullable = False)
 |    |    |-- end: integer (nullable = False)
 |    |    |-- result: string (nullable = True)
 |    |    |-- metadata: map (nullable = True)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = True)
 |    |    |-- embeddings: array (nullable = True)
 |    |    |    |-- element: float (containsNull = False)
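The fields of the DOCUMENT annotation in the output above can be illustrated without a Spark session. The following is a minimal pure-Python sketch, not Spark NLP's implementation: `DocumentAnnotation` and `assemble` are hypothetical names, and the rule that `end` is the inclusive index of the last character is inferred from the example output (a 52-character string yields `begin = 0`, `end = 51`).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DocumentAnnotation:
    """Hypothetical stand-in mirroring the struct fields shown by printSchema()."""
    annotatorType: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)


def assemble(text: str) -> List[DocumentAnnotation]:
    """Wrap one input string in a single 'document' annotation.

    Assumption (inferred from the example output): `begin` is 0 and `end`
    is the inclusive index of the last character of the text.
    """
    return [
        DocumentAnnotation(
            annotatorType="document",
            begin=0,
            end=len(text) - 1,
            result=text,
            metadata={"sentence": "0"},
        )
    ]


text = "Spark NLP is an open-source text processing library."
docs = assemble(text)
print(docs[0].annotatorType, docs[0].begin, docs[0].end)
```

The column holds an array of such structs per row, which is why the example output is wrapped in two pairs of brackets.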
- setIdCol(value)
Sets the name of the String type column for the row id.
- Parameters:
- value : str
Name of the id column
- setMetadataCol(value)
Sets the name of the Map type column with metadata information.
- Parameters:
- value : str
Name of the metadata column