sparknlp.base.token_assembler
Contains classes for the TokenAssembler.
Module Contents

Classes

TokenAssembler — This transformer reconstructs a DOCUMENT type annotation from tokens.
class TokenAssembler
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Requires DOCUMENT and TOKEN type annotations as input.

For more extended examples on document pre-processing, see the Examples.
Input Annotation types: DOCUMENT, TOKEN

Output Annotation type: DOCUMENT
Parameters:

- preservePosition
    Whether to preserve the actual position of the tokens or reduce them to one space.
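To make the effect of preservePosition concrete, here is a toy pure-Python sketch (not the Spark NLP implementation) of what the flag controls when cleaned tokens are joined back into a document. The (begin, end, text) triples mirror the character-offset layout of Spark NLP annotations; the token values and offsets are made up for illustration.

```python
def assemble(tokens, preserve_position=False):
    """Toy reassembly of tokens into one string.

    tokens: list of (begin, end, text) triples, where begin/end are
    character offsets into the original text, as in Spark NLP annotations.
    """
    if not preserve_position:
        # preservePosition=False: gaps between tokens collapse to one space.
        return " ".join(text for _, _, text in tokens)
    # preservePosition=True: keep every token at its original offset,
    # padding the holes left by removed tokens with spaces.
    pieces, cursor = [], 0
    for begin, _end, text in tokens:
        pieces.append(" " * (begin - cursor))
        pieces.append(text)
        cursor = begin + len(text)
    return "".join(pieces)

# Suppose a cleaning step removed stop words between offsets 9 and 14:
tokens = [(0, 4, "Spark"), (6, 8, "NLP"), (15, 21, "library")]
print(assemble(tokens))                          # Spark NLP library
print(assemble(tokens, preserve_position=True))  # Spark NLP      library
```

With preservePosition=True the surviving tokens keep offsets that still line up with the original text, which matters if later annotators need to map results back to the source document.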
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
First, the text is tokenized and cleaned:

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentences"]) \
...     .setOutputCol("token")
>>> normalizer = Normalizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("normalized") \
...     .setLowercase(False)
>>> stopwordsCleaner = StopWordsCleaner() \
...     .setInputCols(["normalized"]) \
...     .setOutputCol("cleanTokens") \
...     .setCaseSensitive(False)
Then the TokenAssembler turns the cleaned tokens into a DOCUMENT type structure:

>>> tokenAssembler = TokenAssembler() \
...     .setInputCols(["sentences", "cleanTokens"]) \
...     .setOutputCol("cleanText")
>>> data = spark.createDataFrame([["Spark NLP is an open-source text processing library for advanced natural language processing."]]) \
...     .toDF("text")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     normalizer,
...     stopwordsCleaner,
...     tokenAssembler
... ]).fit(data)
>>> result = pipeline.transform(data)
>>> result.select("cleanText").show(truncate=False)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+