sparknlp.annotator.normalizer#
Contains classes for the Normalizer.
Module Contents#
Classes#
Normalizer — Annotator that cleans out tokens. Requires stems, hence tokens. Removes all dirty characters from text following a regex pattern.
NormalizerModel — Instantiated Model of the Normalizer.
- class Normalizer[source]#
Annotator that cleans out tokens. Requires stems, hence tokens. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.
For extended examples of usage, see the Examples.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- cleanupPatterns
Normalization regex patterns; matches are removed from the token, by default ['[^\pL+]']
- lowercase
Whether to convert strings to lowercase, by default False
- slangDictionary
Slang dictionary, provided as a delimited text file. Requires 'delimiter' to be set in options
- minLength
The minimum allowed length for each token, by default 0
- maxLength
The maximum allowed length for each token
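The minLength/maxLength parameters simply drop tokens whose character count falls outside the allowed range. A minimal pure-Python sketch of that filtering behavior (the bounds 3 and 10 are arbitrary illustration values, not Spark NLP defaults):

```python
# Illustrative length filter mirroring minLength/maxLength semantics.
tokens = ["a", "an", "john", "normalization"]
min_length, max_length = 3, 10  # assumed example bounds

# Keep only tokens whose length is within [min_length, max_length].
kept = [t for t in tokens if min_length <= len(t) <= max_length]
print(kept)  # ['john']
```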
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> normalizer = Normalizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("normalized") \
...     .setLowercase(True) \
...     .setCleanupPatterns(["""[^\w\d\s]"""])
The pattern removes punctuation (keeping alphanumeric characters). If CleanupPatterns is not set, only alphabet letters are kept ([^A-Za-z]).
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     normalizer
... ])
>>> data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much."]]) \
...     .toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("normalized.result").show(truncate = False)
+----------------------------------------------------------------------------------------+
|result                                                                                  |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
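Outside of Spark, the effect of the cleanup pattern in the example above can be reproduced with plain regex substitution. This sketch only illustrates what the pattern matches; it is not how the annotator is implemented internally:

```python
import re

# The cleanup pattern from the example: every character that is not a
# word character, digit, or whitespace is stripped from each token.
pattern = re.compile(r"[^\w\d\s]")

tokens = ["However", "don't", "brothers."]
# Strip matched characters, then lowercase (as setLowercase(True) would).
normalized = [pattern.sub("", t).lower() for t in tokens]
print(normalized)  # ['however', 'dont', 'brothers']
```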
- setCleanupPatterns(value)[source]#
Sets the normalization regex patterns; matches are removed from the token, by default ['[^\pL+]'].
- Parameters:
- value : List[str]
Normalization regex patterns; matches are removed from the token
- setLowercase(value)[source]#
Sets whether to convert strings to lowercase, by default False.
- Parameters:
- value : bool
Whether to convert strings to lowercase, by default False
- setSlangDictionary(path, delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets the slang dictionary, a delimited text file. Requires 'delimiter' to be set in options.
- Parameters:
- path : str
Path to the source files
- delimiter : str
Delimiter for the values
- read_as : str, optional
How to read the file, by default ReadAs.TEXT
- options : dict, optional
Options to read the resource, by default {"format": "text"}
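The slang dictionary file is expected to hold one slang/replacement pair per line, separated by the given delimiter. A rough pure-Python illustration of that lookup (the file contents here are hypothetical, and the actual parsing happens inside Spark NLP):

```python
# Hypothetical slang dictionary file contents, delimiter "," :
#   gr8,great
#   u,you
slang_lines = ["gr8,great", "u,you"]

# Build a lookup table from slang term to replacement.
slang = dict(line.split(",", 1) for line in slang_lines)

tokens = ["u", "r", "gr8"]
# Replace tokens found in the dictionary; leave others unchanged.
replaced = [slang.get(t, t) for t in tokens]
print(replaced)  # ['you', 'r', 'great']
```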
- class NormalizerModel(classname='com.johnsnowlabs.nlp.annotators.NormalizerModel', java_model=None)[source]#
Instantiated Model of the Normalizer.
This is the instantiated model of the Normalizer. For training your own model, please see the documentation of that class.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- cleanupPatterns
Normalization regex patterns; matches are removed from the token
- lowercase
Whether to convert strings to lowercase