sparknlp.annotator.normalizer#
Contains classes for the Normalizer.
Module Contents#
Classes#
- Normalizer: Annotator that cleans out tokens. Removes dirty characters from text following a regex pattern and transforms words based on a provided dictionary.
- NormalizerModel: Instantiated Model of the Normalizer.
- class Normalizer[source]#
Annotator that cleans out tokens. Requires stems, hence tokens. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.
For extended examples of usage, see the Examples.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- cleanupPatterns
Normalization regex patterns; matches will be removed from the token, by default ["[^\pL+]"]
- lowercase
Whether to convert strings to lowercase, by default False
- slangDictionary
Slang dictionary is a delimited text file. Needs 'delimiter' in options
- minLength
The minimum allowed length for each token, by default 0
- maxLength
The maximum allowed length for each token
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> normalizer = Normalizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("normalized") \
...     .setLowercase(True) \
...     .setCleanupPatterns(["""[^\w\d\s]"""])
The pattern removes punctuation (keeps alphanumeric characters). If cleanupPatterns is not set, only alphabetic characters are kept ([^A-Za-z]).
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     normalizer
... ])
>>> data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much."]]) \
...     .toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("normalized.result").show(truncate = False)
+----------------------------------------------------------------------------------------+
|result                                                                                  |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
- setCleanupPatterns(value)[source]#
Sets the normalization regex patterns; matches will be removed from the token, by default ["[^\pL+]"].
- Parameters:
- value : List[str]
Normalization regex patterns which match will be removed from token
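The matching semantics can be sketched in plain Python: every substring matching any of the patterns is stripped from each token. Note that Python's `re` module does not support the Java-style `\pL` Unicode property used in the default pattern, so this sketch uses the `[^\w\d\s]` pattern from the example above; the helper name is illustrative, not part of the library.

```python
import re

def apply_cleanup_patterns(tokens, patterns):
    """Plain-Python sketch of the cleanupPatterns semantics:
    remove every match of every pattern from each token."""
    cleaned = []
    for token in tokens:
        for pattern in patterns:
            token = re.sub(pattern, "", token)  # strip every match
        if token:  # drop tokens that become empty
            cleaned.append(token)
    return cleaned

print(apply_cleanup_patterns(["don't", "support!", "other..."], [r"[^\w\d\s]"]))
# ['dont', 'support', 'other']
```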
- setLowercase(value)[source]#
Sets whether to convert strings to lowercase, by default False.
- Parameters:
- value : bool
Whether to convert strings to lowercase, by default False
- setSlangDictionary(path, delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets the slang dictionary from a delimited text file. Needs 'delimiter' in options.
- Parameters:
- path : str
Path to the source files
- delimiter : str
Delimiter for the values
- read_as : str, optional
How to read the file, by default ReadAs.TEXT
- options : dict, optional
Options to read the resource, by default {"format": "text"}
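A minimal sketch of the dictionary file this method expects, assuming (per the description above) one slang-to-standard mapping per line with the two values separated by the chosen delimiter. The pairs and file name here are made up for illustration.

```python
import os
import tempfile

# Hypothetical slang-to-standard pairs for the dictionary file.
pairs = [("gr8", "great"), ("u", "you"), ("thx", "thanks")]

path = os.path.join(tempfile.mkdtemp(), "slang.txt")
with open(path, "w") as f:
    for slang, standard in pairs:
        f.write(f"{slang},{standard}\n")  # one delimited mapping per line

# Reading the file back the way the annotator would interpret it:
with open(path) as f:
    mapping = dict(line.rstrip("\n").split(",", 1) for line in f)
print(mapping)
# {'gr8': 'great', 'u': 'you', 'thx': 'thanks'}
```

The resulting path would then be passed along with the matching delimiter, e.g. `normalizer.setSlangDictionary(path, ",")`.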
- class NormalizerModel(classname='com.johnsnowlabs.nlp.annotators.NormalizerModel', java_model=None)[source]#
Instantiated Model of the Normalizer.
This is the instantiated model of the
Normalizer. For training your own model, please see the documentation of that class.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- cleanupPatterns
Normalization regex patterns; matches will be removed from the token
- lowercase
Whether to convert strings to lowercase
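The end-to-end behavior of a fitted model can be sketched in plain Python: strip cleanup-pattern matches, optionally lowercase, and filter by token length. This is an illustration only, not the library's implementation; the parameter defaults mirror the documented ones, and `[^\w\d\s]` stands in for the Java-style default pattern, which Python's `re` cannot parse.

```python
import re

def normalize(tokens, cleanup_patterns=(r"[^\w\d\s]",),
              lowercase=False, min_length=0, max_length=None):
    """Plain-Python sketch of what a fitted NormalizerModel does
    to each token."""
    out = []
    for token in tokens:
        for pattern in cleanup_patterns:
            token = re.sub(pattern, "", token)  # remove pattern matches
        if lowercase:
            token = token.lower()
        # keep only tokens within the allowed length range
        if len(token) >= min_length and (max_length is None or len(token) <= max_length):
            out.append(token)
    return out

print(normalize(["John", "don't", "support!"], lowercase=True))
# ['john', 'dont', 'support']
```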