sparknlp.annotator.normalizer#

Contains classes for the Normalizer.

Module Contents#

Classes#

Normalizer

Annotator that cleans out tokens. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.

NormalizerModel

Instantiated Model of the Normalizer.

class Normalizer[source]#

Annotator that cleans out tokens. It requires tokens as input, removes all dirty characters from the text following a regex pattern, and transforms words based on a provided dictionary.

For extended examples of usage, see the Examples.

Input Annotation types: TOKEN

Output Annotation type: TOKEN

Parameters:
cleanupPatterns

Normalization regex patterns; substrings that match are removed from the token, by default ["[^\pL+]"]

lowercase

Whether to convert strings to lowercase, by default False

slangDictionary

Slang dictionary given as a delimited text file; needs 'delimiter' set in options

minLength

The minimum allowed length for each token, by default 0

maxLength

The maximum allowed length for each token

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> normalizer = Normalizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("normalized") \
...     .setLowercase(True) \
...     .setCleanupPatterns(["""[^\w\d\s]"""])

This pattern removes punctuation but keeps alphanumeric characters. If cleanupPatterns is not set, the default pattern keeps only letters (a sketch of the default behavior follows the example output below).

>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     normalizer
... ])
>>> data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much."]]) \
...     .toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("normalized.result").show(truncate=False)
+----------------------------------------------------------------------------------------+
|result                                                                                  |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
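As referenced above, a minimal sketch of a Normalizer left at its defaults (illustration only, separate from the pipeline above). With the default pattern ["[^\pL+]"] only letters survive: a token like don't becomes dont, digits are stripped, and casing is preserved since lowercase defaults to False:

>>> defaultNormalizer = Normalizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("normalized")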
setCleanupPatterns(value)[source]#

Sets the normalization regex patterns; substrings that match are removed from the token, by default ["[^\pL+]"].

Parameters:
value : List[str]

Normalization regex patterns; substrings that match are removed from the token

setLowercase(value)[source]#

Sets whether to convert strings to lowercase, by default False.

Parameters:
value : bool

Whether to convert strings to lowercase, by default False

setSlangDictionary(path, delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#

Sets the slang dictionary from a delimited text file. Needs 'delimiter' set in options.

Parameters:
path : str

Path to the source files

delimiter : str

Delimiter for the values

read_as : str, optional

How to read the file, by default ReadAs.TEXT

options : dict, optional

Options to read the resource, by default {"format": "text"}
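A minimal sketch of loading a slang dictionary, assuming a hypothetical comma-delimited file slangs.txt in which each line maps a slang term to its replacement (for example gr8,great):

>>> slangNormalizer = Normalizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("normalized") \
...     .setSlangDictionary("slangs.txt", ",")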

setMinLength(value)[source]#

Sets the minimum allowed length for each token, by default 0.

Parameters:
value : int

Minimum allowed length for each token.

setMaxLength(value)[source]#

Sets the maximum allowed length for each token.

Parameters:
value : int

Maximum allowed length for each token.
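For example, to keep only normalized tokens between 3 and 10 characters long (bounds chosen purely for illustration):

>>> boundedNormalizer = Normalizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("normalized") \
...     .setMinLength(3) \
...     .setMaxLength(10)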

class NormalizerModel(classname='com.johnsnowlabs.nlp.annotators.NormalizerModel', java_model=None)[source]#

Instantiated Model of the Normalizer.

This is the instantiated model of the Normalizer. For training your own model, please see the documentation of that class.

Input Annotation types: TOKEN

Output Annotation type: TOKEN

Parameters:
cleanupPatterns

Normalization regex patterns; substrings that match are removed from the token

lowercase

Whether to convert strings to lowercase
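Since a NormalizerModel is produced by fitting a Normalizer, one way to obtain it is from the stages of a fitted pipeline. A minimal sketch, reusing the pipeline and data from the example above:

>>> pipelineModel = pipeline.fit(data)
>>> normalizerModel = pipelineModel.stages[2]  # documentAssembler, tokenizer, then the fitted NormalizerModel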