sparknlp.annotator.normalizer#
Contains classes for the Normalizer.
Module Contents#
Classes#
- Normalizer: Annotator that cleans out tokens. Removes dirty characters from text following a regex pattern and transforms words based on a provided dictionary.
- NormalizerModel: Instantiated Model of the Normalizer.
- class Normalizer[source]#
Annotator that cleans out tokens. Requires stems, hence tokens. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.
For extended examples of usage, see the Examples.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- cleanupPatterns
Normalization regex patterns; matches will be removed from the token, by default ["[^\pL+]"]
- lowercase
Whether to convert strings to lowercase, by default False
- slangDictionary
Slang dictionary is a delimited text file. Needs 'delimiter' in options
- minLength
The minimum allowed length for each token, by default 0
- maxLength
The maximum allowed length for each token
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> normalizer = Normalizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("normalized") \
...     .setLowercase(True) \
...     .setCleanupPatterns(["""[^\w\d\s]"""])
The pattern removes punctuation (keeps alphanumeric characters). If cleanupPatterns is not set, only alphabetic characters are kept ([^A-Za-z]).
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     normalizer
... ])
>>> data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much."]]) \
...     .toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("normalized.result").show(truncate = False)
+----------------------------------------------------------------------------------------+
|result                                                                                  |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
- setCleanupPatterns(value)[source]#
Sets the normalization regex patterns; matches will be removed from the token, by default ["[^\pL+]"].
- Parameters:
- value : List[str]
Normalization regex patterns which match will be removed from token
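The matching semantics can be sketched in plain Python: every substring matching any of the patterns is stripped from each token. Note that Python's `re` module does not support the Java-style `\pL` Unicode property used in the default pattern, so this sketch uses the `[^\w\d\s]` pattern from the example above; the helper name is illustrative, not part of the library.

```python
import re

def apply_cleanup_patterns(tokens, patterns):
    """Plain-Python sketch of the cleanupPatterns semantics:
    remove every match of every pattern from each token."""
    cleaned = []
    for token in tokens:
        for pattern in patterns:
            token = re.sub(pattern, "", token)  # strip every match
        if token:  # drop tokens that become empty
            cleaned.append(token)
    return cleaned

print(apply_cleanup_patterns(["don't", "support!", "other..."], [r"[^\w\d\s]"]))
# ['dont', 'support', 'other']
```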
- setLowercase(value)[source]#
Sets whether to convert strings to lowercase, by default False.
- Parameters:
- value : bool
Whether to convert strings to lowercase, by default False
- setSlangDictionary(path, delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets the slang dictionary from a delimited text file. Needs 'delimiter' in options.
- Parameters:
- path : str
Path to the source files
- delimiter : str
Delimiter for the values
- read_as : str, optional
How to read the file, by default ReadAs.TEXT
- options : dict, optional
Options to read the resource, by default {"format": "text"}
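A minimal sketch of the dictionary file this method expects, assuming (per the description above) one slang-to-standard mapping per line with the two values separated by the chosen delimiter. The pairs and file name here are made up for illustration.

```python
import os
import tempfile

# Hypothetical slang-to-standard pairs for the dictionary file.
pairs = [("gr8", "great"), ("u", "you"), ("thx", "thanks")]

path = os.path.join(tempfile.mkdtemp(), "slang.txt")
with open(path, "w") as f:
    for slang, standard in pairs:
        f.write(f"{slang},{standard}\n")  # one delimited mapping per line

# Reading the file back the way the annotator would interpret it:
with open(path) as f:
    mapping = dict(line.rstrip("\n").split(",", 1) for line in f)
print(mapping)
# {'gr8': 'great', 'u': 'you', 'thx': 'thanks'}
```

The resulting path would then be passed along with the matching delimiter, e.g. `normalizer.setSlangDictionary(path, ",")`.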
- class NormalizerModel(classname='com.johnsnowlabs.nlp.annotators.NormalizerModel', java_model=None)[source]#
Instantiated Model of the Normalizer.
This is the instantiated model of the
Normalizer. For training your own model, please see the documentation of that class.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- cleanupPatterns
Normalization regex patterns; matches will be removed from the token
- lowercase
Whether to convert strings to lowercase
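The end-to-end behavior of a fitted model can be sketched in plain Python: strip cleanup-pattern matches, optionally lowercase, and filter by token length. This is an illustration only, not the library's implementation; the parameter defaults mirror the documented ones, and `[^\w\d\s]` stands in for the Java-style default pattern, which Python's `re` cannot parse.

```python
import re

def normalize(tokens, cleanup_patterns=(r"[^\w\d\s]",),
              lowercase=False, min_length=0, max_length=None):
    """Plain-Python sketch of what a fitted NormalizerModel does
    to each token."""
    out = []
    for token in tokens:
        for pattern in cleanup_patterns:
            token = re.sub(pattern, "", token)  # remove pattern matches
        if lowercase:
            token = token.lower()
        # keep only tokens within the allowed length range
        if len(token) >= min_length and (max_length is None or len(token) <= max_length):
            out.append(token)
    return out

print(normalize(["John", "don't", "support!"], lowercase=True))
# ['john', 'dont', 'support']
```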