`sparknlp.annotator.document_normalizer`

Contains classes for the DocumentNormalizer.

Module Contents

Classes

DocumentNormalizer

Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents.

class DocumentNormalizer

Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document type columns into Sentence.

Removes all dirty characters from text that match one or more input regex patterns. Unwanted characters can be removed according to a specific policy, and the text can optionally be converted to lowercase.

For extended examples of usage, see the `Examples <JohnSnowLabs/spark-nlp>`__.

Input Annotation types	Output Annotation type
`DOCUMENT`	`DOCUMENT`

Parameters:

action: action to perform before applying regex patterns on text, by default "clean"
patterns: normalization regex patterns; text matching them is removed from the document, by default ["<[^>]*>"]
replacement: replacement string to apply when a regex matches, by default " "
lowercase: whether to convert strings to lowercase, by default False
policy: policy used to remove patterns from text, by default "pretty_all"
encoding: file encoding to apply on normalized documents, by default "UTF-8"
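
As a rough illustration of how these parameters interact, the default behaviour can be approximated in plain Python with the `re` module. This is only a sketch, not Spark NLP's implementation: the helper `normalize_text` is hypothetical, Python's regex dialect differs from the JVM's in places, and reading the "pretty_all" policy as "apply every pattern, then collapse runs of whitespace" is an assumption.

```python
import re


def normalize_text(text, patterns=("<[^>]*>",), replacement=" ", lowercase=False):
    """Hypothetical helper that approximates the "clean" action."""
    # Replace every match of every pattern with the replacement string.
    for pattern in patterns:
        text = re.sub(pattern, replacement, text)
    # Assumption: the "pretty" part of the policy collapses whitespace runs.
    text = re.sub(r"\s+", " ", text)
    return text.lower() if lowercase else text


cleaned = normalize_text("<h1>THE WORLD'S LARGEST</h1>\n<p>Web Site</p>", lowercase=True)
print(repr(cleaned))
```

Lowercasing is applied last here so that patterns containing uppercase letters would still match the original text; whether the annotator uses the same order is not stated above.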

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> cleanUpPatterns = ["<[^>]*>"]
>>> documentNormalizer = DocumentNormalizer() \
...     .setInputCols("document") \
...     .setOutputCol("normalizedDocument") \
...     .setAction("clean") \
...     .setPatterns(cleanUpPatterns) \
...     .setReplacement(" ") \
...     .setPolicy("pretty_all") \
...     .setLowercase(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     documentNormalizer
... ])
>>> text = """
... <div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
...     THE WORLD'S LARGEST WEB DEVELOPER SITE
...     <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
...     <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
... </div>
... </div>"""
>>> data = spark.createDataFrame([[text]]).toDF("text")
>>> pipelineModel = pipeline.fit(data)
>>> result = pipelineModel.transform(data)
>>> result.selectExpr("normalizedDocument.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

inputAnnotatorTypes

outputAnnotatorType = 'document'

action

patterns

replacement

lowercase

policy

encoding

presetPattern

autoMode

setAction(value)

Sets the action to perform before applying regex patterns on text, by default "clean".

Parameters:

value (str): Action to perform before applying regex patterns

setPatterns(value)

Sets the normalization regex patterns; text matching them is removed from the document, by default ["<[^>]*>"].

Parameters:

value (List[str]): Normalization regex patterns; text matching them is removed from the document
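
A pattern that is slightly off silently leaves markup in place, so it can help to try candidate patterns on a sample string before passing them to `setPatterns`. The sketch below uses Python's `re` module as a stand-in. Spark NLP evaluates patterns on the JVM, so treat this only as a quick sanity check for simple patterns like these. The sample string is made up for illustration.

```python
import re

sample = '<p style="font-size:160%;">Lorem Ipsum</p>'

# Without a quantifier, "[^>]" must match exactly one character,
# so only three-character tags such as "<p>" or "<b>" would match.
narrow = re.sub(r"<[^>]>", " ", sample)

# With "*", the class matches any run of non-">" characters,
# so opening tags with attributes and closing tags are all matched.
broad = re.sub(r"<[^>]*>", " ", sample)

print(repr(narrow))
print(repr(broad))
```

Here the narrow pattern leaves the sample untouched, while the broad pattern (the documented default) strips both tags.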

setReplacement(value)

Sets the replacement string to apply when a regex matches, by default " ".

Parameters:

value (str): Replacement string to apply when a regex matches
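
The choice of replacement string matters when tags sit directly between words. The following plain-Python sketch (using `re.sub` as an approximation, with a made-up input string) contrasts an empty replacement with the default single space.

```python
import re

html = "<b>Hello</b>world<br>again"
tag = r"<[^>]*>"

# An empty replacement deletes the tags outright, which fuses neighbouring words.
fused = re.sub(tag, "", html)

# A single space keeps word boundaries at the cost of some extra whitespace.
spaced = re.sub(tag, " ", html)

print(repr(fused))
print(repr(spaced))
```

A space replacement preserves word boundaries, which is presumably why it is the default; any surplus whitespace can then be tidied by the removal policy.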

setLowercase(value)

Sets whether to convert strings to lowercase, by default False.

Parameters:

value (bool): Whether to convert strings to lowercase, by default False

setPolicy(value)

Sets the policy used to remove patterns from text, by default "pretty_all".

Parameters:

value (str): Policy used to remove patterns from text, by default "pretty_all"

setEncoding(value)

Sets the file encoding to apply on normalized documents, by default "UTF-8".

Parameters:

value (str): File encoding to apply on normalized documents, by default "UTF-8"

setPresetPattern(value)

Sets a single text cleaning preset pattern.

Parameters:

value (str): Preset cleaning pattern name, e.g., 'CLEAN_BULLETS', 'CLEAN_DASHES'.

setAutoMode(value)

Sets an automatic text cleaning mode using predefined groups of cleaning functions.

Parameters:

value (str): Auto cleaning mode, e.g., 'light_clean', 'document_clean', 'social_clean', 'html_clean', 'full_auto'.
