sparknlp.annotator.document_normalizer

Contains classes for the DocumentNormalizer

Module Contents

Classes

DocumentNormalizer

Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document type columns into Sentence.

class DocumentNormalizer

Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document type columns into Sentence.

Removes unwanted characters from the text according to one or more input regex patterns. Matches can be removed under a specific policy, and the text can optionally be converted to lowercase.

For extended examples of usage, see the `Examples <JohnSnowLabs/spark-nlp>`__.
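The behaviour described above can be sketched in plain Python with the standard `re` module. This is only an illustrative stand-in, not the Spark NLP implementation: the function name `normalize_document` and its signature are hypothetical, and the annotator itself runs on Spark with JVM regular expressions.

```python
import re

def normalize_document(text, patterns=("<[^>]*>",), replacement=" ", lowercase=False):
    """Hypothetical helper mimicking the documented steps; not part of sparknlp."""
    for pattern in patterns:
        # Replace every match of each regex pattern with the replacement string.
        text = re.sub(pattern, replacement, text)
    # Optionally lowercase the cleaned text.
    return text.lower() if lowercase else text

cleaned = normalize_document("<p>Hello <b>World</b></p>", lowercase=True)
print(repr(cleaned))
```

Each removed tag leaves one replacement character behind, so the sketch keeps runs of spaces; any whitespace cleanup is assumed to be the job of the removal policy.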

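====================== ======================
Input Annotation types Output Annotation type
====================== ======================
DOCUMENT               DOCUMENT
====================== ======================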

Parameters:
action

action to perform before applying regex patterns on text, by default “clean”

patterns

normalization regex patterns whose matches will be removed from the document, by default ['<[^>]*>']

replacement

replacement string to apply when regexes match, by default " "

lowercase

whether to convert strings to lowercase, by default False

policy

policy to remove pattern from text, by default “pretty_all”

encoding

file encoding to apply on normalized documents, by default “UTF-8”

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> cleanUpPatterns = ["<[^>]*>"]
>>> documentNormalizer = DocumentNormalizer() \
...     .setInputCols("document") \
...     .setOutputCol("normalizedDocument") \
...     .setAction("clean") \
...     .setPatterns(cleanUpPatterns) \
...     .setReplacement(" ") \
...     .setPolicy("pretty_all") \
...     .setLowercase(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     documentNormalizer
... ])
>>> text = """
... <div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
...     THE WORLD'S LARGEST WEB DEVELOPER SITE
...     <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
...     <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
... </div>
... </div>"""
>>> data = spark.createDataFrame([[text]]).toDF("text")
>>> pipelineModel = pipeline.fit(data)
>>> result = pipelineModel.transform(data)
>>> result.selectExpr("normalizedDocument.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
setAction(value)

Sets action to perform before applying regex patterns on text, by default “clean”.

Parameters:
value : str

Action to perform before applying regex patterns

setPatterns(value)

Sets the normalization regex patterns whose matches will be removed from the document, by default ['<[^>]*>'].

Parameters:
value : List[str]

Normalization regex patterns whose matches will be removed from the document
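To see what the default pattern selects, the following sketch runs it through Python's `re` module on a small, made-up HTML snippet. It assumes Python and JVM regex engines treat this simple pattern the same way; the variable names are illustrative only.

```python
import re

DEFAULT_PATTERN = r"<[^>]*>"  # default value quoted in the parameter description
html = '<h1 style="font-size:300%;">Title</h1>'

# The "*" quantifier lets the pattern span a whole tag, attributes included.
matches = re.findall(DEFAULT_PATTERN, html)
print(matches)

# Without the quantifier, only tags with exactly one character inside match.
narrow = re.findall(r"<[^>]>", "<b>x</b> <div>")
print(narrow)
```

The comparison shows why the quantifier matters: dropping it leaves closing tags and multi-character tags untouched.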

setReplacement(value)

Sets replacement string to apply when regexes match, by default " ".

Parameters:
value : str

Replacement string to apply when regexes match
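The choice of replacement string affects whether words separated only by markup stay apart. A minimal sketch in plain Python, using `re.sub` as a stand-in for the annotator and a made-up input string:

```python
import re

text = "first<br>second"

# A single space (the documented default) keeps the neighbouring words separate.
with_space = re.sub(r"<[^>]*>", " ", text)

# An empty string glues the neighbouring words together.
with_empty = re.sub(r"<[^>]*>", "", text)

print(with_space)
print(with_empty)
```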

setLowercase(value)

Sets whether to convert strings to lowercase, by default False.

Parameters:
value : bool

Whether to convert strings to lowercase, by default False

setPolicy(value)

Sets policy to remove pattern from text, by default “pretty_all”.

Parameters:
value : str

Policy to remove pattern from text, by default “pretty_all”
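This page names only the default policy, so the sketch below rests on assumptions: that policies named "all", "first", and "pretty_first" exist alongside "pretty_all", that "first" variants replace only the first match, and that "pretty" variants additionally collapse runs of whitespace. The helper `apply_policy` is hypothetical, and the library's exact handling of leading or trailing whitespace may differ.

```python
import re

def apply_policy(text, pattern, replacement=" ", policy="pretty_all"):
    """Hypothetical illustration of removal policies; semantics are assumed."""
    # Assumed: "first" policies touch one match, "all" policies touch every match.
    count = 1 if policy.endswith("first") else 0  # count=0 means "replace all"
    out = re.sub(pattern, replacement, text, count=count)
    # Assumed: "pretty" policies squeeze whitespace runs into a single space.
    if policy.startswith("pretty"):
        out = re.sub(r"\s+", " ", out)
    return out

sample = "<p>a</p>  <p>b</p>"
for name in ("all", "pretty_all", "first", "pretty_first"):
    print(name, repr(apply_policy(sample, r"<[^>]*>", policy=name)))
```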

setEncoding(value)

Sets file encoding to apply on normalized documents, by default “UTF-8”.

Parameters:
value : str

File encoding to apply on normalized documents, by default “UTF-8”