sparknlp.annotator.document_normalizer
Contains classes for the DocumentNormalizer
Module Contents
Classes
DocumentNormalizer | Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents.
- class DocumentNormalizer
- Annotator which normalizes raw text from tagged text, e.g. scraped web
pages or XML documents, reading from DOCUMENT type columns and writing the cleaned text to a new DOCUMENT column.
Removes unwanted characters from the text according to one or more input regex patterns. The removal can follow a specific policy, and the text can optionally be converted to lowercase.
For extended examples of usage, see the Examples in the JohnSnowLabs/spark-nlp repository.
Input Annotation types | Output Annotation type
DOCUMENT | DOCUMENT
- Parameters:
- action
Action to perform before applying regex patterns to the text, by default "clean"
- patterns
Normalization regex patterns whose matches will be removed from the document, by default ['<[^>]*>']
- replacement
Replacement string to apply when a regex matches, by default " "
- lowercase
Whether to convert strings to lowercase, by default False
- policy
Policy for removing pattern matches from the text, by default "pretty_all"
- encoding
File encoding to apply to normalized documents, by default "UTF-8"
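The way patterns, replacement, and lowercase combine can be sketched in plain Python. This is a minimal illustration, not the Spark NLP implementation: `normalize_document` is a hypothetical helper, it uses Python's `re` module rather than the JVM regex engine, it ignores action, policy, and encoding, and the order of the steps (pattern replacement first, lowercasing last) is an assumption.

```python
import re
from typing import List, Optional


def normalize_document(text: str,
                       patterns: Optional[List[str]] = None,
                       replacement: str = " ",
                       lowercase: bool = False) -> str:
    """Hypothetical stand-in for pattern-based cleaning (not the library code)."""
    if patterns is None:
        # Default pattern documented above: matches any <...> tag.
        patterns = [r"<[^>]*>"]
    for pattern in patterns:
        # Every match of every pattern is swapped for the replacement string.
        text = re.sub(pattern, replacement, text)
    if lowercase:
        # Assumption: lowercasing happens after pattern removal.
        text = text.lower()
    return text


print(repr(normalize_document("<b>Hi</b>", lowercase=True)))  # ' hi '
```

Note that the sketch leaves the surrounding spaces in place; any whitespace tidying done by the policy setting is deliberately not modeled here.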
Examples
>>> import sparknlp >>> from sparknlp.base import * >>> from sparknlp.annotator import * >>> from pyspark.ml import Pipeline >>> documentAssembler = DocumentAssembler() \ ... .setInputCol("text") \ ... .setOutputCol("document") >>> cleanUpPatterns = ["<[^>]>"] >>> documentNormalizer = DocumentNormalizer() \ ... .setInputCols("document") \ ... .setOutputCol("normalizedDocument") \ ... .setAction("clean") \ ... .setPatterns(cleanUpPatterns) \ ... .setReplacement(" ") \ ... .setPolicy("pretty_all") \ ... .setLowercase(True) >>> pipeline = Pipeline().setStages([ ... documentAssembler, ... documentNormalizer ... ]) >>> text = """ ... <div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif"> ... THE WORLD'S LARGEST WEB DEVELOPER SITE ... <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1> ... <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p> ... </div> ... 
</div>""" >>> data = spark.createDataFrame([[text]]).toDF("text") >>> pipelineModel = pipeline.fit(data) >>> result = pipelineModel.transform(data) >>> result.selectExpr("normalizedDocument.result").show(truncate=False) +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. 
it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]| +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- setAction(value)
Sets the action to perform before applying regex patterns to the text, by default "clean".
- Parameters:
- value : str
Action to perform before applying regex patterns
- setPatterns(value)
Sets the normalization regex patterns whose matches will be removed from the document, by default ['<[^>]*>'].
- Parameters:
- value : List[str]
Normalization regex patterns whose matches will be removed from the document
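The `*` quantifier in the default pattern matters: without it, the character class matches exactly one character, so most tags are left untouched. The sketch below uses Python's `re` module as a stand-in for the regex engine Spark NLP runs on the JVM; for patterns this simple the two are assumed to behave the same. The sample string and the extra `&nbsp;` pattern are illustrative only.

```python
import re

sample = "<div>Hello&nbsp;World</div>"

# Without '*': '<', exactly one non-'>' character, '>'. "<div>" does not match.
no_star = re.sub(r"<[^>]>", " ", sample)

# With '*': '<', any run of non-'>' characters, '>'. Both tags match.
with_star = re.sub(r"<[^>]*>", " ", sample)

# Several patterns can be applied one after another, as a list would be.
cleaned = sample
for pattern in [r"<[^>]*>", r"&nbsp;"]:
    cleaned = re.sub(pattern, " ", cleaned)

print(repr(no_star))    # '<div>Hello&nbsp;World</div>'
print(repr(with_star))  # ' Hello&nbsp;World '
print(repr(cleaned))    # ' Hello World '
```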
- setReplacement(value)
Sets the replacement string to apply when a regex matches, by default " ".
- Parameters:
- value : str
Replacement string to apply when a regex matches
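One likely reason for a single-space default is that tags often separate words. A small sketch with Python's `re` module (again only a stand-in for the library's own regex handling) shows the difference between an empty replacement and a space:

```python
import re

TAG_PATTERN = r"<[^>]*>"
html = "line one<br>line two"

# Empty replacement: the words on either side of the tag run together.
joined = re.sub(TAG_PATTERN, "", html)

# Single-space replacement: the word boundary is preserved.
spaced = re.sub(TAG_PATTERN, " ", html)

print(joined)  # line oneline two
print(spaced)  # line one line two
```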
- setLowercase(value)
Sets whether to convert strings to lowercase, by default False.
- Parameters:
- value : bool
Whether to convert strings to lowercase, by default False