Text Preprocessing


Text preprocessing is a critical step in Natural Language Processing (NLP) that converts raw, unstructured text into a clean and analyzable form. It typically includes operations such as tokenization, lowercasing, stopword removal, lemmatization or stemming, and handling of punctuation or special characters. These steps reduce noise, ensure uniformity, and improve the performance of downstream NLP models.

Key Preprocessing Steps

When preprocessing text, consider the following key steps along with the recommended Spark NLP annotators; a configuration sketch for two of them follows the list:

  1. Tokenization (Tokenizer): Break text into smaller units (words, subwords, or sentences).
  2. Spell Checking (NorvigSweetingModel): Correct misspelled words to improve accuracy in NLP tasks.
  3. Normalization (Normalizer): Standardize text by converting to lowercase, expanding contractions, or removing accents.
  4. Stopword Removal (StopWordsCleaner): Remove common, non-informative words (e.g., “the,” “is,” “and”).
  5. Lemmatization (LemmatizerModel): Reduce words to their base form (e.g., “running” → “run”).
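
Each of these steps maps to a Spark NLP annotator, and most annotators expose parameters beyond the defaults used in the pipeline below. The following is a minimal configuration sketch of two commonly tuned stages; the parameter values are illustrative choices, not library defaults:

from sparknlp.annotator import Normalizer, StopWordsCleaner

# Normalizer: lowercase tokens and keep only alphabetic characters
# (the cleanup pattern below is an illustrative choice, not the default)
normalizer = Normalizer() \
    .setInputCols(["corrected"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setCleanupPatterns(["[^A-Za-z]"])

# StopWordsCleaner: swap the built-in stopword list for a custom one
stopwordsCleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setStopWords(["the", "is", "and", "of", "to"]) \
    .setCaseSensitive(False)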

How to use

The pipeline below chains these annotators in Python; the equivalent Scala pipeline follows.

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Convert the raw text column into a document annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split the document into individual tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens")

# Correct misspelled tokens with a pretrained Norvig spell checker
spellChecker = NorvigSweetingModel.pretrained() \
    .setInputCols(["tokens"]) \
    .setOutputCol("corrected")

# Standardize tokens (by default, removes punctuation and other non-letter characters)
normalizer = Normalizer() \
    .setInputCols(["corrected"]) \
    .setOutputCol("normalized")

# Drop common, non-informative words
stopwordsCleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens")

# Reduce the remaining tokens to their base (lemma) form
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("lemmas")

# Assemble all stages into a single Spark ML pipeline
pipeline = Pipeline().setStages([
    documentAssembler, tokenizer, spellChecker, normalizer, stopwordsCleaner, lemmatizer
])

data = spark.createDataFrame([["Dr. Emily Johnson visited New York's Mount Sinai Hospital on September 21, 2023, to evaluate patients suffering from chronic migraines."]]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

The same pipeline in Scala:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // required for .toDF on a Seq

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("tokens")

val spellChecker = NorvigSweetingModel.pretrained()
  .setInputCols("tokens")
  .setOutputCol("corrected")

val normalizer = new Normalizer()
  .setInputCols("corrected")
  .setOutputCol("normalized")

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")

val lemmatizer = LemmatizerModel.pretrained()
  .setInputCols("cleanTokens")
  .setOutputCol("lemmas")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler, tokenizer, spellChecker, normalizer, stopwordsCleaner, lemmatizer
))

val data = Seq("Dr. Emily Johnson visited New York's Mount Sinai Hospital on September 21, 2023, to evaluate patients suffering from chronic migraines.")
  .toDF("text")

val model = pipeline.fit(data)
val result = model.transform(data)
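
To inspect the annotations side by side, the result arrays of each output column can be zipped and exploded. A minimal sketch in Python, assuming Spark 3.2+ so that arrays_zip keeps the aliased field names (shorter arrays are padded with nulls, which is why the later rows below show NULL):

from pyspark.sql import functions as F

# Zip the token-level result arrays of each stage and explode them into one row per token
result.select(
    F.explode(
        F.arrays_zip(
            F.col("tokens.result").alias("Token"),
            F.col("corrected.result").alias("Corrected"),
            F.col("normalized.result").alias("Normalized"),
            F.col("cleanTokens.result").alias("CleanToken"),
            F.col("lemmas.result").alias("Lemma"),
        )
    ).alias("cols")
).select("cols.*").show(20, truncate=False)

This produces a token-level view similar to the following: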

+---------+---------+----------+----------+---------+
|Token    |Corrected|Normalized|CleanToken|Lemma    |
+---------+---------+----------+----------+---------+
|Dr       |Dr       |Dr        |Dr        |Dr       |
|.        |.        |Emily     |Emily     |Emily    |
|Emily    |Emily    |Johnson   |Johnson   |Johnson  |
|Johnson  |Johnson  |visited   |visited   |visit    |
|visited  |visited  |New       |New       |New      |
|New      |New      |Yorks     |Yorks     |Yorks    |
|York's   |Yorks    |Mount     |Mount     |Mount    |
|Mount    |Mount    |Sinai     |Sinai     |Sinai    |
|Sinai    |Sinai    |Hospital  |Hospital  |Hospital |
|Hospital |Hospital |on        |September |September|
|on       |on       |September |evaluate  |evaluate |
|September|September|to        |patients  |patient  |
|21       |21       |evaluate  |suffering |suffer   |
|,        |,        |patients  |chronic   |chronic  |
|2023     |2023     |suffering |migraines |migraine |
|,        |,        |from      |NULL      |NULL     |
|to       |to       |chronic   |NULL      |NULL     |
|evaluate |evaluate |migraines |NULL      |NULL     |
|patients |patients |NULL      |NULL      |NULL     |
|suffering|suffering|NULL      |NULL      |NULL     |
+---------+---------+----------+----------+---------+
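
To hand the cleaned tokens to downstream Spark ML stages as plain strings rather than annotation structs, a Finisher stage can be appended. A minimal sketch reusing the result above (the output column name is an arbitrary choice):

from sparknlp.base import Finisher

# Convert the lemma annotations into a plain array<string> column
finisher = Finisher() \
    .setInputCols(["lemmas"]) \
    .setOutputCols(["finished_lemmas"]) \
    .setCleanAnnotations(False)  # keep the original annotation columns as well

finisher.transform(result).select("finished_lemmas").show(truncate=False)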

Try Real-Time Demos!

If you want to see text preprocessing in real time, check out our interactive demos:

Useful Resources

Want to learn more about text preprocessing with Spark NLP? Explore the following resources:

Articles and Guides

Notebooks
