Text preprocessing is a critical step in Natural Language Processing (NLP) that converts raw, unstructured text into a clean and analyzable form. It typically includes operations such as tokenization, lowercasing, stopword removal, lemmatization or stemming, and handling of punctuation or special characters. These steps reduce noise, ensure uniformity, and improve the performance of downstream NLP models.
Key Preprocessing Steps
When preprocessing text, consider the following key steps along with the corresponding Spark NLP annotators:

- Tokenization (Tokenizer): Break text into smaller units (words, subwords, or sentences).
- Spell Checking (NorvigSweetingModel): Correct misspelled words to improve accuracy in downstream NLP tasks.
- Normalization (Normalizer): Standardize text by converting to lowercase, expanding contractions, or removing accents.
- Stopword Removal (StopWordsCleaner): Remove common, non-informative words (e.g., “the,” “is,” “and”).
- Lemmatization (LemmatizerModel): Reduce words to their base form (e.g., “running” → “run”).
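Stemming, mentioned above as a lighter-weight alternative to lemmatization, is also available. As a minimal sketch (the column names mirror the pipeline shown under “How to use” below), Spark NLP’s rule-based Stemmer can stand in for the pretrained lemmatizer:

# Rule-based stemmer: chops suffixes rather than mapping tokens to dictionary
# forms, so it may produce non-word stems (e.g., "migraines" may become "migrain")
stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stems")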
How to use
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP (skip if one is already running)
spark = sparknlp.start()

# Wrap raw text in a document annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split the document into individual tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens")

# Correct misspelled tokens with a pretrained Norvig spell checker
spellChecker = NorvigSweetingModel.pretrained() \
    .setInputCols(["tokens"]) \
    .setOutputCol("corrected")

# Strip punctuation and other non-letter characters
normalizer = Normalizer() \
    .setInputCols(["corrected"]) \
    .setOutputCol("normalized")

# Drop common, non-informative words
stopwordsCleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens")

# Reduce each remaining token to its base (dictionary) form
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("lemmas")

pipeline = Pipeline().setStages([
    documentAssembler, tokenizer, spellChecker, normalizer, stopwordsCleaner, lemmatizer
])

data = spark.createDataFrame([
    ["Dr. Emily Johnson visited New York's Mount Sinai Hospital on September 21, 2023, to evaluate patients suffering from chronic migraines."]
]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)
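The stages above run with their defaults. A few commonly tuned parameters are sketched below; the values shown are illustrative assumptions, not the annotators’ defaults:

# Illustrative settings; adjust to your data
normalizer = Normalizer() \
    .setInputCols(["corrected"]) \
    .setOutputCol("normalized") \
    .setLowercase(True)  # fold tokens to lowercase during normalization

stopwordsCleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setStopWords(["the", "is", "and"]) \
    .setCaseSensitive(False)  # match stopwords regardless of case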
The same pipeline in Scala:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDF on a Seq outside spark-shell

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("tokens")

val spellChecker = NorvigSweetingModel.pretrained()
  .setInputCols("tokens")
  .setOutputCol("corrected")

val normalizer = new Normalizer()
  .setInputCols("corrected")
  .setOutputCol("normalized")

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")

val lemmatizer = LemmatizerModel.pretrained()
  .setInputCols("cleanTokens")
  .setOutputCol("lemmas")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler, tokenizer, spellChecker, normalizer, stopwordsCleaner, lemmatizer
))

val data = Seq("Dr. Emily Johnson visited New York's Mount Sinai Hospital on September 21, 2023, to evaluate patients suffering from chronic migraines.")
  .toDF("text")

val model = pipeline.fit(data)
val result = model.transform(data)
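To reproduce a side-by-side view like the table below, one option (sketched in Python, and assuming Spark 3.2+ so that arrays_zip keeps the alias names of its inputs) is to zip each stage’s result array and explode the rows. arrays_zip pads shorter arrays with NULL, which is why the trailing cells in the table are NULL: each stage drops tokens, so its column is shorter than the previous one.

from pyspark.sql import functions as F

# Each annotation column carries its tokens in the nested `result` field
result.select(
    F.explode(
        F.arrays_zip(
            F.col("tokens.result").alias("Token"),
            F.col("corrected.result").alias("Corrected"),
            F.col("normalized.result").alias("Normalized"),
            F.col("cleanTokens.result").alias("CleanToken"),
            F.col("lemmas.result").alias("Lemma")
        )
    ).alias("cols")
).select("cols.*").show(truncate=False)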
+---------+---------+----------+----------+---------+
|Token |Corrected|Normalized|CleanToken|Lemma |
+---------+---------+----------+----------+---------+
|Dr |Dr |Dr |Dr |Dr |
|. |. |Emily |Emily |Emily |
|Emily |Emily |Johnson |Johnson |Johnson |
|Johnson |Johnson |visited |visited |visit |
|visited |visited |New |New |New |
|New |New |Yorks |Yorks |Yorks |
|York's |Yorks |Mount |Mount |Mount |
|Mount |Mount |Sinai |Sinai |Sinai |
|Sinai |Sinai |Hospital |Hospital |Hospital |
|Hospital |Hospital |on |September |September|
|on |on |September |evaluate |evaluate |
|September|September|to |patients |patient |
|21 |21 |evaluate |suffering |suffer |
|, |, |patients |chronic |chronic |
|2023 |2023 |suffering |migraines |migraine |
|, |, |from |NULL |NULL |
|to |to |chronic |NULL |NULL |
|evaluate |evaluate |migraines |NULL |NULL |
|patients |patients |NULL |NULL |NULL |
|suffering|suffering|NULL |NULL |NULL |
+---------+---------+----------+----------+---------+
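For quick, single-string experimentation without building a DataFrame, the fitted model can also be wrapped in a LightPipeline (a minimal sketch using the model fitted above):

from sparknlp.base import LightPipeline

light = LightPipeline(model)
# annotate() returns a dict mapping each output column to its annotated strings
annotations = light.annotate("Dr. Emily Johnson evaluated patients suffering from chronic migraines.")
print(annotations["lemmas"])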
Try Real-Time Demos!
If you want to see text preprocessing in real time, check out our interactive demos:
Useful Resources
Want to learn more about text preprocessing with Spark NLP? Explore the following resources:
Articles and Guides
- Text cleaning: removing stopwords from text with Spark NLP
- Unleashing the Power of Text Tokenization with Spark NLP
- Tokenizing Asian texts into words with word segmentation models in Spark NLP
- Text Cleaning: Standard Text Normalization with Spark NLP
- Boost Your NLP Results with Spark NLP Stemming and Lemmatizing Techniques
- Sample Text Data Preprocessing Implementation In SparkNLP
Notebooks