Text preprocessing is the foundational task of cleaning and transforming raw text into a structured format suitable for NLP. It involves a series of steps to normalize text, remove noise, and prepare it for deeper analysis. Spark NLP provides a range of tools for efficient, scalable text preprocessing.
Key Preprocessing Steps
When preprocessing text, consider the following key steps along with the recommended Spark NLP annotators:
- Tokenization: Break text into smaller units (words, subwords, or sentences). Annotator: Tokenizer.
- Spell Checking: Correct misspelled words to improve accuracy in downstream NLP tasks. Annotator: NorvigSweetingModel.
- Normalization: Standardize text by converting to lowercase, expanding contractions, or removing accents. Annotator: Normalizer.
- Stopword Removal: Remove common, non-informative words (e.g., “the,” “is,” “and”). Annotator: StopWordsCleaner.
- Lemmatization: Reduce words to their base form (e.g., “running” → “run”). Annotator: LemmatizerModel.
These steps and annotators will help ensure your text data is clean, consistent, and ready for analysis.
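Most of these annotators can be configured beyond their defaults. The short Python sketch below is illustrative only: the column names match the pipeline in the next section, and the stopword list is a made-up example, not a recommendation.

from sparknlp.annotator import Normalizer, StopWordsCleaner

# Lowercase tokens during normalization (off by default)
normalizer = Normalizer() \
    .setInputCols(["corrected"]) \
    .setOutputCol("normalized") \
    .setLowercase(True)

# Supply a custom stopword list instead of the built-in default
stopwordsCleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False) \
    .setStopWords(["the", "is", "and"])  # illustrative list only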
How to use
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()
# Document Assembler: Converts input text into a suitable format for NLP processing
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
# Tokenizer: Splits text into individual tokens (words)
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("tokens")
# SpellChecker: Corrects misspelled words
spellChecker = NorvigSweetingModel.pretrained() \
.setInputCols(["tokens"]) \
.setOutputCol("corrected")
# Normalizer: Cleans and standardizes text data
normalizer = Normalizer() \
.setInputCols(["corrected"]) \
.setOutputCol("normalized")
# StopWordsCleaner: Removes stopwords
stopwordsCleaner = StopWordsCleaner() \
.setInputCols(["normalized"]) \
.setOutputCol("cleanTokens")
# Lemmatizer: Reduces words to their base form
lemmatizer = LemmatizerModel.pretrained() \
.setInputCols(["cleanTokens"]) \
.setOutputCol("lemmas")
# Pipeline: Assembles the document assembler and preprocessing stages
pipeline = Pipeline().setStages([
documentAssembler, tokenizer, spellChecker, normalizer, stopwordsCleaner, lemmatizer
])
# Input Data: A small example dataset is created and converted to a DataFrame
data = spark.createDataFrame([["Text preprocessing is essential in NLP!"]]).toDF("text")
# Running the Pipeline: Fits the pipeline to the data and preprocesses the text
result = pipeline.fit(data).transform(data)
# Output: Displays the final lemmatized tokens
result.select("lemmas.result").show(truncate=False)
+--------------------------------------+
|result                                |
+--------------------------------------+
|[text, preprocess, essential, in, NLP]|
+--------------------------------------+
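For quick experiments on individual strings, the fitted pipeline can be wrapped in a LightPipeline, which annotates plain strings locally without DataFrame overhead. A minimal sketch, reusing the pipeline and data objects defined above:

from sparknlp.base import LightPipeline

# Fit once, then annotate plain strings directly
light = LightPipeline(pipeline.fit(data))
annotations = light.annotate("Text preprocessing is essential in NLP!")
print(annotations["lemmas"])  # list of lemmatized tokens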
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

// Assumes an active SparkSession named `spark` (e.g., from SparkNLP.start() or spark-shell)
import spark.implicits._
// Document Assembler: Converts input text into a suitable format for NLP processing
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
// Tokenizer: Splits text into individual tokens (words)
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("tokens")
// SpellChecker: Corrects misspelled words
val spellChecker = NorvigSweetingModel.pretrained()
.setInputCols(Array("tokens"))
.setOutputCol("corrected")
// Normalizer: Cleans and standardizes text data
val normalizer = new Normalizer()
.setInputCols(Array("corrected"))
.setOutputCol("normalized")
// StopWordsCleaner: Removes stopwords
val stopwordsCleaner = new StopWordsCleaner()
.setInputCols(Array("normalized"))
.setOutputCol("cleanTokens")
// Lemmatizer: Reduces words to their base form
val lemmatizer = LemmatizerModel.pretrained()
.setInputCols(Array("cleanTokens"))
.setOutputCol("lemmas")
// Pipeline: Assembles the document assembler and preprocessing stages
val pipeline = new Pipeline().setStages(Array(
documentAssembler, tokenizer, spellChecker, normalizer, stopwordsCleaner, lemmatizer
))
// Input Data: A small example dataset is created and converted to a DataFrame
val data = Seq("Text preprocessing is essential in NLP!").toDF("text")
// Running the Pipeline: Fits the pipeline to the data and preprocesses the text
val result = pipeline.fit(data).transform(data)
// Display the results
result.select("lemmas.result").show(false)
+--------------------------------------+
|result                                |
+--------------------------------------+
|[text, preprocess, essential, in, NLP]|
+--------------------------------------+
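Whichever language you use, the intermediate columns produced by each stage remain on the result DataFrame, which makes it easy to inspect what each step did. A minimal Python sketch, assuming the result DataFrame from the example above:

# Show each stage's output side by side
result.selectExpr(
    "tokens.result AS tokens",
    "corrected.result AS corrected",
    "cleanTokens.result AS cleanTokens",
    "lemmas.result AS lemmas"
).show(truncate=False)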
Try Real-Time Demos!
If you want to see text preprocessing in real-time, check out our interactive demos:
- Text Preprocessing with Spark NLP – Explore how Spark NLP preprocesses raw text data.
- Stopwords Removing with Spark NLP – Explore how Spark NLP removes stop words from text.
Useful Resources
Want to learn more about text preprocessing with Spark NLP? Explore the following resources:
Articles and Guides
- Text cleaning: removing stopwords from text with Spark NLP
- Unleashing the Power of Text Tokenization with Spark NLP
- Tokenizing Asian texts into words with word segmentation models in Spark NLP
- Text Cleaning: Standard Text Normalization with Spark NLP
- Boost Your NLP Results with Spark NLP Stemming and Lemmatizing Techniques
- Sample Text Data Preprocessing Implementation In SparkNLP