Text Preprocessing


Text preprocessing is a critical step in Natural Language Processing (NLP) that converts raw, unstructured text into a clean and analyzable form. It typically includes operations such as tokenization, lowercasing, stopword removal, lemmatization or stemming, and handling of punctuation or special characters. These steps reduce noise, ensure uniformity, and improve the performance of downstream NLP models.

Key Preprocessing Steps

When preprocessing text, consider the following key steps along with the recommended Spark NLP annotators; a configuration sketch for two of them follows the list:

  1. Tokenization (Tokenizer): Break text into smaller units (words, subwords, or sentences).
  2. Spell Checking (NorvigSweetingModel): Correct misspelled words to improve accuracy in NLP tasks.
  3. Normalization (Normalizer): Standardize text by converting to lowercase, expanding contractions, or removing accents.
  4. Stopword Removal (StopWordsCleaner): Remove common, non-informative words (e.g., “the,” “is,” “and”).
  5. Lemmatization (LemmatizerModel): Reduce words to their base form (e.g., “running” → “run”).
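
Each of these steps maps to a Spark NLP annotator, and most annotators expose parameters beyond the defaults used in the pipeline below. The following is a minimal configuration sketch of two commonly tuned stages; the parameter values are illustrative choices, not library defaults:

from sparknlp.annotator import Normalizer, StopWordsCleaner

# Normalizer: lowercase tokens and keep only alphabetic characters
# (the cleanup pattern below is an illustrative choice, not the default)
normalizer = Normalizer() \
    .setInputCols(["corrected"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setCleanupPatterns(["[^A-Za-z]"])

# StopWordsCleaner: swap the built-in stopword list for a custom one
stopwordsCleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setStopWords(["the", "is", "and", "of", "to"]) \
    .setCaseSensitive(False)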

How to use

The pipeline below chains these annotators in Python; the equivalent Scala pipeline follows.

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Convert the raw text column into a document annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split the document into individual tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens")

# Correct misspelled tokens with a pretrained Norvig spell checker
spellChecker = NorvigSweetingModel.pretrained() \
    .setInputCols(["tokens"]) \
    .setOutputCol("corrected")

# Standardize tokens (by default, removes punctuation and other non-letter characters)
normalizer = Normalizer() \
    .setInputCols(["corrected"]) \
    .setOutputCol("normalized")

# Drop common, non-informative words
stopwordsCleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens")

# Reduce the remaining tokens to their base (lemma) form
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("lemmas")

# Assemble all stages into a single Spark ML pipeline
pipeline = Pipeline().setStages([
    documentAssembler, tokenizer, spellChecker, normalizer, stopwordsCleaner, lemmatizer
])

data = spark.createDataFrame([["Dr. Emily Johnson visited New York's Mount Sinai Hospital on September 21, 2023, to evaluate patients suffering from chronic migraines."]]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

The same pipeline in Scala:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // required for .toDF on a Seq

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("tokens")

val spellChecker = NorvigSweetingModel.pretrained()
  .setInputCols("tokens")
  .setOutputCol("corrected")

val normalizer = new Normalizer()
  .setInputCols("corrected")
  .setOutputCol("normalized")

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")

val lemmatizer = LemmatizerModel.pretrained()
  .setInputCols("cleanTokens")
  .setOutputCol("lemmas")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler, tokenizer, spellChecker, normalizer, stopwordsCleaner, lemmatizer
))

val data = Seq("Dr. Emily Johnson visited New York's Mount Sinai Hospital on September 21, 2023, to evaluate patients suffering from chronic migraines.")
  .toDF("text")

val model = pipeline.fit(data)
val result = model.transform(data)
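
To inspect the annotations side by side, the result arrays of each output column can be zipped and exploded. A minimal sketch in Python, assuming Spark 3.2+ so that arrays_zip keeps the aliased field names (shorter arrays are padded with nulls, which is why the later rows below show NULL):

from pyspark.sql import functions as F

# Zip the token-level result arrays of each stage and explode them into one row per token
result.select(
    F.explode(
        F.arrays_zip(
            F.col("tokens.result").alias("Token"),
            F.col("corrected.result").alias("Corrected"),
            F.col("normalized.result").alias("Normalized"),
            F.col("cleanTokens.result").alias("CleanToken"),
            F.col("lemmas.result").alias("Lemma"),
        )
    ).alias("cols")
).select("cols.*").show(20, truncate=False)

This produces a token-level view similar to the following: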

+---------+---------+----------+----------+---------+
|Token    |Corrected|Normalized|CleanToken|Lemma    |
+---------+---------+----------+----------+---------+
|Dr       |Dr       |Dr        |Dr        |Dr       |
|.        |.        |Emily     |Emily     |Emily    |
|Emily    |Emily    |Johnson   |Johnson   |Johnson  |
|Johnson  |Johnson  |visited   |visited   |visit    |
|visited  |visited  |New       |New       |New      |
|New      |New      |Yorks     |Yorks     |Yorks    |
|York's   |Yorks    |Mount     |Mount     |Mount    |
|Mount    |Mount    |Sinai     |Sinai     |Sinai    |
|Sinai    |Sinai    |Hospital  |Hospital  |Hospital |
|Hospital |Hospital |on        |September |September|
|on       |on       |September |evaluate  |evaluate |
|September|September|to        |patients  |patient  |
|21       |21       |evaluate  |suffering |suffer   |
|,        |,        |patients  |chronic   |chronic  |
|2023     |2023     |suffering |migraines |migraine |
|,        |,        |from      |NULL      |NULL     |
|to       |to       |chronic   |NULL      |NULL     |
|evaluate |evaluate |migraines |NULL      |NULL     |
|patients |patients |NULL      |NULL      |NULL     |
|suffering|suffering|NULL      |NULL      |NULL     |
+---------+---------+----------+----------+---------+
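
To hand the cleaned tokens to downstream Spark ML stages as plain strings rather than annotation structs, a Finisher stage can be appended. A minimal sketch reusing the result above (the output column name is an arbitrary choice):

from sparknlp.base import Finisher

# Convert the lemma annotations into a plain array<string> column
finisher = Finisher() \
    .setInputCols(["lemmas"]) \
    .setOutputCols(["finished_lemmas"]) \
    .setCleanAnnotations(False)  # keep the original annotation columns as well

finisher.transform(result).select("finished_lemmas").show(truncate=False)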

Try Real-Time Demos!

If you want to see text preprocessing in real time, check out our interactive demos:

Useful Resources

Want to learn more about text preprocessing with Spark NLP? Explore the following resources:

Articles and Guides

Notebooks
