Text Classification
Text classification is the process of assigning a category or label to a piece of text, such as an email, tweet, or review. It plays a crucial role in natural language processing (NLP), where it is used to automatically organize text into predefined categories. Spark NLP provides a range of annotators and pretrained models for tackling text classification tasks effectively.

In this context, text classification involves analyzing a document’s content to categorize it into one or more predefined groups. Common use cases include:

  • Organizing news articles into categories like politics, sports, entertainment, or technology.
  • Conducting sentiment analysis, where customer reviews of products or services are classified as positive, negative, or neutral.

By leveraging text classification, organizations can enhance their ability to process and understand large volumes of text data efficiently.
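Before turning to a full pipeline, the core idea can be illustrated with a deliberately naive keyword-based classifier. This is a sketch for intuition only; the word lists below are made up, and real classifiers learn such signals from labeled data rather than hand-written rules.

```python
# A deliberately naive illustration of text classification:
# assign "positive", "negative", or "neutral" by counting keywords.
# The word lists are hypothetical and for illustration only.

POSITIVE = {"loved", "great", "excellent", "amazing"}
NEGATIVE = {"boring", "terrible", "awful", "bad"}

def classify(text: str) -> str:
    tokens = text.lower().replace(".", "").split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(classify("I loved this movie when I was a child."))  # positive
print(classify("It was pretty boring."))                   # negative
```

A learned model replaces the hand-picked word lists with weights estimated from data, which is what the Spark NLP annotators shown later in this page do.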

Picking a Model

When selecting a model for text classification, it's important to evaluate several factors to ensure optimal performance for your specific use case. Start by analyzing the nature of your data, considering whether it is formal or informal and its length (e.g., tweets vs. reviews). Determine whether your task requires binary classification (like spam detection) or multiclass classification (such as categorizing news topics), as some models excel in specific scenarios.

Next, assess the model complexity; simpler models like Logistic Regression work well for straightforward tasks, while more complex models like BERT are suited for nuanced understanding. Consider the availability of labeled data—larger datasets allow for training sophisticated models, whereas smaller datasets may benefit from pre-trained options. Define key performance metrics (e.g., accuracy, F1 score) to inform your choice, and ensure the model’s interpretability meets your requirements. Finally, account for resource constraints, as advanced models will demand more memory and processing power.
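As a concrete reference for the metrics mentioned above, the snippet below computes accuracy, precision, recall, and F1 score from the counts of a binary confusion matrix. The counts themselves are made up for illustration.

```python
# Accuracy and F1 from binary confusion-matrix counts.
# tp/fp/fn/tn are hypothetical values for illustration.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # fraction of all predictions that are correct
precision = tp / (tp + fp)                     # of predicted positives, how many are right
recall    = tp / (tp + fn)                     # of actual positives, how many are found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, f1={f1:.3f}")
```

F1 is often preferred over accuracy when the classes are imbalanced, since a model can reach high accuracy by always predicting the majority class.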

To explore and select from a variety of models, visit Spark NLP Models, where you can find models tailored for different tasks and datasets.

If you have specific needs that are not covered by existing models, you can train your own model tailored to your unique requirements. Follow the guidelines provided in the Spark NLP Training Documentation to get started on creating and training a model suited for your text classification task.

By thoughtfully considering these factors and using the right models, you can enhance your NLP applications significantly.

How to use

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Starting a Spark session with Spark NLP
spark = sparknlp.start()

# Assembling the document from the input text
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Tokenizing the text
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Loading a pre-trained sequence classification model
# You can replace `BertForSequenceClassification.pretrained()` with your selected model 
# For example: BertForSequenceClassification.pretrained("distilbert_sequence_classifier_sst2", "en")
sequenceClassifier = BertForSequenceClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

# Defining the pipeline with document assembler, tokenizer, and classifier
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

# Creating a sample DataFrame
data = spark.createDataFrame([
    ["I loved this movie when I was a child."],
    ["It was pretty boring."]
]).toDF("text")

# Fitting the pipeline and transforming the data
result = pipeline.fit(data).transform(data)

# Showing the classification result
result.select("label.result").show(truncate=False)

+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

// Step 1: Convert raw text into document format
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Step 2: Tokenize the document into words
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Step 3: Load a pre-trained BERT model for sequence classification
val sequenceClassifier = BertForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")
  .setCaseSensitive(true)

// Step 4: Define the pipeline with stages for document assembly, tokenization, and classification
val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  sequenceClassifier
))

// Step 5: Create sample data and apply the pipeline
val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
val result = pipeline.fit(data).transform(data)

// Step 6: Show the classification results
result.select("label.result").show(false)

+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

Try Real-Time Demos!

If you want to see the outputs of text classification models in real time, visit our interactive demos.

Useful Resources

Want to dive deeper into text classification with Spark NLP? Here are some curated resources to help you get started and explore further:

Articles and Guides

Notebooks

Training Scripts
