Token Classification

Token classification is a natural language understanding task in which a label is assigned to each token in a text. Common subtasks include Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. NER models can be trained to detect entities such as dates, people, and locations, while PoS tagging identifies whether a word functions as a noun, verb, punctuation mark, or another grammatical category. For instance, in the sentence "Sarah lives in London", an NER model would label "Sarah" as a person and "London" as a location, while the remaining tokens receive the outside label (O).

Picking a Model

When picking a model for token classification, start from the type of task you need, such as Named Entity Recognition (NER) for tagging names of people, places, or organizations, Part-of-Speech (PoS) tagging for grammatical structure, or slot filling in chatbots. For small or less complex datasets, lighter models like DistilBERT or pretrained pipelines give fast, practical results. If you have more data or need higher accuracy, larger models like BERT, RoBERTa, or XLM-R are strong baselines, and domain-specialized versions like BioBERT (for biomedical text) or Legal-BERT (for legal text) often perform best in their fields. Keep the trade-offs in mind: smaller models are faster and easier to deploy, while larger transformers offer richer contextual understanding at a higher compute cost.

You can explore and select models for your token classification tasks at Spark NLP Models.
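
For example, to trade some accuracy for speed, you can load a lighter DistilBERT checkpoint instead of the default BERT model. The snippet below is a minimal sketch; the model name used here is one possible identifier, so confirm the exact name on the Models Hub before using it:

# The model name below is an assumption; verify it on the Models Hub.
tokenClassifier = DistilBertForTokenClassification \
    .pretrained("distilbert_base_token_classifier_conll03", "en") \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label")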

If existing models do not meet your requirements, you can train your own custom model using the Spark NLP Training Documentation.
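
As a quick orientation, a custom NER model can be trained on CoNLL-formatted data with NerDLApproach. The sketch below is a minimal outline, assuming a local train.conll file and the default GloVe embeddings; see the Training Documentation for the full workflow:

from sparknlp.training import CoNLL
from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
from pyspark.ml import Pipeline

# Hypothetical path; replace with your own CoNLL-formatted training file.
trainingData = CoNLL().readDataset(spark, "path/to/train.conll")

# Word embeddings consumed by the NER trainer
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Trainable NER annotator; the label column comes from the CoNLL reader
nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(5)

model = Pipeline().setStages([embeddings, nerTagger]).fit(trainingData)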

How to use

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Converts raw text into a `document` annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Splits the document into individual tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Loads the default pretrained BERT model for token classification
tokenClassifier = BertForTokenClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    tokenClassifier
])

data = spark.createDataFrame([["John Lennon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

result.select("label.result").show(truncate=False)

import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

// Converts raw text into a `document` annotation
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Splits the document into individual tokens
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Loads the default pretrained BERT model for token classification
val tokenClassifier = BertForTokenClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  tokenClassifier
))

val data = Seq("John Lennon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")

val model = pipeline.fit(data)
val result = model.transform(data)

result.select("label.result").show(false)

+------------------------------------------------------------------------------------+
|result                                                                              |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
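
In the output, labels follow the BIO scheme: B- marks the first token of an entity, I- marks a continuation of it, and O marks tokens outside any entity, which is why John Lennon is tagged B-PER, I-PER. To view each token next to its predicted label, you can zip the two output arrays; this is a minimal PySpark sketch assuming the column names used above:

from pyspark.sql import functions as F

# Pair each token with its predicted label for easier inspection
result.select(
    F.explode(F.arrays_zip(result.token.result, result.label.result)).alias("cols")
).select(
    F.expr("cols['0']").alias("token"),
    F.expr("cols['1']").alias("label")
).show(truncate=False)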

Try Real-Time Demos!

If you want to see the outputs of token classification models in real time, visit our interactive demos:

Useful Resources

Want to dive deeper into token classification with Spark NLP? Here are some curated resources to help you get started and explore further:

Articles and Guides

Notebooks

Training Scripts
