Question Answering (QA) is a natural language processing task in which models answer questions based on a given context or, in some cases, from their own knowledge without any context. For example, given the question “Which name is also used to describe the Amazon rainforest in English?” and the context “The Amazon rainforest, also known in English as Amazonia or the Amazon Jungle”, a QA model would output “Amazonia”. This capability makes QA ideal for searching within documents and for powering systems like FAQ automation, customer support, and knowledge-base search.
Types of Question Answering
There are several QA variants:
- Extractive QA, which pulls the exact answer span from a context and is often solved with BERT-like models.
- Open generative QA, which generates natural-sounding answers based on a provided context.
- Closed generative QA, where the model answers entirely from its internal knowledge without any context.
QA systems can also be open-domain, covering a wide range of topics, or closed-domain, focused on specialized areas such as law or medicine. Together, these variants make QA a core building block for modern applications such as search engines, chatbots, and virtual assistants.
Picking a Model
Depending on the QA variant and domain, different models are typically favored: BERT, RoBERTa, and ALBERT are strong choices for extractive QA; T5, BART, and newer families like LLaMA 2 excel at open generative QA; models such as LLaMA 2 and Mistral are well suited for closed generative QA; and domain-specific variants like BioBERT, LegalBERT, and SciBERT deliver the highest performance for specialized fields. A good rule of thumb is to use extractive QA when precise span-level answers are needed from a document, open generative QA when natural and context-aware responses are desired, and closed generative QA when relying on a model’s internal knowledge is sufficient or preferable.
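In Spark NLP, these extractive architectures are exposed through annotators with a uniform interface, so switching between them is usually a one-line change. A minimal sketch, assuming a RoBERTa QA model is available on the Models Hub (the pretrained name below is illustrative, not a confirmed identifier):

from sparknlp.annotator import RoBertaForQuestionAnswering

# Same inputs and outputs as the DistilBERT example further below;
# only the annotator class and the pretrained name change. The model
# identifier here is an assumption; check the Spark NLP Models Hub
# for exact names.
spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_base_qa_squad2", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")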
Recommended Models for Specific QA Tasks
- Extractive QA: Models such as `distilbert-base-cased-distilled-squad`, `bert-large-uncased-whole-word-masking-finetuned-squad`, and `roberta-base-squad2` are well suited for identifying precise answer spans directly from context.
- Open Generative QA (context-based): For generating fluent, context-aware answers, models like `t5_base` and `bart_base` provide strong performance (see the sketch after this list).
- Closed Generative QA (knowledge-based): When answers must be drawn from the model’s internal knowledge rather than external context, newer families like `llama_2_7b_chat` are effective choices.
- Domain-Specific QA: For specialized use cases, consider domain-adapted models such as `dmis-lab/biobert-large-cased-v1.1-squad` for biomedical tasks, `Beri/legal-qa` for legal texts, or `ktrapeznikov/scibert_scivocab_uncased_squad_v2` for scientific literature.
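The main example in the next section covers the extractive case. For the generative variants, here is a minimal sketch using Spark NLP’s T5Transformer, assuming `t5_base` is available as a pretrained model and that question and context are packed into a single prompt (verify both against the Models Hub):

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import T5Transformer
from pyspark.ml import Pipeline

# T5 takes a single text field, so question and context are combined
# into one prompt; setTask prepends the "question:" prefix.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

t5 = T5Transformer.pretrained("t5_base") \
    .setTask("question:") \
    .setInputCols(["documents"]) \
    .setOutputCol("answer")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([
    ["Which name is also used to describe the Amazon rainforest? context: The Amazon rainforest, also known in English as Amazonia or the Amazon Jungle."]
]).toDF("text")

pipeline.fit(data).transform(data).select("answer.result").show(truncate=False)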
How to use
The example below builds an extractive QA pipeline around a pretrained DistilBERT model fine-tuned on SQuAD 2.0, first in Python and then in Scala; both versions produce the output shown after the code.
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Assumes an active Spark NLP session, e.g. spark = sparknlp.start()

# Assembles the question and context columns into annotated documents
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

# Pretrained span classifier that extracts the answer from the context
spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_base_cased_qa_squad2", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[
    documentAssembler,
    spanClassifier
])

data = spark.createDataFrame([
    ["What is my name?", "My name is Clara and I live in Berkeley."]
]).toDF("question", "context")

model = pipeline.fit(data)
result = model.transform(data)
result.select("question", "context", "answer.result").show(truncate=False)
The same pipeline in Scala:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_base_cased_qa_squad2", "en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  spanClassifier
))

val data = Seq(
  ("What is my name?", "My name is Clara and I live in Berkeley.")
).toDF("question", "context")

val model = pipeline.fit(data)
val result = model.transform(data)
result.select("question", "context", "answer.result").show(truncate = false)
+----------------+----------------------------------------+-------+
|question |context |result |
+----------------+----------------------------------------+-------+
|What is my name?|My name is Clara and I live in Berkeley.|[Clara]|
+----------------+----------------------------------------+-------+
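For quick, ad-hoc queries, the fitted model can also be wrapped in a LightPipeline, which runs on plain strings instead of a DataFrame. A minimal sketch, assuming the two-argument fullAnnotate variant that pairs a question with a context:

from sparknlp.base import LightPipeline

light_model = LightPipeline(model)

# fullAnnotate accepts the question and the context as plain strings
# when the pipeline starts with a MultiDocumentAssembler.
annotations = light_model.fullAnnotate(
    "What is my name?",
    "My name is Clara and I live in Berkeley."
)
print(annotations[0]["answer"][0].result)  # Clara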
Try Real-Time Demos!
If you want to see the outputs of question answering models in real time, visit our interactive demos:
- Multihop QA with BERT
- BERT for Extractive Question Answering
- RoBERTa for Question Answering
- T5 for Abstractive Question Answering
Useful Resources
Want to dive deeper into question answering with Spark NLP? Here are some curated resources to help you get started and explore further:
Articles and Guides
- Empowering NLP with Spark NLP and T5 Model: Text Summarization and Question Answering
- Question Answering in Visual NLP: A Picture is Worth a Thousand Answers
- Spark NLP: Unlocking the Power of Question Answering
Notebooks