Question Answering
Question Answering (QA) is the task of automatically answering questions posed by humans in natural language. It is a fundamental problem in natural language processing (NLP), playing a vital role in applications such as search engines, virtual assistants, customer support systems, and more. Spark NLP provides state-of-the-art (SOTA) models for QA tasks, enabling accurate and context-aware responses to user queries.

QA systems extract relevant information from a given context or knowledge base to answer a question. Depending on the model and input, they can either find exact answers within a text or generate a more comprehensive response.

Types of Question Answering

  • Open-Book QA: In this approach, the model has access to external documents, passages, or knowledge sources to extract the answer. The system looks for relevant information within the provided text (e.g., “What is the tallest mountain in the world?” answered using a document about mountains).

  • Closed-Book QA: Here, the model must rely solely on the knowledge it has been trained on, without access to external sources. The answer is generated from the model’s internal knowledge (e.g., answering trivia questions without referring to external material).
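The distinction can be made concrete with a toy sketch. This is not a real model: the dictionary and the regex below are purely illustrative stand-ins for trained-in knowledge and span extraction.

```python
import re

# Stand-in for knowledge a closed-book model absorbed during training.
INTERNAL_KNOWLEDGE = {
    "what is the tallest mountain in the world": "Mount Everest",
}

def closed_book_qa(question):
    """Closed-book: answer only from internal knowledge; no context is given."""
    return INTERNAL_KNOWLEDGE.get(question.lower().rstrip("?"))

def open_book_qa(question, context):
    """Open-book: extract the answer span from the provided document."""
    match = re.search(r"the tallest mountain in the world is ([A-Z][\w ]+)", context)
    return match.group(1) if match else None

print(closed_book_qa("What is the tallest mountain in the world?"))
# Mount Everest
print(open_book_qa(
    "What is the tallest mountain in the world?",
    "According to surveys, the tallest mountain in the world is Mount Everest."))
# Mount Everest
```

Real QA models replace the dictionary with learned parameters and the regex with a span-prediction head, but the input/output contract is the same: closed-book takes only a question, open-book takes a question plus a context.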

Common use cases include:

  • Fact-based QA: Answering factoid questions such as “What is the capital of France?”
  • Reading Comprehension: Extracting answers from a provided context, often used in assessments or educational tools.
  • Dialogue-based QA: Supporting interactive systems that maintain context across multiple turns of conversation.

By leveraging QA models, organizations can build robust systems that improve user engagement, provide instant information retrieval, and offer customer support in a more intuitive manner.

Picking a Model

When selecting a model for question answering, consider the following important factors. First, assess the nature of your data (e.g., structured knowledge base vs. unstructured text) and the type of QA needed (open-book or closed-book). Open-book QA requires models that can efficiently search and extract from external sources, while closed-book QA demands models with a large internal knowledge base.

Evaluate the complexity of the questions: are they simple factoid queries, or do they require deeper reasoning and multi-turn interaction? Metrics such as Exact Match (EM) and F1 score are commonly used to measure model performance in QA tasks. Finally, take into account the computational resources available, as some models, like BERT or T5, may require significant processing power.
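To make the EM and F1 metrics concrete, here is a minimal SQuAD-style scoring sketch. The normalization steps (lowercasing, stripping punctuation and articles) are a common evaluation convention, not a Spark NLP API:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """EM: 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between predicted and gold answers."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Clara", "clara"))                # 1
print(round(f1_score("Clara Smith", "Clara"), 2))   # 0.67
```

EM rewards only perfect spans, while F1 gives partial credit for overlapping tokens, which is why both are usually reported together.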

Explore models tailored for question answering at Spark NLP Models, where you’ll find various options for different QA tasks.

By selecting the appropriate question answering model, you can enhance your ability to deliver accurate and relevant answers tailored to your specific NLP tasks.

How to use

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# 1. Document Assembler: Prepares the question and context text for further processing
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

# 2. Question Answering Model: Uses a pretrained RoBERTa model for QA
spanClassifier = RoBertaForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

# 3. Pipeline: Combines the stages (DocumentAssembler and RoBERTa model) into a pipeline
pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

# 4. Sample Data: Creating a DataFrame with a question and context
data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

# 5. Running the Pipeline: Fitting the pipeline to the data and generating answers
result = pipeline.fit(data).transform(data)

# 6. Displaying the Result: The output is the answer to the question extracted from the context
result.select("answer.result").show(truncate=False)

+--------------------+
|result              |
+--------------------+
|[Clara]             |
+--------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

// 1. Document Assembler: Prepares the question and context text for further processing
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

// 2. Question Answering Model: Uses a pretrained RoBERTa model for QA
val questionAnswering = RoBertaForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

// 3. Pipeline: Combines the stages (DocumentAssembler and RoBERTa model) into a pipeline
val pipeline = new Pipeline().setStages(Array(
  document,
  questionAnswering
))

// 4. Sample Data: Creating a DataFrame with a question and context
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

// 5. Running the Pipeline: Fitting the pipeline to the data and generating answers
val result = pipeline.fit(data).transform(data)

// 6. Displaying the Result: The output is the answer to the question extracted from the context
result.select("answer.result").show(false)

+---------------------+
|result               |
+---------------------+
|[Clara]              |
+---------------------+

Try Real-Time Demos!

If you want to see the outputs of question answering models in real time, visit our interactive demos:

Useful Resources

Want to dive deeper into question answering with Spark NLP? Here are some curated resources to help you get started and explore further:

Articles and Guides

Notebooks
