Google's T5 for closed book question answering

Description

This is a text-to-text model trained by Google on the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl’s web crawl corpus, and then fine-tuned on Wikipedia and the Natural Questions (NQ) dataset. The model can answer free-text questions, such as “Which is the capital of France?”, without relying on any context or external resources.

How to use

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, T5Transformer

# Start (or reuse) a Spark session with Spark NLP available
spark = sparknlp.start()

data = spark.createDataFrame([
    [1, "Which is the capital of France? Who was the first president of USA?"],
    [1, "Which is the capital of Bulgaria ?"],
    [2, "Who is Donald Trump?"]]).toDF("id", "text")

# Wrap the raw text column as document annotations
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Split each document into sentences so that every question is answered separately
sentence_detector = SentenceDetectorDLModel \
    .pretrained() \
    .setInputCols(["documents"]) \
    .setOutputCol("questions")

# Load the pretrained closed-book question answering model
t5 = T5Transformer \
    .pretrained("google_t5_small_ssm_nq") \
    .setInputCols(["questions"]) \
    .setOutputCol("answers")

pipeline = Pipeline().setStages([document_assembler, sentence_detector, t5])
results = pipeline.fit(data).transform(data)

results.select("questions.result", "answers.result").show(truncate=False)

Results

+----------------------------------------------------------------------+--------------------------+
|result                                                                |result                    |
+----------------------------------------------------------------------+--------------------------+
|[Which is the capital of France?, Who was the first president of USA?]|[Paris, George Washington]|
|[Which is the capital of Bulgaria ?]                                  |[Sofia]                   |
|[Who is Donald Trump?]                                                |[a United States citizen] |
+----------------------------------------------------------------------+--------------------------+
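
Both result columns in the table are arrays, and each row appears to hold one answer per detected question, in the same order. The sketch below shows one way to flatten such rows into (question, answer) pairs on the driver. The helper name `pair_questions_with_answers` and the hard-coded `rows` list are hypothetical stand-ins that mirror the table above. The one-answer-per-question alignment is an assumption, so the helper checks it rather than relying on it.

```python
from typing import Iterable, List, Sequence, Tuple


def pair_questions_with_answers(
    rows: Iterable[Tuple[Sequence[str], Sequence[str]]]
) -> List[Tuple[str, str]]:
    """Flatten (questions, answers) rows into a list of (question, answer) pairs.

    Hypothetical helper: assumes the two arrays in each row are aligned
    one-to-one, and raises ValueError when their lengths differ.
    """
    pairs: List[Tuple[str, str]] = []
    for questions, answers in rows:
        if len(questions) != len(answers):
            raise ValueError(
                f"expected one answer per question, got "
                f"{len(questions)} questions and {len(answers)} answers"
            )
        pairs.extend(zip(questions, answers))
    return pairs


# Hard-coded copy of the results table, standing in for collected rows
rows = [
    (
        ["Which is the capital of France?", "Who was the first president of USA?"],
        ["Paris", "George Washington"],
    ),
    (["Which is the capital of Bulgaria ?"], ["Sofia"]),
    (["Who is Donald Trump?"], ["a United States citizen"]),
]

pairs = pair_questions_with_answers(rows)
for question, answer in pairs:
    print(f"{question} -> {answer}")
```

With a live pipeline, the same helper should accept the output of results.select("questions.result", "answers.result").collect(), since each collected PySpark Row unpacks like a tuple. Collecting is only sensible for small result sets.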

Model Information

Model Name: google_t5_small_ssm_nq
Compatibility: Spark NLP 4.0.0+
License: Open Source
Edition: Official
Input Labels: [documents]
Output Labels: [t5]
Language: en
Size: 179.1 MB

References

C4, Wikipedia, NQ