Description
This is a text-to-text model trained by Google on the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl's web crawl corpus, and then fine-tuned on Wikipedia and the Natural Questions (NQ) dataset. The model can answer free-text questions, such as “Which is the capital of France?”, without relying on any context or external resources.
How to use
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, T5Transformer
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP available
spark = sparknlp.start()

data = spark.createDataFrame([
    [1, "Which is the capital of France? Who was the first president of USA?"],
    [1, "Which is the capital of Bulgaria ?"],
    [2, "Who is Donald Trump?"]]).toDF("id", "text")

# Wrap the raw text in Spark NLP document annotations
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Split each document into sentences so every question is answered separately
sentence_detector = SentenceDetectorDLModel \
    .pretrained() \
    .setInputCols(["documents"]) \
    .setOutputCol("questions")

# Load the pretrained closed-book question-answering model
t5 = T5Transformer \
    .pretrained("google_t5_small_ssm_nq") \
    .setInputCols(["questions"]) \
    .setOutputCol("answers")

pipeline = Pipeline().setStages([document_assembler, sentence_detector, t5])
results = pipeline.fit(data).transform(data)
results.select("questions.result", "answers.result").show(truncate=False)
Results
+----------------------------------------------------------------------+--------------------------+
|result                                                                |result                    |
+----------------------------------------------------------------------+--------------------------+
|[Which is the capital of France?, Who was the first president of USA?]|[Paris, George Washington]|
|[Which is the capital of Bulgaria ?]                                  |[Sofia]                   |
|[Who is Donald Trump?]                                                |[a United States citizen] |
+----------------------------------------------------------------------+--------------------------+
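Each output row holds two parallel arrays, one of questions and one of answers. The following is a minimal sketch of pairing them up. It assumes the rows have already been collected into plain Python lists and that the i-th answer corresponds to the i-th question, as the table above suggests. The helper `pair_questions_with_answers` is hypothetical and not part of Spark NLP.

```python
from typing import List, Tuple


def pair_questions_with_answers(
    questions: List[str], answers: List[str]
) -> List[Tuple[str, str]]:
    """Pair each question with the answer at the same position.

    Hypothetical helper: assumes the two arrays are aligned one-to-one.
    """
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    return list(zip(questions, answers))


# Sample values copied from the first row of the results table
questions = ["Which is the capital of France?", "Who was the first president of USA?"]
answers = ["Paris", "George Washington"]

for question, answer in pair_questions_with_answers(questions, answers):
    print(f"{question} -> {answer}")
```

The length check is a deliberate guard: if the sentence detector and the model ever produce arrays of different sizes, raising an error is safer than silently dropping a question.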
Model Information
| Model Name: | google_t5_small_ssm_nq |
| Compatibility: | Spark NLP 4.0.0+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [documents] |
| Output Labels: | [t5] |
| Language: | en |
| Size: | 179.1 MB |
References
C4, Wikipedia, NQ