Description
Identifies whether two questions are semantically near-duplicates of each other or different.
Predicted Entities
almost_same, not_same
How to use
- The model is trained with sent_electra_large_uncased embeddings, so the same embeddings must be used in the prediction pipeline.
- The two questions must be marked with "q1" and "q2" in the text. The input text should be formatted as follows:
text = "q1: What is your name? q2: Who are you?"
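Because the classifier expects both questions in one string, it can help to build that string from two separate questions. The following is a minimal sketch, assuming the literal "q1: " and "q2: " prefixes shown above; `format_question_pair` is a hypothetical helper name and not part of Spark NLP.

```python
def format_question_pair(question_1: str, question_2: str) -> str:
    """Build the single-string input format shown above.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    # Collapse newlines and repeated spaces so each question stays on one line
    q1 = " ".join(question_1.split())
    q2 = " ".join(question_2.split())
    if not q1 or not q2:
        raise ValueError("both questions must be non-empty")
    return f"q1: {q1} q2: {q2}"


text = format_question_pair("What is your name?", "Who are you?")
print(text)  # q1: What is your name? q2: Who are you?
```

The resulting string can be passed to the pipeline in the same way as the hand-written examples.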
# Python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLModel
from sparknlp.base import DocumentAssembler, LightPipeline

spark = sparknlp.start()  # skip if a Spark NLP session already exists

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Must be the same embeddings the classifier was trained with
embeddings = BertSentenceEmbeddings.pretrained("sent_electra_large_uncased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

document_classifier = ClassifierDLModel.pretrained("classifierdl_electra_questionpair", "en") \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

nlpPipeline = Pipeline(stages=[document, embeddings, document_classifier])
# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))

result_1 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie do you like most?")
print(result_1["class"])
result_2 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie genre would you like to watch?")
print(result_2["class"])
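// Scala
import com.johnsnowlabs.nlp.{DocumentAssembler, LightPipeline}
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for Seq(...).toDF

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
// Must be the same embeddings the classifier was trained with
val embeddings = BertSentenceEmbeddings.pretrained("sent_electra_large_uncased", "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")
val document_classifier = ClassifierDLModel.pretrained("classifierdl_electra_questionpair", "en")
  .setInputCols(Array("document", "sentence_embeddings"))
  .setOutputCol("class")
val nlpPipeline = new Pipeline().setStages(Array(document, embeddings, document_classifier))
// All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
val light_pipeline = new LightPipeline(nlpPipeline.fit(Seq("").toDF("text")))
val result_1 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie do you like most?")
val result_2 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie genre would you like to watch?")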
# NLU (Python)
import nlu
nlu.load("en.classify.questionpair").predict("""q1: What is your favorite movie? q2: Which movie genre would you like to watch?""")
Results
['almost_same']
['not_same']
Model Information
Model Name: | classifierdl_electra_questionpair |
Compatibility: | Spark NLP 3.1.3+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [sentence_embeddings] |
Output Labels: | [class] |
Language: | en |
Data Source
A custom dataset based on the following source is used: https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs
Benchmarking
label          precision  recall  f1-score  support
almost_same         0.85    0.91      0.88    29652
not_same            0.90    0.84      0.87    29634
accuracy               -       -      0.88    59286
macro-avg           0.88    0.88      0.88    59286
weighted-avg        0.88    0.88      0.88    59286
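As a reading aid for the table above, the f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it from the reported per-label values; `f1_score` is a local illustrative function, and because the inputs are already rounded to two decimals, the recomputed values are only approximate.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per label, copied from the benchmarking table
reported = {"almost_same": (0.85, 0.91), "not_same": (0.90, 0.84)}

for label, (p, r) in reported.items():
    print(label, round(f1_score(p, r), 2))
# almost_same 0.88
# not_same 0.87
```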