Duplicate Question Detection

Description

This model was imported from Hugging Face and was trained on the Quora Question Pairs dataset. It uses DistilBERT embeddings with DistilBertForSequenceClassification for text classification. As input, it expects both questions in a single string, separated by a space.
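As a minimal sketch of the input format described above, the two questions can be joined into one string before being passed to the pipeline. The helper name `make_pair_text` is hypothetical and not part of Spark NLP; it only assumes that a single space is the separator, as the description states.

```python
def make_pair_text(question1: str, question2: str) -> str:
    """Join two questions into one string (hypothetical helper, not a Spark NLP API)."""
    # Strip stray whitespace so exactly one space separates the two questions
    return f"{question1.strip()} {question2.strip()}"


text = make_pair_text("Do you want to eat something?", "Are you hungry?")
print(text)
```

The resulting string has the same shape as the example inputs used in the usage section.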

Predicted Entities

non_duplicated, duplicated

How to use

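import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer, DistilBertForSequenceClassification
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Convert the raw text column into Spark NLP documents
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Load the pretrained QQP duplicate-question classifier
sequenceClassifier = DistilBertForSequenceClassification.pretrained("distilbert_base_sequence_classifier_qqp", "en") \
    .setInputCols(['document', 'token']) \
    .setOutputCol('class')

pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])

# No stage needs training, so fit on an empty DataFrame and wrap the model for single-string inference
light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

# Each input holds the two questions separated by a space
result1 = light_pipeline.annotate("Do we have to go there? Are you a doctor?")
result2 = light_pipeline.annotate("Do you want to eat something? Are you hungry?")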

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // assumes an active SparkSession named spark

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val sequenceClassifier = DistilBertForSequenceClassification.pretrained("distilbert_base_sequence_classifier_qqp", "en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

// Each input holds the two questions separated by a space
val example1 = Seq("Do we have to go there? Are you a doctor?").toDS.toDF("text")
val example2 = Seq("Do you want to eat something? Are you hungry?").toDS.toDF("text")
val result1 = pipeline.fit(example1).transform(example1)
val result2 = pipeline.fit(example2).transform(example2)

# NLU one-liner (Python)
import nlu
nlu.load("en.classify.qqp.distil_bert.base").predict("""Do you want to eat something? Are you hungry?""")

Results

['non_duplicated']
['duplicated']
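
The labels above can be turned into a boolean with a small helper. This is a sketch only: `is_duplicate` is a hypothetical name, and it assumes the annotation is a dictionary that maps the output column (`class` here) to a list of label strings, which is how `LightPipeline.annotate` is generally expected to return results. The two dictionaries below are hand-written stand-ins, not model output.

```python
def is_duplicate(annotation: dict, output_col: str = "class") -> bool:
    """Return True when the first predicted label is 'duplicated' (hypothetical helper)."""
    # A missing or empty output column is treated as "not a duplicate"
    labels = annotation.get(output_col, [])
    return len(labels) > 0 and labels[0] == "duplicated"


# Hand-written stand-ins shaped like the assumed annotate() output
example_distinct = {"class": ["non_duplicated"]}
example_duplicate = {"class": ["duplicated"]}

print(is_duplicate(example_distinct))
print(is_duplicate(example_duplicate))
```

Treating an empty result as "not a duplicate" is a design choice of this sketch; callers who need to tell "no prediction" apart from "non_duplicated" should check the label list directly.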

Model Information

Model Name: distilbert_base_sequence_classifier_qqp
Compatibility: Spark NLP 3.4.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [class]
Language: en
Size: 249.8 MB
Case sensitive: true
Max sentence length: 256

References

https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs