Description
This model is a distilled version of the BERT base model. It was introduced in this paper. The code for the distillation process can be found here. This model is cased: it distinguishes between english and English.
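To make the cased behaviour concrete, the sketch below compares what happens to the two spellings with and without lowercasing. It is only an illustration of case folding on plain strings, not the model's actual tokenizer; `distinct_forms` is a hypothetical helper introduced here.

```python
from typing import Iterable, List


def distinct_forms(words: Iterable[str], cased: bool = True) -> List[str]:
    """Return the distinct surface forms left after optional lowercasing."""
    return sorted({w if cased else w.lower() for w in words})


words = ["english", "English"]
cased_forms = distinct_forms(words, cased=True)     # both spellings stay distinct
uncased_forms = distinct_forms(words, cased=False)  # spellings collapse into one
```

With a cased model the two spellings reach the model as different inputs, so they can receive different embeddings. An uncased model would lowercase them first and treat them as the same word.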
How to use
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Turn raw text into document annotations
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into sentences
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Split each sentence into tokens
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Load the pretrained model; it produces one vector per token
embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings
])

data = spark.createDataFrame([["This is a test sentence for DistilBERT embeddings."]]).toDF("text")
result = nlp_pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show()
// Scala version of the same pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
// Needed for toDF on a Seq; assumes an active SparkSession named spark
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Load the pretrained model; it produces one vector per token
val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "en")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings
))

val data = Seq("This is a test sentence for DistilBERT embeddings.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("embeddings").show(false)
Results
+--------------------+
|          embeddings|
+--------------------+
|[[0.1959381, -0.2...|
+--------------------+
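The column shown above holds one vector per token. If a single vector per sentence is needed, one common option is to average the token vectors (mean pooling). The following is a minimal sketch under that assumption, written in plain Python so it runs without Spark; `mean_pool` is a hypothetical helper, and the toy vectors are made up for illustration rather than taken from the model.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average per-token vectors into a single vector of the same width."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    width = len(token_vectors[0])
    if any(len(v) != width for v in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / count for i in range(width)]


# Stand-in data: three tokens with 4-dimensional vectors (made up, not model output).
toy_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
sentence_vector = mean_pool(toy_vectors)
```

With the Python pipeline above, the input could come from something like `result.select("embeddings.embeddings").first()[0]`, which should yield a list of per-token vectors for the first row. Spark NLP also ships a `SentenceEmbeddings` annotator that can perform this pooling inside the pipeline, which is likely the better choice for large datasets.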
Model Information
| Model Name: | distilbert_base_cased |
| Compatibility: | Spark NLP 6.1.0+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence, token] |
| Output Labels: | [embeddings] |
| Language: | en |
| Size: | 243.8 MB |
| Case sensitive: | true |
| Max sentence length: | 512 |