DistilBERT base model (Cased, ONNX)

Description

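This model is a distilled version of the BERT base model. It was introduced in the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" (Sanh et al., 2019). The code for the distillation process is published in the Hugging Face Transformers repository. This model is cased: it distinguishes between "english" and "English".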


How to use

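import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP available;
# skip this if a session named `spark` already exists.
spark = sparknlp.start()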

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer, 
    embeddings
])

data = spark.createDataFrame([["This is a test sentence for DistilBERT embeddings."]]).toDF("text")

result = nlp_pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show()

// Scala version of the same pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "en")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings
))

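// toDF on a Seq needs the session implicits (assumes a SparkSession named `spark`,
// as in spark-shell)
import spark.implicits._

val data = Seq("This is a test sentence for DistilBERT embeddings.").toDF("text")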

val result = pipeline.fit(data).transform(data)
result.select("embeddings").show(false)

Results

+--------------------+
|          embeddings|
+--------------------+
|[[0.1959381, -0.2...|
+--------------------+
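
The nested brackets in the output reflect that the selected column holds one vector per token for each input row. A common follow-up, not something this model card prescribes, is to average those token vectors into one fixed-size vector per text. The sketch below shows that averaging in plain Python; `mean_pool` is a hypothetical helper name, and the toy vectors are made-up stand-ins, not model output.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into a single vector of the same width."""
    if not token_vectors:
        raise ValueError("expected at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same length")
    n = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


# Toy 3-dimensional vectors standing in for real output,
# which has one much longer vector per token.
toy_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
sentence_vector = mean_pool(toy_vectors)
print(sentence_vector)  # [2.0, 3.0, 4.0]
```

With the Python pipeline above, the token vectors for the first row could be obtained with `result.select("embeddings.embeddings").first()[0]` and passed to `mean_pool`. Collecting to the driver like this only suits small inputs.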

Model Information

Model Name: distilbert_base_cased
Compatibility: Spark NLP 6.1.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [embeddings]
Language: en
Size: 243.8 MB
Case sensitive: true
Max sentence length: 512
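
The case sensitivity and maximum sentence length listed above correspond to settings on the annotator, which can be changed with `setCaseSensitive` and `setMaxSentenceLength`. The following is a minimal sketch of a wrapper around those two setters; `configure_embeddings` is a hypothetical helper name, and treating 512 as a hard upper bound is an assumption based on the value listed here.

```python
def configure_embeddings(embeddings, max_sentence_length=128, case_sensitive=True):
    """Set case handling and sequence length on an embeddings annotator.

    `embeddings` is expected to expose the Spark NLP setters
    setCaseSensitive and setMaxSentenceLength, each returning the annotator.
    """
    # Assumption: 512, the value listed in Model Information, is the upper limit.
    if not 1 <= max_sentence_length <= 512:
        raise ValueError("max_sentence_length must be between 1 and 512")
    return (
        embeddings
        .setCaseSensitive(case_sensitive)
        .setMaxSentenceLength(max_sentence_length)
    )
```

For example, `configure_embeddings(DistilBertEmbeddings.pretrained("distilbert_base_cased", "en"), 128)` would keep the model cased while shortening the sequence length, which may reduce memory use on short texts. Since this model is cased, setting `case_sensitive=False` is unlikely to be what you want.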