Vietnamese DistilBERT Base Cased Embeddings

Description

This embeddings model was imported from Hugging Face. It is a custom version of distilbert_base_multilingual_cased that produces the same representations as the original model, so the original accuracy is preserved.

How to use

...

distilbert = DistilBertEmbeddings.pretrained("distilbert_base_cased", "vi") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, distilbert])

text = """Tôi yêu Spark NLP"""

data = spark.createDataFrame([[text]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
...

val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "vi")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("Tôi yêu Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)

import nlu
nlu.load("vi.embed.distilbert.cased").predict("""Tôi yêu Spark NLP""")
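After the pipeline has run, each row of the `embeddings` column holds a list of annotations, one per token. The sketch below shows one way to pair tokens with their vectors once a row has been collected to the driver. It assumes the usual Spark NLP annotation layout, where `result` carries the token text and `embeddings` carries the vector. The helper name `token_vectors` and the sample values are hypothetical placeholders, not output of this model.

```python
from typing import Any, List, Sequence, Tuple


def token_vectors(annotations: Sequence[Any]) -> List[Tuple[str, List[float]]]:
    """Pair each token with its embedding vector.

    Assumes every annotation supports key access for `result` (token text)
    and `embeddings` (list of floats). Plain dicts and PySpark Row objects
    both support this style of access.
    """
    return [(ann["result"], list(ann["embeddings"])) for ann in annotations]


# Hypothetical stand-in for one collected row of the `embeddings` column.
# The numbers are made up for illustration only.
sample = [
    {"result": "Tôi", "embeddings": [0.1, -0.2, 0.3]},
    {"result": "yêu", "embeddings": [0.0, 0.5, -0.1]},
]

pairs = token_vectors(sample)
for token, vector in pairs:
    # Print the token and the length of its vector.
    print(token, len(vector))
```

With a real pipeline result, the same helper could be applied to something like `result.select("embeddings").first()["embeddings"]`, provided the data is small enough to collect.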

Results

+-----+--------------------+
|token|          embeddings|
+-----+--------------------+
|  Tôi|[-0.38760236, -0....|
|  yêu|[-0.3357051, -0.5...|
|Spark|[-0.20642707, -0....|
|  NLP|[-0.013280544, -0...|
+-----+--------------------+
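Token vectors like the ones in the table can be compared with each other, for example by cosine similarity. The following is a minimal sketch in plain Python. The function name `cosine_similarity` and the three-dimensional toy vectors are assumptions for illustration, since the values in the table are truncated and are not reused here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, not real model output.
vec_a = [1.0, 0.0, 0.0]
vec_b = [0.0, 1.0, 0.0]

same = cosine_similarity(vec_a, vec_a)  # identical direction
orth = cosine_similarity(vec_a, vec_b)  # orthogonal directions
print(same, orth)
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.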

Model Information

Model Name: distilbert_base_cased
Compatibility: Spark NLP 3.3.4+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [embeddings]
Language: vi
Size: 211.6 MB
Case sensitive: false