Description
This embeddings model was imported from Hugging Face. It is a custom version of distilbert_base_multilingual_cased that produces the same representations as the original model, which preserves the original accuracy.
How to use
...
distilbert = DistilBertEmbeddings.pretrained("distilbert_base_cased", "vi") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, distilbert])
text = """Tôi yêu Spark NLP"""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
...
val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "vi")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("Tôi yêu Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("vi.embed.distilbert.cased").predict("""Tôi yêu Spark NLP""")
Results
+-----+--------------------+
|token|          embeddings|
+-----+--------------------+
|  Tôi|[-0.38760236, -0....|
|  yêu|[-0.3357051, -0.5...|
|Spark|[-0.20642707, -0....|
|  NLP|[-0.013280544, -0...|
+-----+--------------------+
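To get from the transformed DataFrame to token/vector pairs like those in the table above, one option is to collect a row and read each annotation's fields. The sketch below assumes that each annotation in the `embeddings` column exposes the token text as `result` and its vector as `embeddings`, which is the usual Spark NLP annotation layout. The names `token_vectors` and `Ann` are hypothetical and used only for illustration, and the numbers in the stand-in are placeholders, not model output.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark NLP annotation row. With a live session,
# real rows would come from the "embeddings" column of the transformed DataFrame.
Ann = namedtuple("Ann", ["result", "embeddings"])


def token_vectors(annotations):
    """Return (token, vector) pairs from a list of embedding annotations.

    Assumes each annotation has a `result` attribute (the token text) and an
    `embeddings` attribute (a sequence of floats).
    """
    return [(ann.result, list(ann.embeddings)) for ann in annotations]


# Placeholder values for illustration only, not real model output.
sample = [Ann("Tôi", [0.1, 0.2]), Ann("yêu", [0.3, 0.4])]
for token, vector in token_vectors(sample):
    print(token, len(vector))
```

With a running Spark session, the same function could be applied to `result.select("embeddings").first()["embeddings"]`, since PySpark rows allow attribute access to their fields.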
Model Information
| Model Name: | distilbert_base_cased |
| Compatibility: | Spark NLP 3.3.4+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence, token] |
| Output Labels: | [embeddings] |
| Language: | vi |
| Size: | 211.6 MB |
| Case sensitive: | false |