Description
The Universal Sentence Encoder family of models maps text into high-dimensional vectors that capture sentence-level semantics. Our multilingual base model is trained with the conditional masked language modeling objective described in [1]. The model is intended for tasks such as text classification, text clustering, and semantic textual similarity. The base model uses a 12-layer BERT transformer architecture.
Because the model extends the BERT transformer architecture, it is used with the BertSentenceEmbeddings annotator.
[1] Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve. Universal Sentence Representations Learning with Conditional Masked Language Model. November 2020
How to use
from sparknlp.annotator import BertSentenceEmbeddings

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_multi_base", "xx") \
    .setInputCols("sentence") \
    .setOutputCol("sentence_embeddings")
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings

val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_multi_base", "xx")
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('xx.embed_sentence.sent_bert_use_cmlm_multi_base').predict(text, output_level='sentence')
embeddings_df
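The snippets above configure only the embeddings stage. As a rough, minimal sketch of the semantic textual similarity use case mentioned in the Description, the example below wires the annotator into a full Spark NLP pipeline (DocumentAssembler, SentenceDetector, BertSentenceEmbeddings) and compares the two example sentences with cosine similarity. The pipeline wiring and the similarity computation are illustrative assumptions, not part of this model card; only the model name and language code come from the snippets above.

```python
import numpy as np
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, BertSentenceEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Assemble raw text, split it into sentences, then embed each sentence.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_multi_base", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("sentence_embeddings")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])

data = spark.createDataFrame(
    [["I hate cancer"], ["Antibiotics aren't painkillers"]]
).toDF("text")
result = pipeline.fit(data).transform(data)

# Pull the sentence vectors back to the driver and score them with cosine similarity.
rows = result.selectExpr("explode(sentence_embeddings) as emb") \
             .selectExpr("emb.embeddings as vector") \
             .collect()
a, b = np.array(rows[0].vector), np.array(rows[1].vector)
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```

Scores close to 1.0 indicate semantically similar sentences; the same vectors can be fed to a downstream classifier or clustering algorithm.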
Model Information
| Model Name: | sent_bert_use_cmlm_multi_base |
|:---|:---|
| Compatibility: | Spark NLP 3.1.3+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence] |
| Output Labels: | [bert] |
| Language: | xx |
| Case sensitive: | true |
Data Source
https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base/1
Benchmarking
We evaluate this model on XEVAL, a translated version of the SentEval sentence representation benchmark. XEVAL will be publicly available soon.
| XEVAL | ar | bg | de | .... | zh | 15 Languages Average |
|:---|---:|---:|---:|:---:|---:|---:|
| USE-CMLM-Multilingual-Base | 80.6 | 81.2 | 82.6 | .... | 81.7 | 81.2 |
| USE-CMLM-Multilingual-Base + BR | 82.6 | 83.0 | 84.0 | .... | 83.0 | 82.8 |