Description
This model is an XLM-RoBERTa base model fine-tuned on the 40 languages of the WikiANN dataset provided by XTREME. It was fine-tuned with masked language modeling (MLM), randomly masking 15% of the dataset ([MASK]).
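The masking step can be illustrated with a minimal sketch. This is not the fine-tuning code used for this model, which the card does not describe; the function name `mask_tokens`, the whitespace-level tokens, and the `seed` parameter are assumptions made for illustration. It only shows the idea of hiding roughly 15% of the positions and keeping the originals as prediction targets. The common BERT-style recipe additionally leaves some selected tokens unchanged or swaps them for random tokens, which is omitted here.

```python
import random

# Placeholder string as written in this card; an assumption for this sketch.
MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Return (masked_tokens, labels) with about mask_rate of the positions masked.

    labels holds the original token at masked positions and None elsewhere,
    so a model would only be scored on the positions it could not see.
    """
    rng = random.Random(seed)
    # Mask at least one token for any non-empty input.
    n_to_mask = max(1, round(len(tokens) * mask_rate)) if tokens else 0
    masked_positions = set(rng.sample(range(len(tokens)), n_to_mask))

    masked = [MASK_TOKEN if i in masked_positions else tok
              for i, tok in enumerate(tokens)]
    labels = [tok if i in masked_positions else None
              for i, tok in enumerate(tokens)]
    return masked, labels
```

For a 20-token input, `mask_tokens(tokens, seed=0)` replaces 3 tokens with the mask placeholder and returns the 3 hidden tokens as labels.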
XLM-RoBERTa is a scaled cross-lingual sentence encoder. It is trained on 2.5 TB of data in 100 languages, filtered from Common Crawl. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks.
The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov.
It is based on Facebook’s RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
How to use
# Python
from sparknlp.annotator import XlmRoBertaEmbeddings

embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_xtreme_base", "xx") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")
// Scala
import com.johnsnowlabs.nlp.embeddings.XlmRoBertaEmbeddings

val embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_xtreme_base", "xx")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
import nlu
nlu.load("xx.embed.xlm_roberta_xtreme_base").predict("""Put your text here.""")
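The embeddings stage above expects "sentence" and "token" columns, so it has to be preceded by stages that produce them. The following is a sketch of one way to wire that up in Python, assuming `pyspark` and `spark-nlp` are installed and the model can be downloaded. The helper names `build_pipeline` and `embed_texts` and the input column name "text" are illustrative and not part of this card.

```python
def build_pipeline():
    """Assemble a Spark NLP pipeline ending in the XLM-RoBERTa embeddings stage."""
    # Imports live inside the function so this sketch can be loaded on a
    # machine without pyspark / spark-nlp; they are needed only when called.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer, XlmRoBertaEmbeddings

    # Raw text -> document -> sentence -> token, matching the input labels
    # listed for this model ([token, sentence]).
    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

    embeddings = (
        XlmRoBertaEmbeddings.pretrained("xlm_roberta_xtreme_base", "xx")
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings")
    )
    return Pipeline(stages=[document, sentence, token, embeddings])


def embed_texts(spark, texts):
    """Run the pipeline over a list of strings; returns a DataFrame of token vectors."""
    df = spark.createDataFrame([(t,) for t in texts], ["text"])
    model = build_pipeline().fit(df)
    # Each row holds one vector per token of the input text.
    return model.transform(df).select("embeddings.embeddings")
```

With a session created by `sparknlp.start()`, calling `embed_texts(spark, ["Put your text here."]).show()` should display the token vectors for that sentence.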
Model Information
| Property | Value |
|---|---|
| Model Name: | xlm_roberta_xtreme_base |
| Compatibility: | Spark NLP 3.1.3+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [token, sentence] |
| Output Labels: | [embeddings] |
| Language: | xx |
| Case sensitive: | true |
| Max sentence length: | 128 |