Description
This model uses a BERT base architecture pretrained from scratch using the Wikipedia, Common Crawl, PMINDIA and Dakshina corpora for the following 17 Indian languages:
Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, and Urdu.
The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below:
- Monolingual Data: publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
- Parallel Data: there are two types of parallel data:
- Translated Data: translations of the above monolingual corpora, obtained using the Google NMT pipeline. Translated segment pairs are fed as input. The publicly available PMINDIA corpus was also used.
- Transliterated Data: transliterations of Wikipedia, obtained using the IndicTrans library. Transliterated segment pairs are fed as input (see the sketch after this list). The publicly available Dakshina dataset was also used.
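To make the parallel input format concrete, here is a minimal Python sketch of how a transliterated segment pair could be packed into a single BERT-style input sequence. The pack_pair helper and the example sentence are illustrative assumptions, not MuRIL's actual pre-processing code.

# Illustrative only: shows the idea of feeding a segment pair as one input.
def pack_pair(segment_a: str, segment_b: str) -> str:
    # BERT-style packing: both segments share one sequence, separated by
    # [SEP], so masked-language-model training can attend across the pair.
    return f"[CLS] {segment_a} [SEP] {segment_b} [SEP]"

# A Hindi sentence and its Latin-script transliteration as a parallel pair.
print(pack_pair("मुझे भाषा पसंद है", "mujhe bhasha pasand hai"))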
Predicted Entities
How to use
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, BertSentenceEmbeddings

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_muril", "xx") \
    .setInputCols("sentence") \
    .setOutputCol("bert_sentence")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])
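Once the stages are assembled, the pipeline is fit and applied like any Spark ML pipeline. A minimal usage sketch, assuming a Spark NLP session was already started with sparknlp.start() before building the pipeline above (an active session is required by pretrained()):

import sparknlp

spark = sparknlp.start()  # returns the existing session if one is running

data = spark.createDataFrame([["I love NLP"]]).toDF("text")
model = nlp_pipeline.fit(data)  # no trainable stages; fit just wires the pipeline
result = model.transform(data)

# Each annotation in bert_sentence carries one 768-dimensional embedding vector.
result.selectExpr("explode(bert_sentence.embeddings) AS embedding").show(truncate=80)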
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val document_assembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence_detector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_muril", "xx")
  .setInputCols("sentence")
  .setOutputCol("bert_sentence")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))
import nlu
text = ["I love NLP"]
sent_embeddings_df = nlu.load('en.embed_sentence.bert.muril').predict(text, output_level='sentence')
sent_embeddings_df
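The NLU call returns a pandas DataFrame with one row per detected sentence. A hedged sketch of reading the vector back out; the embedding column name below is an assumption and can differ between NLU versions, so the columns are printed first:

# Inspect the returned pandas DataFrame; column names vary across NLU versions.
print(sent_embeddings_df.columns)

# Hypothetical column name for the sentence embedding vector.
vector = sent_embeddings_df.iloc[0]["sentence_embedding_bert"]
print(len(vector))  # 768 dimensions for a BERT-base model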
Model Information
|---|---|
| Model Name: | sent_bert_muril |
| Compatibility: | Spark NLP 3.2.0+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence] |
| Output Labels: | [bert_sentence] |
| Language: | xx |
| Case sensitive: | false |
Data Source
The model is imported from: https://tfhub.dev/google/MuRIL/1