Description
This model is imported from Hugging Face. It was obtained by fine-tuning xlm_roberta_large for named entity recognition on 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá).
Predicted Entities
DATE, LOC, PER, ORG
How to use
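# Spark NLP pipeline (Python)
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (SentenceDetectorDLModel, Tokenizer,
                                XlmRoBertaForTokenClassification, NerConverter)
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# Wrap the raw text column in a "document" annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences with the multilingual sentence detector
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Assign an entity tag to every token
tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_masakhaner", "xx")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = """አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።"""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))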
// Same pipeline in Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes an active SparkSession named `spark`

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_masakhaner", "xx")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))
val example = Seq("አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
import nlu
nlu.load("xx.ner.masakhaner").predict("""አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።""")
Results
+----------------+---------+
|chunk |ner_label|
+----------------+---------+
|አህመድ ቫንዳ |PER |
|ከ3-10-2000 ጀምሮ|DATE |
|በአዲስ አበባ |LOC |
+----------------+---------+
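The chunks in the table come from the NerConverter stage, which merges consecutive token-level tags into entity spans. The following is a minimal pure-Python sketch of that grouping idea, not Spark NLP's actual implementation. The function name `group_iob_tags`, the whitespace tokenization, and the per-token IOB tags (`B-PER`, `I-PER`, and so on) are assumptions chosen to be consistent with the chunks shown above; the real pipeline's tokenizer and tag sequence may differ.

```python
from typing import List, Tuple


def group_iob_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, entity_label) pairs.

    Hypothetical helper for illustration only; it is not part of Spark NLP.
    """
    chunks: List[Tuple[str, str]] = []
    words: List[str] = []
    label = None

    for token, tag in zip(tokens, tags):
        # "B-PER" -> ("B", "-", "PER"); "O" -> ("O", "", "")
        prefix, _, entity = tag.partition("-")
        continues = prefix == "I" and entity == label

        # Any tag that does not continue the open chunk closes it
        if not continues and words:
            chunks.append((" ".join(words), label))
            words, label = [], None

        # B- starts a chunk; a stray I- is treated as a start as well
        if prefix in ("B", "I") and entity:
            if not continues:
                label = entity
            words.append(token)

    # Close a chunk that runs to the end of the sentence
    if words:
        chunks.append((" ".join(words), label))
    return chunks


# Assumed tokenization and tags for the example sentence
tokens = ["አህመድ", "ቫንዳ", "ከ3-10-2000", "ጀምሮ", "በአዲስ", "አበባ", "ኖሯል", "።"]
tags = ["B-PER", "I-PER", "B-DATE", "I-DATE", "B-LOC", "I-LOC", "O", "O"]

chunks = group_iob_tags(tokens, tags)
for chunk, ner_label in chunks:
    print(chunk, ner_label)
```

Under these assumed tags, the sketch yields the same three chunks as the table: one PER, one DATE, and one LOC.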
Model Information
| Model Name: | xlm_roberta_large_token_classifier_masakhaner |
| Compatibility: | Spark NLP 3.3.2+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence, token] |
| Output Labels: | [ner] |
| Language: | xx |
| Case sensitive: | true |
| Max sentence length: | 256 |
Data Source
https://huggingface.co/Davlan/xlm-roberta-large-masakhaner
Benchmarking
language: F1-score:
-------- --------
amh 75.76
hau 91.75
ibo 86.26
kin 76.38
lug 84.64
luo 80.65
pcm 89.55
swa 89.48
wol 70.70
yor 82.05
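To compare languages at a glance, the per-language scores above can be summarized with a few lines of Python. This is a small sketch: the unweighted (macro) average it computes is derived here from the listed values and is not a figure reported by the source, and the variable names are illustrative.

```python
# Per-language F1 scores copied from the benchmark table above
f1_by_language = {
    "amh": 75.76, "hau": 91.75, "ibo": 86.26, "kin": 76.38, "lug": 84.64,
    "luo": 80.65, "pcm": 89.55, "swa": 89.48, "wol": 70.70, "yor": 82.05,
}

# Unweighted mean across languages (each language counts equally)
macro_f1 = sum(f1_by_language.values()) / len(f1_by_language)

# Languages with the weakest and strongest reported scores
lowest = min(f1_by_language, key=f1_by_language.get)
highest = max(f1_by_language, key=f1_by_language.get)

print(f"macro-average F1: {macro_f1:.2f}")
print(f"lowest: {lowest} ({f1_by_language[lowest]:.2f})")
print(f"highest: {highest} ({f1_by_language[highest]:.2f})")
```

Because the average is unweighted, it does not account for differences in test-set size between languages.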