This model uses Urdu word embeddings to detect seven types of entities in Urdu text. It was trained with the urduvec_140M_300d word embeddings, so use the same embeddings in your pipeline.
Predicted Entities
PERSON, LOCATION, ORGANIZATION, DATE, TIME, DESIGNATION, NUMBER
How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
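To illustrate what "converting entity tokens into full entity chunks" means, here is a minimal pure-Python sketch that merges token-level IOB tags (the B-/I- labels listed in the benchmark) into chunks. It is a simplified illustration, not NerConverter's actual implementation; the function name `iob_to_chunks` is hypothetical, and the tag sequence is an assumption inferred from the example sentence and its expected results.

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags into (chunk_text, entity_label) pairs."""
    chunks = []
    words, label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        # An I- tag continues the open chunk only if the entity type matches.
        continues = prefix == "I" and entity == label
        if not continues and words:
            # Anything other than a matching I- tag closes the open chunk.
            chunks.append((" ".join(words), label))
            words, label = [], None
        if prefix == "B" or (prefix == "I" and not continues):
            # B- opens a new chunk; a stray I- tag is treated as an opening tag.
            words, label = [token], entity
        elif continues:
            words.append(token)
    if words:
        chunks.append((" ".join(words), label))
    return chunks


# Assumed token/tag sequence for the example sentence (illustrative only).
tokens = ["بریگیڈیئر", "ایڈ", "بٹلر", "سنہ", "دوہزارچھ", "میں", "ہلمند"]
tags = ["B-DESIGNATION", "B-PERSON", "I-PERSON", "B-DATE", "I-DATE", "O", "B-LOCATION"]
chunks = iob_to_chunks(tokens, tags)
for text, entity in chunks:
    print(entity, text)
```

In a real pipeline this step is handled by NerConverter, which reads the `ner` column and emits chunk annotations.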
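from pyspark.ml import Pipeline
from sparknlp.base import LightPipeline
from sparknlp.annotator import WordEmbeddingsModel, NerDLModel, NerConverter
# Assumes a running Spark NLP session (spark) and that document_assembler, sentence_detector and tokenizer are already defined.
word_embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])
# Fit on an empty DataFrame: every stage is pretrained or rule-based, so fitting only assembles the pipeline model.
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
annotations = light_pipeline.fullAnnotate("بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔")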
// Assumes document_assembler, sentence_detector and tokenizer are already defined.
import spark.implicits._
val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("uner_mk_140M_300d", "ur")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("ur.ner").predict("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""")
| | ner_chunk | entity |
| 0 |بریگیڈیئر | DESIGNATION |
| 1 |ایڈ بٹلر | PERSON |
| 2 |سنہ دوہزارچھ | DATE |
| 3 |ہلمند | LOCATION |
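A table like the one above can be assembled from the `fullAnnotate` output, which is a list with one dict per input text, mapping each output column to a list of annotation objects. The sketch below assumes the chunk column is named `ner_chunk` (matching the table header) and that each chunk annotation exposes `result` and a `metadata` dict with an `entity` key. The helper name `chunks_from_annotations` is hypothetical, and the annotations are replaced by simple stand-in objects so the snippet runs without Spark.

```python
from types import SimpleNamespace


def chunks_from_annotations(annotations, chunk_col="ner_chunk"):
    """Collect (chunk_text, entity) pairs from a fullAnnotate-style result."""
    rows = []
    for doc in annotations:
        for chunk in doc.get(chunk_col, []):
            rows.append((chunk.result, chunk.metadata.get("entity")))
    return rows


# Stand-in objects that mimic the assumed shape of the annotations (illustrative only).
fake_annotations = [{
    "ner_chunk": [
        SimpleNamespace(result="ایڈ بٹلر", metadata={"entity": "PERSON"}),
        SimpleNamespace(result="ہلمند", metadata={"entity": "LOCATION"}),
    ]
}]

rows = chunks_from_annotations(fake_annotations)
for index, (text, entity) in enumerate(rows):
    print(index, text, entity)
```

With a real pipeline, pass the `annotations` value returned by `light_pipeline.fullAnnotate(...)` instead of the stand-in list.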
Model Information
Model Name: | uner_mk_140M_300d |
Type: | ner |
Compatibility: | Spark NLP 4.0.0+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [document, token, word_embeddings] |
Output Labels: | [ner] |
Language: | ur |
Size: | 14.9 MB |
Dependencies: | urduvec_140M_300d |
This model is trained using the following datasets:
Benchmarking
label tp fp fn prec rec f1
I-TIME 12 10 1 0.545455 0.923077 0.685714
B-PERSON 2808 846 535 0.768473 0.839964 0.80263
B-DATE 34 6 6 0.85 0.85 0.85
I-DATE 45 1 2 0.978261 0.957447 0.967742
B-DESIGNATION 49 30 16 0.620253 0.753846 0.680556
I-LOCATION 2110 750 701 0.737762 0.750623 0.744137
B-TIME 11 9 3 0.55 0.785714 0.647059
I-ORGANIZATION 2006 772 760 0.722102 0.725235 0.723665
I-NUMBER 18 6 2 0.75 0.9 0.818182
B-LOCATION 5428 1255 582 0.81221 0.903161 0.855275
B-NUMBER 194 36 27 0.843478 0.877828 0.86031
I-DESIGNATION 25 15 6 0.625 0.806452 0.704225
I-PERSON 3562 759 433 0.824346 0.891614 0.856662
B-ORGANIZATION 1114 466 641 0.705063 0.634758 0.668066
Macro-average 17416 4961 3715 0.738029 0.828551 0.780675
Micro-average 17416 4961 3715 0.778299 0.824192 0.800588
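The figures above can be recomputed from the tp/fp/fn counts. The sketch below does so in plain Python; the helper name `prf` is hypothetical. Micro-averaging pools the counts over all labels before computing the scores, while macro-averaging takes the mean of the per-label precision and recall. The macro F1 reported in this table matches the harmonic mean of macro precision and macro recall, not the mean of the per-label F1 scores.

```python
# (tp, fp, fn) per label, copied from the benchmark table.
counts = {
    "I-TIME": (12, 10, 1),
    "B-PERSON": (2808, 846, 535),
    "B-DATE": (34, 6, 6),
    "I-DATE": (45, 1, 2),
    "B-DESIGNATION": (49, 30, 16),
    "I-LOCATION": (2110, 750, 701),
    "B-TIME": (11, 9, 3),
    "I-ORGANIZATION": (2006, 772, 760),
    "I-NUMBER": (18, 6, 2),
    "B-LOCATION": (5428, 1255, 582),
    "B-NUMBER": (194, 36, 27),
    "I-DESIGNATION": (25, 15, 6),
    "I-PERSON": (3562, 759, 433),
    "B-ORGANIZATION": (1114, 466, 641),
}


def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


per_label = {label: prf(*c) for label, c in counts.items()}

# Micro-average: pool the counts across labels, then score once.
total_tp, total_fp, total_fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_p, micro_r, micro_f1 = prf(total_tp, total_fp, total_fn)

# Macro-average: mean of per-label precision and recall, then their harmonic mean.
macro_p = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_r = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"micro  prec={micro_p:.6f} rec={micro_r:.6f} f1={micro_f1:.6f}")
print(f"macro  prec={macro_p:.6f} rec={macro_r:.6f} f1={macro_f1:.6f}")
```

Because frequent labels such as B-LOCATION and I-PERSON dominate the pooled counts, the micro-average sits closer to their scores, while the macro-average gives rare labels such as B-TIME equal weight.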