Description
This model uses Urdu word embeddings to detect seven types of entities in Urdu text. It was trained with the urduvec_140M_300d word embeddings, so the same embeddings must be used in the pipeline.
Predicted Entities
PER (Persons), LOC (Locations), ORG (Organizations), DATE (Dates), TIME (Times), DESIGNATION (Designations), NUMBER (Numbers)
How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter at the end of the pipeline to convert the token-level NER tags into full entity chunks.
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

# Converts the token-level NER tags into full entity chunks.
ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler,
                                sentence_detector,
                                tokenizer,
                                word_embeddings,
                                ner,
                                ner_converter])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

# Example sentence: "Brigadier Ed Butler was the military commander of Helmand in 2006."
annotations = light_pipeline.fullAnnotate("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""")
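The detected chunks and their labels can then be read from the LightPipeline output. A minimal sketch, assuming the annotations object returned by fullAnnotate above:

# Each fullAnnotate result maps output columns to lists of Annotation objects;
# NerConverter stores the entity label in each chunk's metadata under "entity".
for annotation in annotations:
    for chunk in annotation["ner_chunk"]:
        print(chunk.result, chunk.metadata["entity"])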
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._  // requires an active SparkSession named `spark`

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

// Converts the token-level NER tags into full entity chunks.
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val nlp_pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner,
    ner_converter))

// Example sentence: "Brigadier Ed Butler was the military commander of Helmand in 2006."
val data = Seq("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""").toDS.toDF("text")
val result = nlp_pipeline.fit(data).transform(data)
import nlu
nlu.load("ur.ner").predict("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""")
Results
| | ner_chunk | entity |
|---:|---------------:|-------------:|
| 0 |بریگیڈیئر | DESIGNATION |
| 1 |ایڈ بٹلر | PERSON |
| 2 |سنہ دوہزارچھ | DATE |
| 3 |ہلمند | LOCATION |
Model Information
| Model Name:    | uner_mk_140M_300d                  |
|:---------------|:-----------------------------------|
| Type:          | ner                                |
| Compatibility: | Spark NLP 4.0.2+                   |
| License:       | Open Source                        |
| Edition:       | Official                           |
| Input Labels:  | [document, token, word_embeddings] |
| Output Labels: | [ner]                              |
| Language:      | ur                                 |
| Size:          | 14.8 MB                            |
References
This model is trained using the following datasets:
- https://www.researchgate.net/publication/312218764_Named_Entity_Dataset_for_Urdu_Named_Entity_Recognition_Task
- https://www.researchgate.net/publication/332653135_Urdu_Named_Entity_Recognition_Corpus_Generation_and_Deep_Learning_Applications
Benchmarking
| label          |    tp |   fp |   fn |     prec |      rec |       f1 |
|:---------------|------:|-----:|-----:|---------:|---------:|---------:|
| I-TIME         |    12 |   10 |    1 | 0.545455 | 0.923077 | 0.685714 |
| B-PERSON       |  2808 |  846 |  535 | 0.768473 | 0.839964 | 0.80263  |
| B-DATE         |    34 |    6 |    6 | 0.85     | 0.85     | 0.85     |
| I-DATE         |    45 |    1 |    2 | 0.978261 | 0.957447 | 0.967742 |
| B-DESIGNATION  |    49 |   30 |   16 | 0.620253 | 0.753846 | 0.680556 |
| I-LOCATION     |  2110 |  750 |  701 | 0.737762 | 0.750623 | 0.744137 |
| B-TIME         |    11 |    9 |    3 | 0.55     | 0.785714 | 0.647059 |
| I-ORGANIZATION |  2006 |  772 |  760 | 0.722102 | 0.725235 | 0.723665 |
| I-NUMBER       |    18 |    6 |    2 | 0.75     | 0.9      | 0.818182 |
| B-LOCATION     |  5428 | 1255 |  582 | 0.81221  | 0.903161 | 0.855275 |
| B-NUMBER       |   194 |   36 |   27 | 0.843478 | 0.877828 | 0.86031  |
| I-DESIGNATION  |    25 |   15 |    6 | 0.625    | 0.806452 | 0.704225 |
| I-PERSON       |  3562 |  759 |  433 | 0.824346 | 0.891614 | 0.856662 |
| B-ORGANIZATION |  1114 |  466 |  641 | 0.705063 | 0.634758 | 0.668066 |
| Macro-average  | 17416 | 4961 | 3715 | 0.738029 | 0.828551 | 0.780675 |
| Micro-average  | 17416 | 4961 | 3715 | 0.778299 | 0.824192 | 0.800588 |
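For reference, the micro-averaged row follows directly from the summed tp/fp/fn counts. A quick sanity check in plain Python (not part of the pipeline; the counts are copied from the table above):

# Micro-averaged precision/recall/F1 from the summed counts.
tp, fp, fn = 17416, 4961, 3715
prec = tp / (tp + fp)                 # ~0.778299
rec = tp / (tp + fn)                  # ~0.824192
f1 = 2 * prec * rec / (prec + rec)    # ~0.800588
print(round(prec, 6), round(rec, 6), round(f1, 6))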