Detect Persons, Locations, Organizations, Dates, Time, Numbers, and Designation Entities in Urdu (urduvec_140M_300d)

Description

This model detects 7 types of named entities in Urdu text. It was trained with the urduvec_140M_300d word embeddings, so the same embeddings must be used in the pipeline.

Predicted Entities

PERSON, LOCATION, ORGANIZATION, DATE, TIME, NUMBER, DESIGNATION

How to use

Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
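
The pretrained models below cover only the last stages of that pipeline; the first three stages are standard Spark NLP components and can be defined along the following lines (a minimal sketch, with column names chosen to match the stages used further down).

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Assumes a Spark session with Spark NLP has already been started,
# e.g. spark = sparknlp.start()
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")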

word_embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = light_pipeline.fullAnnotate("بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔")
val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val ner_model = NerDLModel.pretrained("uner_mk_140M_300d", "ur")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))

val data = Seq("بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔").toDF("text")

val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("ur.ner").predict("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""")

Results

|    | ner_chunk     | entity      |
|---:|--------------:|------------:|
|  0 | بریگیڈیئر     | DESIGNATION |
|  1 | ایڈ بٹلر      | PERSON      |
|  2 | سنہ دوہزارچھ  | DATE        |
|  3 | ہلمند         | LOCATION    |

Model Information

Model Name: uner_mk_140M_300d
Type: ner
Compatibility: Spark NLP 4.0.0+
License: Open Source
Edition: Official
Input Labels: [document, token, word_embeddings]
Output Labels: [ner]
Language: ur
Size: 14.9 MB
Dependencies: urduvec_140M_300d

References

This model was trained on the following datasets:
https://www.researchgate.net/publication/312218764_Named_Entity_Dataset_for_Urdu_Named_Entity_Recognition_Task
https://www.researchgate.net/publication/332653135_Urdu_Named_Entity_Recognition_Corpus_Generation_and_Deep_Learning_Applications

Benchmarking

label               tp     fp    fn      prec       rec        f1 
I-TIME              12     10     1  0.545455  0.923077  0.685714 
B-PERSON          2808    846   535  0.768473  0.839964  0.80263  
B-DATE              34      6     6  0.85      0.85      0.85     
I-DATE              45      1     2  0.978261  0.957447  0.967742 
B-DESIGNATION       49     30    16  0.620253  0.753846  0.680556 
I-LOCATION        2110    750   701  0.737762  0.750623  0.744137 
B-TIME              11      9     3  0.55      0.785714  0.647059 
I-ORGANIZATION    2006    772   760  0.722102  0.725235  0.723665 
I-NUMBER            18      6     2  0.75      0.9       0.818182 
B-LOCATION        5428   1255   582  0.81221   0.903161  0.855275 
B-NUMBER           194     36    27  0.843478  0.877828  0.86031  
I-DESIGNATION       25     15     6  0.625     0.806452  0.704225 
I-PERSON          3562    759   433  0.824346  0.891614  0.856662 
B-ORGANIZATION    1114    466   641  0.705063  0.634758  0.668066 
Macro-average   17416   4961   3715  0.738029  0.828551  0.780675 
Micro-average   17416   4961   3715  0.778299  0.824192  0.800588
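
The micro-averaged row follows directly from the summed tp/fp/fn counts in the table; a quick sanity check (plain arithmetic, not part of the model):

tp, fp, fn = 17416, 4961, 3715
precision = tp / (tp + fp)                          # ~0.7783
recall = tp / (tp + fn)                             # ~0.8242
f1 = 2 * precision * recall / (precision + recall)  # ~0.8006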