Detect Entities in Urdu (urduvec_140M_300d embeddings)

Description

This model uses Urdu word embeddings to detect seven types of entities in Urdu text. It was trained with the urduvec_140M_300d word embeddings, so use the same embeddings in your pipeline. The entity types are persons (PER), locations (LOC), organizations (ORG), dates (DATE), designations (DESIGNATION), times (TIME), and numbers (NUMBER). The example output and the benchmark on this page report the first three as PERSON, LOCATION, and ORGANIZATION.

Predicted Entities

PER, LOC, ORG, DATE, TIME, DESIGNATION, NUMBER

How to use

Use this model as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add a NerConverter at the end of the pipeline to merge the per-token entity tags into full entity chunks.

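# Python
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
from pyspark.ml import Pipeline

spark = sparknlp.start()  # start (or reuse) a Spark session with Spark NLP loaded

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")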
        
sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, 
                                sentence_detector, 
                                tokenizer, 
                                word_embeddings, 
                                ner, 
                                ner_converter])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

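annotations = light_pipeline.fullAnnotate("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""")

// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._  // needed for .toDS; assumes a SparkSession named `spark` is in scope (as in spark-shell)

val documentAssembler = new DocumentAssembler()
		.setInputCol("text")
		.setOutputCol("document")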

val sentenceDetector = new SentenceDetector()
		.setInputCols(Array("document"))
		.setOutputCol("sentence")

val tokenizer = new Tokenizer()
		.setInputCols(Array("sentence"))
		.setOutputCol("token")
	
val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur")
		.setInputCols(Array("sentence", "token"))
		.setOutputCol("embeddings")

val ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur")
		.setInputCols(Array("sentence", "token", "embeddings"))
		.setOutputCol("ner")

val ner_converter = new NerConverter()
		.setInputCols(Array("sentence", "token", "ner"))
		.setOutputCol("ner_chunk")

val nlp_pipeline  = new Pipeline().setStages(Array(
					documentAssembler, 
					sentenceDetector, 
					tokenizer, 
					embeddings, 
					ner, 
					ner_converter))

val data = Seq("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""").toDS.toDF("text")

val result = nlp_pipeline.fit(data).transform(data)

# Python (NLU)
import nlu
nlu.load("ur.ner").predict("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""")
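
To illustrate what the NerConverter stage does with the tagger's output, here is a minimal plain-Python sketch that merges per-token IOB tags (the B-/I- prefixed labels listed under Benchmarking) into chunks. It is not Spark NLP's implementation: `iob_to_chunks` is a hypothetical helper, and the tags in the example are inferred from the Results table rather than taken from actual model output.

```python
def iob_to_chunks(tokens, tags):
    """Merge per-token IOB tags into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Close the chunk that is currently open, if any.
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag (or an I- tag with a new label) starts a new chunk.
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # An I- tag with the same label continues the open chunk.
            current_tokens.append(token)
        else:
            # "O" (outside any entity) closes the open chunk.
            flush()
    flush()
    return chunks


# Assumed tags for the first tokens of the example sentence
# (inferred from the Results table, not real model output).
tokens = ["بریگیڈیئر", "ایڈ", "بٹلر", "سنہ", "دوہزارچھ", "میں", "ہلمند"]
tags = ["B-DESIGNATION", "B-PERSON", "I-PERSON", "B-DATE", "I-DATE", "O", "B-LOCATION"]

for text, label in iob_to_chunks(tokens, tags):
    print(label, text)
```

Treating an I- tag whose label differs from the open chunk as the start of a new chunk is one possible design choice; it keeps the sketch tolerant of tag sequences that do not begin with B-.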

Results

|    | ner_chunk      | entity       |
|---:|---------------:|-------------:|
|  0 |بریگیڈیئر          | DESIGNATION  |
|  1 |ایڈ بٹلر           | PERSON       |
|  2 |سنہ دوہزارچھ       | DATE         |
|  3 |ہلمند             | LOCATION     |
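
A table like the one above can be built from the value returned by `fullAnnotate`. The sketch below assumes the usual shape of that value: a list with one dict per input text, keyed by output column, whose `ner_chunk` entries expose a `result` string and a `metadata` dict containing an `"entity"` key. `chunk_rows` is a hypothetical helper, and the stand-in objects exist only so the snippet runs without a Spark session.

```python
from types import SimpleNamespace


def chunk_rows(annotations, chunk_col="ner_chunk"):
    """Return (chunk_text, entity_label) pairs from a fullAnnotate-style result."""
    rows = []
    for doc in annotations:
        for chunk in doc.get(chunk_col, []):
            rows.append((chunk.result, chunk.metadata.get("entity")))
    return rows


# Stand-in for the real annotations (assumption: same attribute names as
# Spark NLP Annotation objects); values copied from the Results table.
fake_annotations = [{
    "ner_chunk": [
        SimpleNamespace(result="ایڈ بٹلر", metadata={"entity": "PERSON"}),
        SimpleNamespace(result="ہلمند", metadata={"entity": "LOCATION"}),
    ]
}]

for text, entity in chunk_rows(fake_annotations):
    print(entity, text)
```

With the Python pipeline from the "How to use" section, the call would be `chunk_rows(annotations)`.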

Model Information

Model Name: uner_mk_140M_300d
Type: ner
Compatibility: Spark NLP 4.0.2+
License: Open Source
Edition: Official
Input Labels: [document, token, word_embeddings]
Output Labels: [ner]
Language: ur
Size: 14.8 MB

References

This model was trained on the following datasets:
https://www.researchgate.net/publication/312218764_Named_Entity_Dataset_for_Urdu_Named_Entity_Recognition_Task
https://www.researchgate.net/publication/332653135_Urdu_Named_Entity_Recognition_Corpus_Generation_and_Deep_Learning_Applications

Benchmarking

label               tp     fp    fn      prec       rec        f1
I-TIME              12     10     1  0.545455  0.923077  0.685714
B-PERSON          2808    846   535  0.768473  0.839964  0.80263 
B-DATE              34      6     6  0.85      0.85      0.85    
I-DATE              45      1     2  0.978261  0.957447  0.967742
B-DESIGNATION       49     30    16  0.620253  0.753846  0.680556
I-LOCATION        2110    750   701  0.737762  0.750623  0.744137
B-TIME              11      9     3  0.55      0.785714  0.647059
I-ORGANIZATION    2006    772   760  0.722102  0.725235  0.723665
I-NUMBER            18      6     2  0.75      0.9       0.818182
B-LOCATION        5428   1255   582  0.81221   0.903161  0.855275
B-NUMBER           194     36    27  0.843478  0.877828  0.86031 
I-DESIGNATION       25     15     6  0.625     0.806452  0.704225
I-PERSON          3562    759   433  0.824346  0.891614  0.856662
B-ORGANIZATION    1114    466   641  0.705063  0.634758  0.668066
Macro-average    17416   4961  3715  0.738029  0.828551  0.780675
Micro-average    17416   4961  3715  0.778299  0.824192  0.800588
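
The columns above follow the standard definitions: prec = tp / (tp + fp), rec = tp / (tp + fn), and f1 is the harmonic mean of the two. The sketch below recomputes a few rows from their counts as a sanity check. The helper names are hypothetical, and the counts are copied from the table. Judging from the numbers, the micro-average row is computed from the summed counts, while the macro-average F1 matches the harmonic mean of the macro precision and recall rather than the mean of the per-label F1 scores.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = harmonic_mean(precision, recall)
    return precision, recall, f1


def harmonic_mean(p, r):
    """Harmonic mean of two scores; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if p + r else 0.0


# (tp, fp, fn) for a few labels, copied from the benchmark table.
counts = {
    "I-TIME": (12, 10, 1),
    "B-PERSON": (2808, 846, 535),
    "B-DATE": (34, 6, 6),
}

for label, (tp, fp, fn) in counts.items():
    p, r, f = precision_recall_f1(tp, fp, fn)
    print(f"{label:<10} prec={p:.6f} rec={r:.6f} f1={f:.6f}")

# Micro-average: apply the same formulas to the summed counts.
micro = precision_recall_f1(17416, 4961, 3715)
print("micro", micro)

# Macro-average F1 as the harmonic mean of the macro precision and recall.
macro_f1 = harmonic_mean(0.738029, 0.828551)
print("macro f1", macro_f1)
```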