Detect Person, Location, Organization, and Miscellaneous entities in Arabic (ANERcorp)

Description

This model uses Arabic word embeddings to detect four entity types in Arabic text: person (PER), location (LOC), organization (ORG), and miscellaneous (MIS). It was trained with the arabic_w2v_cc_300d word embeddings, so use the same embeddings in the pipeline.

Predicted Entities

PER, LOC, ORG, MIS

How to use

Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add a NerConverter at the end of the pipeline to merge token-level entity tags into full entity chunks.

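import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Upstream stages that produce the "document", "sentence" and "token" columns
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

# Same embeddings the model was trained with, read from the same sentence column the NER stage uses
word_embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("aner_cc_300d", "ar") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

# Merges token-level tags into entity chunks
ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])

# Fit on an empty DataFrame: every stage is either pretrained or needs no training
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = light_pipeline.fullAnnotate("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز")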

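import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
// Assumes a SparkSession named `spark` is in scope, as in spark-shell
import spark.implicits._

// Upstream stages that produce the "document", "sentence" and "token" columns
val document_assembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence_detector = new SentenceDetector().setInputCols(Array("document")).setOutputCol("sentence")
val tokenizer = new Tokenizer().setInputCols(Array("sentence")).setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = NerDLModel.pretrained("aner_cc_300d", "ar")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))

val data = Seq("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز").toDF("text")

val result = pipeline.fit(data).transform(data)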

import nlu
nlu.load("ar.ner").predict("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""")
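
The `annotations` value returned by `fullAnnotate` in the Python example is a list with one dictionary per input text, keyed by output column name. A minimal sketch of flattening the `ner_chunk` entries into (chunk, entity) pairs could look like the following. The helper name `chunks_to_rows` and the `FakeAnnotation` stand-in are hypothetical, introduced only so the snippet runs without a Spark session; it assumes each chunk annotation exposes `result` and a `metadata` dictionary with an `entity` key.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark NLP annotation object, so the sketch runs without Spark.
# Only the two attributes used below are modelled.
FakeAnnotation = namedtuple("FakeAnnotation", ["result", "metadata"])


def chunks_to_rows(annotated, column="ner_chunk"):
    """Return (chunk_text, entity_label) pairs from one fullAnnotate() result dict."""
    return [(chunk.result, chunk.metadata.get("entity")) for chunk in annotated.get(column, [])]


# Sample shaped like one element of `annotations`; values mirror the Results table
sample = {
    "ner_chunk": [
        FakeAnnotation("قوات الثورة العربية", {"entity": "ORG"}),
        FakeAnnotation("دمشق", {"entity": "LOC"}),
        FakeAnnotation("الإنكليز", {"entity": "PER"}),
    ]
}

rows = chunks_to_rows(sample)
for chunk_text, entity in rows:
    print(entity, chunk_text)
```

With the real pipeline, the same helper would be called as `chunks_to_rows(annotations[0])`.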

Results

|    | ner_chunk                | entity       |
|---:|-------------------------:|-------------:|
|  0 | قوات الثورة العربية    | ORG          |
|  1 | دمشق                    | LOC          |
|  2 | الإنكليز                 | PER          |
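
The chunks in this table are built by merging token-level B-/I- tags into spans, which is the job of the NerConverter stage. The sketch below illustrates that merging step in plain Python. It is not Spark NLP's implementation, the function name `iob_to_chunks` is hypothetical, and the per-token tags are assumed (inferred from the chunks in the table, not copied from model output).

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, entity) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity != label):
            # A new entity starts: flush the one in progress, if any
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], entity
        elif prefix == "I":
            # Continuation of the current entity
            current.append(token)
        else:
            # "O": outside any entity, so close the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks


tokens = "في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز".split()
# Assumed tags, chosen to be consistent with the Results table
tags = ["O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "B-LOC", "O", "O", "B-PER"]

chunks = iob_to_chunks(tokens, tags)
for text, entity in chunks:
    print(entity, text)
```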

Model Information

Model Name:	aner_cc_300d
Type:	ner
Compatibility:	Spark NLP 4.0.0+
License:	Open Source
Edition:	Official
Input Labels:	[document, token, word_embeddings]
Output Labels:	[ner]
Language:	ar
Size:	14.9 MB
Dependencies:	arabic_w2v_cc_300d

References

This model was trained on the ANERcorp dataset, obtained from http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp

Benchmarking

label               tp     fp    fn      prec       rec        f1 
B-LOC              163     28    34  0.853403  0.827411  0.840206 
I-ORG               60     10     5  0.857142  0.923077  0.888889 
I-MIS              124     53    53  0.700565  0.700565  0.700565 
I-LOC               64     20    23  0.761904  0.735632  0.748538 
B-MIS              297     71    52  0.807065  0.851003  0.828452 
I-PER               84     23    13  0.785046  0.865979  0.823530 
B-ORG               54      9    12  0.857142  0.818181  0.837210 
B-PER              182     26    33  0.875     0.846512  0.860520 
Macro-average     1028    240   225  0.812159  0.821045  0.816578 
Micro-average     1028    240   225  0.810726  0.820431  0.815550 
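
The precision, recall, and F1 columns follow from the tp, fp, and fn counts, so the table can be checked by hand. The sketch below recomputes them from the counts above. The helper name `prf` is hypothetical. The macro F1 is computed here as the harmonic mean of macro precision and macro recall, which is an assumption that appears to match the reported value to within rounding.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1


# (tp, fp, fn) per label, copied from the benchmark table
counts = {
    "B-LOC": (163, 28, 34),
    "I-ORG": (60, 10, 5),
    "I-MIS": (124, 53, 53),
    "I-LOC": (64, 20, 23),
    "B-MIS": (297, 71, 52),
    "I-PER": (84, 23, 13),
    "B-ORG": (54, 9, 12),
    "B-PER": (182, 26, 33),
}

per_label = {label: prf(*c) for label, c in counts.items()}

# Micro-average: pool the counts over all labels, then compute the metrics once
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = prf(tp, fp, fn)

# Macro-average: unweighted mean of per-label precision and recall
macro_prec = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_rec = sum(r for _, r, _ in per_label.values()) / len(per_label)
# Assumption: macro F1 as the harmonic mean of macro precision and macro recall
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)

print("micro: prec=%.6f rec=%.6f f1=%.6f" % micro)
print("macro: prec=%.6f rec=%.6f f1=%.6f" % (macro_prec, macro_rec, macro_f1))
```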
