Description
This model uses Arabic word embeddings to recognize four types of entities in Arabic text: persons, locations, organizations, and miscellaneous. It was trained on top of the arabic_w2v_cc_300d word embeddings, so use the same embeddings model in your pipeline.
Predicted Entities
PER, LOC, ORG, MISC
How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = NerDLModel.pretrained("aner_cc_300d", "ar") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز")
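`fullAnnotate` returns one dictionary per input text, keyed by output column; each `ner_chunk` entry is an Annotation whose `result` holds the chunk text and whose `metadata["entity"]` holds the label. A minimal sketch of pulling (chunk, label) pairs out of that structure — using a hand-built stand-in for the Annotation objects, with values mirroring the Results table, since running the real pipeline requires Spark and the pretrained models:

```python
from collections import namedtuple

# Stand-in for Spark NLP's Annotation class: real runs return objects
# exposing .result (the chunk text) and .metadata (a dict with "entity").
Annotation = namedtuple("Annotation", ["result", "metadata"])

# Shape of light_pipeline.fullAnnotate(...) output: one dict per input text.
annotations = [{
    "ner_chunk": [
        Annotation("قوات الثورة العربية", {"entity": "ORG"}),
        Annotation("دمشق", {"entity": "LOC"}),
        Annotation("الإنكليز", {"entity": "PER"}),
    ]
}]

# Collect (chunk text, entity label) pairs from the first document.
pairs = [(a.result, a.metadata["entity"]) for a in annotations[0]["ner_chunk"]]
print(pairs)
```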
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("aner_cc_300d", "ar")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val nlp_pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner,
ner_converter))
import spark.implicits._

val data = Seq("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""").toDS.toDF("text")
val result = nlp_pipeline.fit(data).transform(data)
import nlu
nlu.load("ar.ner").predict("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""")
Results
| | ner_chunk | entity |
|---:|-------------------------:|-------------:|
| 0 | قوات الثورة العربية | ORG |
| 1 | دمشق | LOC |
| 2 | الإنكليز | PER |
Model Information
| Model Name: | aner_cc_300d |
| Type: | ner |
| Compatibility: | Spark NLP 4.0.2+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [document, token, word_embeddings] |
| Output Labels: | [ner] |
| Language: | ar |
| Size: | 14.8 MB |
References
This model was trained on the ANERcorp corpus, obtained from http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp
Benchmarking
| label | tp | fp | fn | prec | rec | f1 |
|:--------------|-----:|----:|----:|---------:|---------:|---------:|
| B-LOC | 163 | 28 | 34 | 0.853403 | 0.827411 | 0.840206 |
| I-ORG | 60 | 10 | 5 | 0.857142 | 0.923077 | 0.888889 |
| I-MIS | 124 | 53 | 53 | 0.700565 | 0.700565 | 0.700565 |
| I-LOC | 64 | 20 | 23 | 0.761904 | 0.735632 | 0.748538 |
| B-MIS | 297 | 71 | 52 | 0.807065 | 0.851003 | 0.828452 |
| I-PER | 84 | 23 | 13 | 0.785046 | 0.865979 | 0.823530 |
| B-ORG | 54 | 9 | 12 | 0.857142 | 0.818181 | 0.837210 |
| B-PER | 182 | 26 | 33 | 0.875 | 0.846512 | 0.860520 |
| Macro-average | 1028 | 240 | 225 | 0.812159 | 0.821045 | 0.816578 |
| Micro-average | 1028 | 240 | 225 | 0.810726 | 0.820431 | 0.815550 |
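The per-label scores above follow directly from the raw counts: precision is tp / (tp + fp), recall is tp / (tp + fn), and F1 is their harmonic mean. A quick check against the B-LOC row:

```python
# Recompute the B-LOC row of the benchmark table from its raw counts.
tp, fp, fn = 163, 28, 34

prec = tp / (tp + fp)               # 163 / 191
rec = tp / (tp + fn)                # 163 / 197
f1 = 2 * prec * rec / (prec + rec)  # harmonic mean of prec and rec

# Matches the table: 0.853403 0.827411 0.840206
print(round(prec, 6), round(rec, 6), round(f1, 6))
```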