Description
This model uses Arabic word embeddings to recognize four types of entities in Arabic text: persons, locations, organizations, and miscellaneous. It was trained on top of the arabic_w2v_cc_300d word embeddings, so use the same embeddings model in your pipeline.
Predicted Entities
PER, LOC, ORG, MISC
How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = NerDLModel.pretrained("aner_cc_300d", "ar") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز")
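`fullAnnotate` returns one dictionary per input text, keyed by output column; each `ner_chunk` entry is an Annotation whose `result` holds the chunk text and whose `metadata["entity"]` holds the label. A minimal sketch of pulling (chunk, label) pairs out of that structure — using a hand-built stand-in for the Annotation objects, with values mirroring the Results table, since running the real pipeline requires Spark and the pretrained models:

```python
from collections import namedtuple

# Stand-in for Spark NLP's Annotation class: real runs return objects
# exposing .result (the chunk text) and .metadata (a dict with "entity").
Annotation = namedtuple("Annotation", ["result", "metadata"])

# Shape of light_pipeline.fullAnnotate(...) output: one dict per input text.
annotations = [{
    "ner_chunk": [
        Annotation("قوات الثورة العربية", {"entity": "ORG"}),
        Annotation("دمشق", {"entity": "LOC"}),
        Annotation("الإنكليز", {"entity": "PER"}),
    ]
}]

# Collect (chunk text, entity label) pairs from the first document.
pairs = [(a.result, a.metadata["entity"]) for a in annotations[0]["ner_chunk"]]
print(pairs)
```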
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("aner_cc_300d", "ar")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val nlp_pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner,
ner_converter))
import spark.implicits._

val data = Seq("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""").toDS.toDF("text")
val result = nlp_pipeline.fit(data).transform(data)
import nlu
nlu.load("ar.ner").predict("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""")
Results
| | ner_chunk | entity |
|---:|-------------------------:|-------------:|
| 0 | قوات الثورة العربية | ORG |
| 1 | دمشق | LOC |
| 2 | الإنكليز | PER |
Model Information
| Model Name: | aner_cc_300d |
| Type: | ner |
| Compatibility: | Spark NLP 4.0.2+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [document, token, word_embeddings] |
| Output Labels: | [ner] |
| Language: | ar |
| Size: | 14.8 MB |
References
This model was trained on the ANERcorp corpus, obtained from http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp
Benchmarking
| label | tp | fp | fn | prec | rec | f1 |
|:--------------|-----:|----:|----:|---------:|---------:|---------:|
| B-LOC | 163 | 28 | 34 | 0.853403 | 0.827411 | 0.840206 |
| I-ORG | 60 | 10 | 5 | 0.857142 | 0.923077 | 0.888889 |
| I-MIS | 124 | 53 | 53 | 0.700565 | 0.700565 | 0.700565 |
| I-LOC | 64 | 20 | 23 | 0.761904 | 0.735632 | 0.748538 |
| B-MIS | 297 | 71 | 52 | 0.807065 | 0.851003 | 0.828452 |
| I-PER | 84 | 23 | 13 | 0.785046 | 0.865979 | 0.823530 |
| B-ORG | 54 | 9 | 12 | 0.857142 | 0.818181 | 0.837210 |
| B-PER | 182 | 26 | 33 | 0.875 | 0.846512 | 0.860520 |
| Macro-average | 1028 | 240 | 225 | 0.812159 | 0.821045 | 0.816578 |
| Micro-average | 1028 | 240 | 225 | 0.810726 | 0.820431 | 0.815550 |
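The per-label scores above follow directly from the raw counts: precision is tp / (tp + fp), recall is tp / (tp + fn), and F1 is their harmonic mean. A quick check against the B-LOC row:

```python
# Recompute the B-LOC row of the benchmark table from its raw counts.
tp, fp, fn = 163, 28, 34

prec = tp / (tp + fp)               # 163 / 191
rec = tp / (tp + fn)                # 163 / 197
f1 = 2 * prec * rec / (prec + rec)  # harmonic mean of prec and rec

# Matches the table: 0.853403 0.827411 0.840206
print(round(prec, 6), round(rec, 6), round(f1, 6))
```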