Description
This model uses Arabic word embeddings to detect four types of entities in Arabic text. It was trained with the arabic_w2v_cc_300d word embeddings, so the same embeddings must be used in the pipeline.
Predicted Entities
PER, LOC, ORG, MISC
How to use
Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add a NerConverter at the end of the pipeline to convert entity tokens into full entity chunks.
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("aner_cc_300d", "ar")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner,
    ner_converter])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
annotations = light_pipeline.fullAnnotate("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز")
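The NerConverter stage merges IOB-tagged tokens (e.g. B-ORG, I-ORG) into full entity chunks. The following plain-Python sketch illustrates that merging logic on the example sentence; `iob_to_chunks` is a hypothetical helper for illustration, not part of the Spark NLP API.

```python
# Sketch (not the Spark NLP API): NerConverter-style merging of IOB tags
# into entity chunks. A B-X tag starts a chunk; I-X with the same label
# continues it; anything else closes the open chunk.
def iob_to_chunks(tokens, tags):
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O" or a stray I- tag ends the current chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["في", "عام", "1918", "حررت", "قوات", "الثورة", "العربية", "دمشق"]
tags = ["O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "B-LOC"]
print(iob_to_chunks(tokens, tags))
# [('قوات الثورة العربية', 'ORG'), ('دمشق', 'LOC')]
```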
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = NerDLModel.pretrained("aner_cc_300d", "ar")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val nerConverter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner,
    nerConverter))

val data = Seq("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""").toDS.toDF("text")
val result = nlpPipeline.fit(data).transform(data)
import nlu
nlu.load("ar.ner").predict("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""")
Results
| | ner_chunk | entity |
|---:|-------------------------:|-------------:|
| 0 | قوات الثورة العربية | ORG |
| 1 | دمشق | LOC |
| 2 | الإنكليز | PER |
Model Information
| Model Name: | aner_cc_300d |
|---|---|
| Type: | ner |
| Compatibility: | Spark NLP 4.0.2+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [document, token, word_embeddings] |
| Output Labels: | [ner] |
| Language: | ar |
| Size: | 14.8 MB |
References
This model was trained on the ANERcorp dataset, obtained from http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp.
Benchmarking
| label | tp | fp | fn | prec | rec | f1 |
|---|---:|---:|---:|---:|---:|---:|
| B-LOC | 163 | 28 | 34 | 0.853403 | 0.827411 | 0.840206 |
| I-ORG | 60 | 10 | 5 | 0.857142 | 0.923077 | 0.888889 |
| I-MIS | 124 | 53 | 53 | 0.700565 | 0.700565 | 0.700565 |
| I-LOC | 64 | 20 | 23 | 0.761904 | 0.735632 | 0.748538 |
| B-MIS | 297 | 71 | 52 | 0.807065 | 0.851003 | 0.828452 |
| I-PER | 84 | 23 | 13 | 0.785046 | 0.865979 | 0.823530 |
| B-ORG | 54 | 9 | 12 | 0.857142 | 0.818181 | 0.837210 |
| B-PER | 182 | 26 | 33 | 0.875 | 0.846512 | 0.860520 |
| Macro-average | 1028 | 240 | 225 | 0.812159 | 0.821045 | 0.816578 |
| Micro-average | 1028 | 240 | 225 | 0.810726 | 0.820431 | 0.815550 |