Description
ner_ontonotes_distilbert_base_cased is a Named Entity Recognition (NER) model trained on OntoNotes 5.0. It can extract up to 18 entity types, including people, places, organizations, money, time expressions, and dates.
This model uses the pretrained distilbert_base_cased model from the DistilBertEmbeddings annotator as its input.
Predicted Entities
CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
How to use
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DistilBertEmbeddings, NerDLModel, NerConverter
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
embeddings = DistilBertEmbeddings\
.pretrained('distilbert_base_cased', 'en')\
    .setInputCols(["document", "token"])\
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained('ner_ontonotes_distilbert_base_cased', 'en') \
.setInputCols(['document', 'token', 'embeddings']) \
.setOutputCol('ner')
ner_converter = NerConverter() \
.setInputCols(['document', 'token', 'ner']) \
.setOutputCol('entities')
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter
])
example = spark.createDataFrame([['My name is John!']]).toDF("text")
result = pipeline.fit(example).transform(example)
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "en")
.setInputCols("document", "token")
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("ner_ontonotes_distilbert_base_cased", "en")
    .setInputCols("document", "token", "embeddings")
.setOutputCol("ner")
val ner_converter = NerConverter()
.setInputCols("document", "token", "ner")
.setOutputCol("entities")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, ner_model, ner_converter))
import spark.implicits._
val example = Seq("My name is John!").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
import nlu
text = ["My name is John!"]
ner_df = nlu.load('en.ner.ner_ontonotes_distilbert_base_cased').predict(text, output_level='token')
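The NerConverter stage groups the token-level IOB tags emitted by NerDLModel (e.g. B-PERSON, I-PERSON, O) into whole entity chunks. A minimal pure-Python sketch of that grouping logic, useful for understanding the `entities` output; the `bio_to_chunks` helper is hypothetical and not part of Spark NLP:

```python
def bio_to_chunks(tokens, tags):
    """Group (token, IOB-tag) pairs into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a B- tag always opens a new chunk, closing any open one
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            # an I- tag with a matching label continues the open chunk
            current.append(token)
        else:
            # "O", or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["My", "name", "is", "John", "Smith", "!"]
tags   = ["O", "O", "O", "B-PERSON", "I-PERSON", "O"]
print(bio_to_chunks(tokens, tags))  # [('John Smith', 'PERSON')]
```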
Model Information
Model Name: ner_ontonotes_distilbert_base_cased
Type: ner
Compatibility: Spark NLP 3.2.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Data Source
https://catalog.ldc.upenn.edu/LDC2013T19
Benchmarking
precision recall f1-score support
B-CARDINAL 0.85 0.85 0.85 935
B-DATE 0.87 0.87 0.87 1602
B-EVENT 0.60 0.56 0.58 63
B-FAC 0.69 0.69 0.69 135
B-GPE 0.96 0.93 0.95 2240
B-LANGUAGE 0.90 0.41 0.56 22
B-LAW 0.74 0.42 0.54 40
B-LOC 0.68 0.80 0.74 179
B-MONEY 0.90 0.92 0.91 314
B-NORP 0.94 0.94 0.94 841
B-ORDINAL 0.84 0.87 0.85 195
B-ORG 0.88 0.89 0.88 1795
B-PERCENT 0.92 0.92 0.92 349
B-PERSON 0.93 0.93 0.93 1988
B-PRODUCT 0.60 0.68 0.64 76
B-QUANTITY 0.80 0.74 0.77 105
B-TIME 0.70 0.57 0.62 212
B-WORK_OF_ART 0.77 0.58 0.66 166
I-CARDINAL 0.77 0.90 0.83 331
I-DATE 0.87 0.92 0.89 2011
I-EVENT 0.61 0.78 0.68 130
I-FAC 0.76 0.81 0.79 213
I-GPE 0.95 0.86 0.90 628
I-LAW 0.90 0.54 0.67 106
I-LOC 0.72 0.80 0.76 180
I-MONEY 0.94 0.98 0.96 685
I-NORP 0.96 0.85 0.90 160
I-ORDINAL 0.00 0.00 0.00 4
I-ORG 0.89 0.92 0.91 2406
I-PERCENT 0.95 0.95 0.95 523
I-PERSON 0.95 0.94 0.94 1412
I-PRODUCT 0.59 0.81 0.68 69
I-QUANTITY 0.88 0.83 0.85 206
I-TIME 0.72 0.65 0.68 255
I-WORK_OF_ART 0.81 0.57 0.67 337
O 0.99 0.99 0.99 131815
accuracy 0.98 152728
macro avg 0.80 0.77 0.78 152728
weighted avg 0.98 0.98 0.98 152728
processed 152728 tokens with 11257 phrases; found: 11127 phrases; correct: 9747.
accuracy: 88.49%; (non-O)
accuracy: 97.78%; precision: 87.60%; recall: 86.59%; FB1: 87.09
CARDINAL: precision: 83.58%; recall: 83.85%; FB1: 83.72 938
DATE: precision: 84.94%; recall: 84.52%; FB1: 84.73 1594
EVENT: precision: 58.62%; recall: 53.97%; FB1: 56.20 58
FAC: precision: 68.66%; recall: 68.15%; FB1: 68.40 134
GPE: precision: 95.96%; recall: 92.37%; FB1: 94.13 2156
LANGUAGE: precision: 90.00%; recall: 40.91%; FB1: 56.25 10
LAW: precision: 69.57%; recall: 40.00%; FB1: 50.79 23
LOC: precision: 65.40%; recall: 77.09%; FB1: 70.77 211
MONEY: precision: 88.79%; recall: 90.76%; FB1: 89.76 321
NORP: precision: 93.45%; recall: 93.34%; FB1: 93.40 840
ORDINAL: precision: 83.74%; recall: 87.18%; FB1: 85.43 203
ORG: precision: 85.34%; recall: 86.57%; FB1: 85.95 1821
PERCENT: precision: 89.02%; recall: 88.25%; FB1: 88.63 346
PERSON: precision: 91.40%; recall: 91.45%; FB1: 91.43 1989
PRODUCT: precision: 58.14%; recall: 65.79%; FB1: 61.73 86
QUANTITY: precision: 78.79%; recall: 74.29%; FB1: 76.47 99
TIME: precision: 65.70%; recall: 53.30%; FB1: 58.85 172
WORK_OF_ART: precision: 71.43%; recall: 54.22%; FB1: 61.64 126
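The FB1 column above is the harmonic mean of precision and recall. A quick sanity check against the PERSON row (tiny deviations from the printed value are expected, since the inputs are already rounded to two decimals):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (conlleval's FB1)."""
    return 2 * precision * recall / (precision + recall)

# PERSON row: precision 91.40, recall 91.45 -> printed FB1 is 91.43
print(round(f1(91.40, 91.45), 2))
```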