Description
- This model was trained on the Broad Twitter Corpus (BTC) dataset, so it is well suited to detecting entities in Twitter-style texts.
- It is based on bert_base_cased embeddings, which are bundled with the model, so no separate embeddings stage is needed in the NLP pipeline.
Predicted Entities
PER, LOC, ORG
How to use
Python:

import pandas as pd
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForTokenClassification, NerConverter

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["document","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])
model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))
test_sentences = ["""Pentagram's Dominic Lippa is working on a new identity for University of Arts London."""]
result = model.transform(spark.createDataFrame(pd.DataFrame({'text': test_sentences})))
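To display the extracted chunks with their labels (as in the Results table below), one option is to explode the ner_chunk annotations. This is a minimal sketch assuming the pipeline above; it relies on NerConverter storing the label under the 'entity' key of each chunk's metadata:

# One row per detected chunk; the label sits in the chunk's metadata.
result.selectExpr("explode(ner_chunk) AS chunk") \
    .selectExpr("chunk.result AS chunk", "chunk.metadata['entity'] AS ner_label") \
    .show(truncate=False)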
Scala:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en")
.setInputCols("token", "document")
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))
val data = Seq("Pentagram's Dominic Lippa is working on a new identity for University of Arts London.").toDF("text")
val result = pipeline.fit(data).transform(data)
NLU:

import nlu
nlu.load("en.classify.token_bert.classifier_ner_btc").predict("""Pentagram's Dominic Lippa is working on a new identity for University of Arts London.""")
Results
+--------------------------+---------+
|chunk |ner_label|
+--------------------------+---------+
|Pentagram's |ORG |
|Dominic Lippa |PER |
|University of Arts London |ORG |
+--------------------------+---------+
Model Information
| Model Name: | bert_token_classifier_ner_btc |
| Compatibility: | Spark NLP 3.2.2+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence, token] |
| Output Labels: | [ner] |
| Language: | en |
| Case sensitive: | true |
| Max sentence length: | 128 |
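Note that sentences longer than the 128-token limit above are truncated at inference. If longer inputs need to be covered, the limit can be raised when loading the model; the sketch below uses setMaxSentenceLength, a standard parameter of Spark NLP's transformer-based token classifiers (BERT's positional embeddings cap it at 512):

# Sketch: load the model with a larger processing window (max 512 for BERT).
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en") \
    .setInputCols("token", "document") \
    .setOutputCol("ner") \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)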
Data Source
https://github.com/juand-r/entity-recognition-datasets/tree/master/data/BTC
Benchmarking
label precision recall f1-score support
B-LOC 0.90 0.79 0.84 536
B-ORG 0.80 0.79 0.79 821
B-PER 0.95 0.62 0.75 1575
I-LOC 0.96 0.76 0.85 181
I-ORG 0.88 0.81 0.84 217
I-PER 0.99 0.91 0.95 315
O 0.97 0.99 0.98 26217
accuracy - - 0.96 29862
macro-avg 0.92 0.81 0.86 29862
weighted-avg 0.96 0.96 0.96 29862