Named Entity Recognition for Chinese (BERT-Weibo Dataset)

Description

This model annotates named entities in a text, which can be used to find features such as the names of people, places, and organizations. The model does not read words directly; instead it reads word embeddings, which represent words as points in a vector space so that semantically similar words lie closer together.

This model uses the pre-trained bert_base_chinese embeddings, produced by the BertEmbeddings annotator, as an input, so be sure to use the same embeddings in the pipeline.
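To illustrate the embedding intuition above, here is a minimal sketch of measuring "closeness" with cosine similarity. The three 4-dimensional vectors are made up for the example; real bert_base_chinese embeddings are 768-dimensional and come from the pipeline below.

import numpy as np

# Hypothetical 4-d embeddings, invented for illustration only.
beijing  = np.array([0.8, 0.1, 0.3, 0.5])
shanghai = np.array([0.7, 0.2, 0.4, 0.5])
apple    = np.array([0.1, 0.9, 0.2, 0.1])

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction, near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(beijing, shanghai))  # high: two semantically similar words
print(cosine(beijing, apple))     # lower: unrelated words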

Predicted Entities

Tags ending in .NAM mark named (specific) mentions, while tags ending in .NOM mark nominal (generic) mentions.

| Tag     | Meaning                            | Example      |
|---------|------------------------------------|--------------|
| PER.NAM | Person name                        | 张三         |
| PER.NOM | Person code or category            | 穷人         |
| LOC.NAM | Specific location                  | 紫玉山庄     |
| LOC.NOM | Generic location                   | 大峡谷、宾馆 |
| GPE.NAM | Administrative regions and areas   | 北京         |
| ORG.NAM | Specific organization              | 通惠医院     |
| ORG.NOM | Generic or collective organization | 文艺公司     |


How to use

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, WordSegmenterModel, BertEmbeddings, NerDLModel, NerConverter
from pyspark.ml import Pipeline

spark = sparknlp.start()  # start or connect to a Spark NLP-enabled Spark session

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh") \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("ner_weibo_bert_768d", "zh") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

# Groups the token-level IOB tags into entity chunks.
ner_converter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter])
example = spark.createDataFrame([['张三去中国山东省泰安市爬中国五岳的泰山了']], ["text"])
result = pipeline.fit(example).transform(example)
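To inspect the labeled chunks, the output annotations can be exploded with plain Spark SQL; a minimal sketch, assuming the NerConverter above writes to the "ner_chunk" column:

result.selectExpr("explode(ner_chunk) AS chunk") \
    .selectExpr("chunk.result AS token", "chunk.metadata['entity'] AS ner") \
    .show(truncate=False)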

import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val ner = NerDLModel.pretrained("ner_weibo_bert_768d", "zh")
    .setInputCols(Array("document", "token", "embeddings"))
    .setOutputCol("ner")

// Groups the token-level IOB tags into entity chunks.
val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter))
val data = Seq("张三去中国山东省泰安市爬中国五岳的泰山了").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["张三去中国山东省泰安市爬中国五岳的泰山了"]

ner_df = nlu.load('zh.ner.weibo.bert_768d').predict(text)
ner_df
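NLU's predict() typically returns a pandas DataFrame, but the exact output column names vary across NLU versions, so it is worth listing them before selecting:

print(ner_df.columns)  # column names vary across NLU versions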

Results

+--------+-------+
|token   |ner    |
+--------+-------+
|张三    |PER.NAM|
|中国    |GPE.NAM|
|山东省  |GPE.NAM|
|中国五岳|GPE.NAM|
|泰山    |GPE.NAM|
+--------+-------+

Model Information

Model Name: ner_weibo_bert_768d
Type: ner
Compatibility: Spark NLP 2.7.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: zh
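The Input Labels describe the expected annotation columns by type. The examples above feed the "document" column where the labels say "sentence"; this works because Spark NLP validates annotation types (both carry DOCUMENT-type annotations) rather than column names, but feeding the SentenceDetector's "sentence" column matches the labels exactly.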

Data Source

The model was trained on the Weibo NER dataset (He and Sun, 2017).

Benchmarking

| ner_tag      | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| GPE.NAM      | 0.73      | 0.66   | 0.69     | 50      |
| GPE.NOM      | 0.00      | 0.00   | 0.00     | 2       |
| LOC.NAM      | 0.60      | 0.10   | 0.18     | 29      |
| LOC.NOM      | 0.20      | 0.10   | 0.13     | 10      |
| ORG.NAM      | 0.53      | 0.15   | 0.23     | 60      |
| ORG.NOM      | 0.50      | 0.28   | 0.36     | 18      |
| PER.NAM      | 0.66      | 0.61   | 0.63     | 139     |
| PER.NOM      | 0.68      | 0.63   | 0.66     | 197     |
| accuracy     |           |        | 0.96     | 9110    |
| macro avg    | 0.54      | 0.39   | 0.43     | 9110    |
| weighted avg | 0.96      | 0.96   | 0.96     | 9110    |
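The support of 9110 counts every token in the evaluation set, so the accuracy and weighted averages are dominated by the non-entity (O) tag, which suggests the averages also include that dominant class even though its row is not shown. The macro averages, which weight all classes equally, give a better sense of per-entity quality.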