Chinese Word Segmentation

Description

WordSegmenterModel (WSM) is based on a maximum entropy probability model that detects word boundaries in Chinese text. Chinese text is written without white space between words, so a computer-based application cannot know a priori which sequences of ideograms form a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.

For this model, we created a large curated data set from the Chinese Treebank, Weibo, and SIGHAN 2005 data sets, and trained the model as described in the research paper (Xue, Nianwen. "Chinese Word Segmentation as Character Tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing).
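To make the character-tagging formulation concrete, here is a minimal sketch in plain Python, independent of Spark NLP. The B/M/E/S tag set used here is a common choice for this task and is illustrative only; it is not necessarily the tag set used inside the pretrained model.

# Illustrative sketch: segmentation as per-character tagging.
# Each character gets a positional tag within its word:
# B = word-initial, M = word-internal, E = word-final, S = single-character word.

def tag_characters(words):
    """Map a segmented sentence (list of words) to (character, tag) pairs."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append((word, "S"))
        else:
            tags.append((word[0], "B"))
            tags.extend((c, "M") for c in word[1:-1])
            tags.append((word[-1], "E"))
    return tags

print(tag_characters(["然而", "，", "这样", "的", "处理"]))
# [('然', 'B'), ('而', 'E'), ('，', 'S'), ('这', 'B'), ('样', 'E'), ('的', 'S'), ...]

A tagger trained on such pairs reduces segmentation to sequence labeling: predicting a tag per character recovers the word boundaries.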


How to use

Python

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import WordSegmenterModel
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh") \
    .setInputCols("document") \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, word_segmenter])

# The pretrained model needs no training data, so fit on an empty DataFrame.
ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

example = spark.createDataFrame([["然而,这样的处理也衍生了一些问题。"]], ["text"])
result = ws_model.transform(example)
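The token column produced above holds Spark NLP annotation structs; one way to list the segmented words is to explode their result field with standard Spark SQL (a minimal sketch that assumes only the column names set in the pipeline above):

from pyspark.sql import functions as F

# Each row of "token" is an array of annotation structs; ".result" extracts
# the token strings, and explode() yields one word per output row.
result.select(F.explode(F.col("token.result")).alias("word")).show(truncate=False)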
Scala

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val wordSegmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")
  .setInputCols("document")
  .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(documentAssembler, wordSegmenter))
val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text")
val result = pipeline.fit(data).transform(data)
NLU

import nlu

text = ["然而,这样的处理也衍生了一些问题。"]
token_df = nlu.load('zh.segment_words.large').predict(text, output_level='token')
token_df

Results

+----------------------------------+--------------------------------------------------------+
|text                              |result                                                  |
+----------------------------------+--------------------------------------------------------+
|然而,这样的处理也衍生了一些问题。|[然而, ,, 这样, 的, 处理, 也, 衍生, 了, 一些, 问题, 。]|
+----------------------------------+--------------------------------------------------------+

Model Information

Model Name: wordseg_large
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [document]
Output Labels: [token]
Language: zh

Data Source

cn_wordseg_large_train.chartag

Benchmarking

| Model         | Precision | Recall | F1-score |
|---------------|-----------|--------|----------|
| WORDSEG_CTB   | 0.6453    | 0.6341 | 0.6397   |
| WORDSEG_WEIBO | 0.5454    | 0.5655 | 0.5553   |
| WORDSEG_MSRA  | 0.5984    | 0.6088 | 0.6035   |
| WORDSEG_PKU   | 0.6094    | 0.6321 | 0.6206   |
| WORDSEG_LARGE | 0.6326    | 0.6269 | 0.6297   |
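The scores above are word-level precision, recall, and F1. For reference, these metrics are conventionally computed from the overlap of word spans between a gold and a predicted segmentation; the sketch below shows that convention (a common definition, not necessarily the exact evaluation script used here):

# Minimal sketch: word-level precision/recall/F1 from character-span overlap.
# Assumes gold and predicted segmentations of the same sentence as word lists.

def spans(words):
    """Convert a word list to a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def prf(gold_words, pred_words):
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred)
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(prf(["然而", "，", "这样"], ["然", "而", "，", "这样"]))
# (0.5, 0.666..., 0.571...): 2 of 4 predicted words and 2 of 3 gold words match.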