Japanese Word Segmentation

Description

WordSegmenterModel (WSM) is based on a maximum entropy probability model that detects word boundaries in Japanese text. Japanese text is written without white space between words, so a computer-based application cannot know a priori which sequence of characters forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.

References:

  • Xue, Nianwen. “Chinese Word Segmentation as Character Tagging.” International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing.
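
The reference above formulates segmentation as character tagging: each character is labeled with its position inside a word, and word boundaries are read off the predicted tag sequence. The sketch below illustrates that idea with a hypothetical BMES tag set and helper function; it is for illustration only and is not part of the Spark NLP API.

# Minimal sketch of the character-tagging idea (illustrative only; not the model's internals)
def tags_to_words(chars, tags):
    """Group characters into words from B(egin)/M(iddle)/E(nd)/S(ingle) character tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":              # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":            # first character of a multi-character word
            if current:
                words.append(current)
            current = ch
        else:                       # "M" or "E": continue the current word
            current += ch
            if tag == "E":
                words.append(current)
                current = ""
    if current:
        words.append(current)
    return words

print(tags_to_words(list("現代の行政区分"), ["B", "E", "S", "B", "E", "B", "E"]))
# ['現代', 'の', '行政', '区分']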

How to use

Use as part of an NLP pipeline as a substitute for the Tokenizer stage.

# Spark NLP pipeline (Python)
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import WordSegmenterModel

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, word_segmenter])

# Fit on an empty DataFrame (the model is pretrained), then segment the example sentence
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([["清代は湖北省が置かれ、そのまま現代の行政区分になっている。"]], ["text"])
result = model.transform(example)
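
To inspect the segmented tokens in the result DataFrame, the token annotations can be exploded into one row per token (a standard Spark NLP inspection pattern; the column name follows the setOutputCol("token") call above):

result.selectExpr("explode(token.result) AS token").show(truncate=False)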
// Spark NLP pipeline (Scala)
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ws.WordSegmenterModel
import org.apache.spark.ml.Pipeline
import spark.implicits._   // for .toDF on Seq

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
  .setInputCols("document")
  .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("清代は湖北省が置かれ、そのまま現代の行政区分になっている。").toDF("text")
val result = pipeline.fit(data).transform(data)
# NLU one-liner
import nlu

text = ["""清代は湖北省が置かれ、そのまま現代の行政区分になっている。"""]
token_df = nlu.load('ja.segment_words').predict(text, output_level='token')
token_df

Results

+----------------------------------------------------------+------------------------------------------------------------------------------------------------+
|text                                                      |result                                                                                          |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------+
|清代は湖北省が置かれ、そのまま現代の行政区分になっている。|[清代, は, 湖北, 省, が, 置か, れ, 、, その, まま, 現代, の, 行政, 区分, に, なっ, て, いる, 。]|
+----------------------------------------------------------+------------------------------------------------------------------------------------------------+

Model Information

Model Name: wordseg_gsd_ud
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [document]
Output Labels: [token]
Language: ja

Data Source

We trained this model on the Universal Dependencies data set from Google (GSD-UD).

Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018.

Benchmarking

| Model         | Precision    | Recall       | F1-score     |
|---------------|--------------|--------------|--------------|
| JA_UD-GSD     |      0.7687  |      0.8048  |      0.7863  |
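
As a quick sanity check, the F1 score above is the harmonic mean of the reported precision and recall; the values below are taken from the table:

precision, recall = 0.7687, 0.8048
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))   # 0.7863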