Thai Word Segmentation

Description

WordSegmenterModel (WSM) is based on a maximum entropy probability model that detects word boundaries in Thai text. Thai text is written without white space between words, so a computer-based application cannot know a priori which sequence of characters forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
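
Following Xue (2003), segmentation can be cast as per-character tagging: each character receives a position label (for example B, M, E, S for begin, middle, end, single-character word), and words are recovered by reading the labels left to right. The sketch below only illustrates that idea; the tag scheme and the decoding function are assumptions for illustration and are not part of the Spark NLP API.

# Illustrative only: decode a BMES character-tag sequence back into words.
# The tag names and this helper are assumptions, not Spark NLP code.
def decode_bmes(chars, tags):
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):      # end of a word, or a single-character word
            words.append(current)
            current = ""
    if current:                    # flush any trailing partial word
        words.append(current)
    return words

# "จะ" (two characters) followed by "ถึง" (three characters)
print(decode_bmes(list("จะถึง"), ["B", "E", "B", "M", "E"]))  # ['จะ', 'ถึง']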

References:

  • Xue, Nianwen. “Chinese word segmentation as character tagging.” International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing.


How to use

Use as part of an NLP pipeline as a substitute for the Tokenizer stage.

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import WordSegmenterModel
from pyspark.ml import Pipeline

# Convert raw text into a document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Load the pretrained Thai word segmenter and emit tokens
word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th") \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, word_segmenter])

example = spark.createDataFrame([["จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ"]], ["text"])
result = pipeline.fit(example).transform(example)
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

// Convert raw text into a document annotation
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Load the pretrained Thai word segmenter and emit tokens
val word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th")
    .setInputCols("document")
    .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))

val data = Seq("จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu

text = ["""Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส"""]
token_df = nlu.load('th.segment_words').predict(text)
token_df
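
The segmented words live in the `result` field of the `token` annotation column. A minimal sketch for inspecting them from the Spark pipeline above (assuming the `result` DataFrame produced there) is:

# Show the original text next to the segmented words (yields the table below)
result.selectExpr("text", "token.result as result").show(truncate=False)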

Results

+-----------------------------------+---------------------------------------------------------+
|text                               |result                                                   |
+-----------------------------------+---------------------------------------------------------+
|จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ|[จวน, จะ, ถึง, ร้าน, ที่, คุณ, จอง, โต๊ะ, ไว้, แล้ว, จ้ะ]|
+-----------------------------------+---------------------------------------------------------+

Model Information

Model Name: wordseg_best
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [document]
Output Labels: [token]
Language: th

Data Source

The model was trained on the BEST corpus from the National Electronics and Computer Technology Center (NECTEC).

References:

  • Krit Kosawat, Monthika Boriboon, Patcharika Chootrakool, Ananlada Chotimongkol, Supon Klaithin, Sarawoot Kongyoung, Kanyanut Kriengket, Sitthaa Phaholphinyo, Sumonmas Purodakananda, Tipraporn Thanakulwarapas, and Chai Wutiwiwatchai, “BEST 2009: Thai word segmentation software contest,” in Proc. 8th Int. Symp. Natural Language Process. (SNLP), Bangkok, Thailand, Oct. 20-22, 2009, pp. 83-88.
  • Monthika Boriboon, Kanyanut Kriengket, Patcharika Chootrakool, Sitthaa Phaholphinyo, Sumonmas Purodakananda, Tipraporn Thanakulwarapas, and Krit Kosawat, “BEST corpus development and analysis,” in Proc. 2nd Int. Conf. Asian Language Process. (IALP), Singapore, Dec. 7-9, 2009, pp. 322-327.

Benchmarking

| Model        | precision | recall | f1-score |
|--------------|-----------|--------|----------|
| WORDSEG_BEST | 0.4791    | 0.6245 | 0.5422   |
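
Precision, recall, and f1-score are standard word segmentation metrics. A common way to compute them, shown below as a hedged sketch rather than the exact evaluation script used for this model, is to compare the character spans of predicted words against gold words:

# Span-level segmentation evaluation: an illustrative sketch, not the official script.
def segmentation_f1(gold_words, pred_words):
    def spans(words):
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))   # (start, end) character offsets of each word
            start += len(w)
        return out

    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)                 # predicted words that exactly match a gold word
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Gold: จวน | จะ | ถึง   Predicted: จวนจะ | ถึง  ->  (0.5, 0.333..., 0.4)
print(segmentation_f1(["จวน", "จะ", "ถึง"], ["จวนจะ", "ถึง"]))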