Part of Speech for Thai (pos_lst20)

Description

This model annotates each token in a Thai text with its part of speech. It uses the LST20 tag set of 16 tags, including NN (noun), VV (verb), and PR (pronoun); the full list is given under Predicted Entities. Part-of-speech tags are useful for automatically extracting the grammatical structure of a piece of text.

Predicted Entities

| Tag | Name         | Description                                                                                          |
|-----|--------------|------------------------------------------------------------------------------------------------------|
| AJ  | Adjective    | Attribute, modifier, or description of a noun                                                        |
| AV  | Adverb       | Word that modifies or qualifies an adjective, verb, or another adverb                                |
| AX  | Auxiliary    | Tense, aspect, mood, and voice                                                                       |
| CC  | Connector    | Conjunction and relative pronoun                                                                     |
| CL  | Classifier   | Class or measurement unit to which a noun or an action belongs                                       |
| FX  | Prefix       | Inflectional (nominalizer, adjectivizer, adverbializer, and courteous verbalizer), and derivational |
| IJ  | Interjection | Exclamation word                                                                                     |
| NG  | Negator      | Word of negation                                                                                     |
| NN  | Noun         | Person, place, thing, abstract concept, and proper name                                              |
| NU  | Number       | Quantity for counting and calculation                                                                |
| PA  | Particle     | Politeness, intention, belief, question                                                              |
| PR  | Pronoun      | Word used to refer to an element in the discourse                                                    |
| PS  | Preposition  | Location, comparison, instrument, exemplification                                                    |
| PU  | Punctuation  | Punctuation mark                                                                                     |
| VV  | Verb         | Action, state, occurrence, and word that forms the predicate part                                    |
| XX  | Others       | Unknown category                                                                                     |
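
When decoding tagger output it can help to have the tag set available as a plain lookup table. The following is a minimal sketch in plain Python, not part of the Spark NLP API: the names `LST20_TAGS` and `tag_name` are hypothetical, and returning the raw tag for unknown input is an assumption made for illustration.

```python
# Tag codes and names copied from the Predicted Entities table.
LST20_TAGS = {
    "AJ": "Adjective",
    "AV": "Adverb",
    "AX": "Auxiliary",
    "CC": "Connector",
    "CL": "Classifier",
    "FX": "Prefix",
    "IJ": "Interjection",
    "NG": "Negator",
    "NN": "Noun",
    "NU": "Number",
    "PA": "Particle",
    "PR": "Pronoun",
    "PS": "Preposition",
    "PU": "Punctuation",
    "VV": "Verb",
    "XX": "Others",
}


def tag_name(tag):
    """Return the readable name for a tag code (hypothetical helper).

    Unknown codes are returned unchanged; this fallback is an assumption.
    """
    return LST20_TAGS.get(tag, tag)


print(tag_name("NN"))
print(tag_name("PS"))
```

Keeping the mapping in one dictionary means the same table can drive both display code and any sanity check on the tags a pipeline returns.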


How to use

Use as part of an NLP pipeline, after tokenization (word segmentation for Thai).

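# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, WordSegmenterModel, PerceptronModel

# Assumes an active Spark NLP session named `spark`, e.g. spark = sparknlp.start()

# Wrap the raw text column as a document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into sentences
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Segment Thai text into word tokens
word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th") \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Tag each token with its LST20 part of speech
pos = PerceptronModel.pretrained("pos_lst20", "th") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    word_segmenter,
    pos
])

example = spark.createDataFrame([['ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์']], ["text"])

result = pipeline.fit(example).transform(example)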

// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for toDF; assumes an active SparkSession named `spark`

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th")
    .setInputCols("sentence")
    .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_lst20", "th")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos))

val data = Seq("ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์").toDF("text")
val result = pipeline.fit(data).transform(data)

# Python (NLU)
import nlu

text = ["ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์"]
pos_df = nlu.load('th.pos').predict(text, output_level = "token")
pos_df

Results

+-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
|text                                                                                             |result                                                              |
+-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
|ส่วน ผล กระทบ จาก โครงการ ดังกล่าว จะ ดำเนินการ นอก เขต อุทยาน แห่ง ชาติ นอก พื้นที่ ป่า อนุรักษ์|[CC, NN, VV, PS, NN, AJ, AX, VV, PS, NN, NN, PS, NN, PS, NN, NN, VV]|
+-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
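
The two columns in the table above are parallel arrays: the i-th tag belongs to the i-th token. As a rough sketch of how such output might be consumed, the plain-Python snippet below pairs them up and pulls out the nouns. The token and tag values are copied from the table; the variable names are illustrative only, and in a real pipeline the arrays would typically be read from the `token.result` and `pos.result` columns of the transformed DataFrame.

```python
from collections import Counter

# Tokens and tags copied from the Results table above.
tokens = "ส่วน ผล กระทบ จาก โครงการ ดังกล่าว จะ ดำเนินการ นอก เขต อุทยาน แห่ง ชาติ นอก พื้นที่ ป่า อนุรักษ์".split(" ")
tags = ["CC", "NN", "VV", "PS", "NN", "AJ", "AX", "VV", "PS",
        "NN", "NN", "PS", "NN", "PS", "NN", "NN", "VV"]

# Pair each token with its tag; both lists have the same length.
tagged = list(zip(tokens, tags))

# Keep only the tokens tagged as nouns (NN).
nouns = [token for token, tag in tagged if tag == "NN"]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tags)

for token, tag in tagged:
    print(f"{token}\t{tag}")
```

Zipping the arrays rather than indexing them separately keeps token and tag together, which makes later filtering or grouping by tag straightforward.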

Model Information

Model Name: pos_lst20
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [sentence, token]
Output Labels: [pos]
Language: th

Data Source

The model was trained on the LST20 Corpus from the National Electronics and Computer Technology Center (NECTEC).

Benchmarking

| pos_tag      | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| AJ           | 0.73      | 0.66   | 0.69     | 4403    |
| AV           | 0.71      | 0.61   | 0.66     | 6722    |
| AX           | 0.76      | 0.75   | 0.76     | 7556    |
| CC           | 0.77      | 0.77   | 0.77     | 17613   |
| CL           | 0.68      | 0.63   | 0.65     | 3739    |
| FX           | 0.78      | 0.76   | 0.77     | 6918    |
| IJ           | 0.00      | 0.00   | 0.00     | 4       |
| NG           | 0.82      | 0.80   | 0.81     | 1694    |
| NN           | 0.82      | 0.81   | 0.81     | 58540   |
| NU           | 0.75      | 0.71   | 0.73     | 6256    |
| PA           | 0.74      | 0.84   | 0.79     | 194     |
| PR           | 0.76      | 0.75   | 0.76     | 2139    |
| PS           | 0.75      | 0.72   | 0.73     | 10886   |
| PU           | 0.42      | 0.80   | 0.55     | 4769    |
| VV           | 0.79      | 0.78   | 0.78     | 42586   |
| XX           | 0.00      | 0.00   | 0.00     | 26      |
| accuracy     |           |        | 0.77     | 174045  |
| macro avg    | 0.64      | 0.65   | 0.64     | 174045  |
| weighted avg | 0.77      | 0.77   | 0.77     | 174045  |
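
The macro average row is the unweighted mean of the per-tag scores, which is why it sits well below the weighted average: the rare IJ and XX tags, both scored 0.00, count as much as NN or VV. The sketch below recomputes the macro averages and the total support from the per-tag rows of the table. The `scores` dictionary and `macro_average` helper are illustrative names, and because the inputs are already rounded to two decimals, the results only agree with the table after rounding.

```python
# (precision, recall, f1-score, support) per tag, copied from the table above.
scores = {
    "AJ": (0.73, 0.66, 0.69, 4403),
    "AV": (0.71, 0.61, 0.66, 6722),
    "AX": (0.76, 0.75, 0.76, 7556),
    "CC": (0.77, 0.77, 0.77, 17613),
    "CL": (0.68, 0.63, 0.65, 3739),
    "FX": (0.78, 0.76, 0.77, 6918),
    "IJ": (0.00, 0.00, 0.00, 4),
    "NG": (0.82, 0.80, 0.81, 1694),
    "NN": (0.82, 0.81, 0.81, 58540),
    "NU": (0.75, 0.71, 0.73, 6256),
    "PA": (0.74, 0.84, 0.79, 194),
    "PR": (0.76, 0.75, 0.76, 2139),
    "PS": (0.75, 0.72, 0.73, 10886),
    "PU": (0.42, 0.80, 0.55, 4769),
    "VV": (0.79, 0.78, 0.78, 42586),
    "XX": (0.00, 0.00, 0.00, 26),
}


def macro_average(column):
    """Unweighted mean of one metric column across all tags."""
    values = [row[column] for row in scores.values()]
    return sum(values) / len(values)


macro_precision = macro_average(0)
macro_recall = macro_average(1)
macro_f1 = macro_average(2)

# Total number of evaluated tokens across all tags.
total_support = sum(row[3] for row in scores.values())

print(round(macro_precision, 2), round(macro_recall, 2), round(macro_f1, 2), total_support)
```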