Part of Speech for Thai (pos_lst20)

Description

This model annotates each token in a Thai text with its part of speech. It uses the 16 tags of the LST20 tagset listed under Predicted Entities, such as NN (noun), VV (verb), and PR (pronoun). Part-of-speech tags are useful for automatically extracting the grammatical structure of a text.

Predicted Entities

Tags	Name	Description
AJ	Adjective	Attribute, modifier, or description of a noun
AV	Adverb	Word that modifies or qualifies an adjective, verb, or another adverb
AX	Auxiliary	Tense, aspect, mood, and voice
CC	Connector	Conjunction and relative pronoun
CL	Classifier	Class or measurement unit to which a noun or an action belongs
FX	Prefix	Inflectional (nominalizer, adjectivizer, adverbializer, and courteous verbalizer), and derivational
IJ	Interjection	Exclamation word
NG	Negator	Word of negation
NN	Noun	Person, place, thing, abstract concept, and proper name
NU	Number	Quantity for counting and calculation
PA	Particle	Politeness, intention, belief, question
PR	Pronoun	Word used to refer to an element in the discourse
PS	Preposition	Location, comparison, instrument, exemplification
PU	Punctuation	Punctuation mark
VV	Verb	Action, state, occurrence, and word that forms the predicate part
XX	Others	Unknown category
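
When post-processing the tagger's output, it can help to translate the two-letter tags back into readable names. The following is a minimal sketch in plain Python; the `LST20_TAGS` dictionary and the `tag_name` helper are illustrative names, not part of Spark NLP, and the mapping simply restates the table above.

```python
# Illustrative lookup table (not part of Spark NLP): LST20 tag -> name,
# copied from the Predicted Entities table.
LST20_TAGS = {
    "AJ": "Adjective",
    "AV": "Adverb",
    "AX": "Auxiliary",
    "CC": "Connector",
    "CL": "Classifier",
    "FX": "Prefix",
    "IJ": "Interjection",
    "NG": "Negator",
    "NN": "Noun",
    "NU": "Number",
    "PA": "Particle",
    "PR": "Pronoun",
    "PS": "Preposition",
    "PU": "Punctuation",
    "VV": "Verb",
    "XX": "Others",
}


def tag_name(tag: str) -> str:
    """Return the readable name for an LST20 tag, falling back to 'Others'."""
    return LST20_TAGS.get(tag, LST20_TAGS["XX"])


print(tag_name("NN"))  # Noun
```

Falling back to the XX name for unrecognized input is a design choice of this sketch; raising an error may be preferable when unexpected tags should not pass silently.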


How to use

Use the model as part of an NLP pipeline, after sentence detection and word segmentation (tokenization).

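from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, WordSegmenterModel, PerceptronModel

# Assumes an active Spark NLP session, e.g. spark = sparknlp.start()
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Thai is written without spaces between words, so a word segmenter produces the tokens
word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th") \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Input columns follow the model's input labels: [sentence, token]
pos = PerceptronModel.pretrained("pos_lst20", "th") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    word_segmenter,
    pos
])

example = spark.createDataFrame([['ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์']], ["text"])

result = pipeline.fit(example).transform(example)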

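import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.ws.WordSegmenterModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import org.apache.spark.ml.Pipeline
import spark.implicits._ // assumes a SparkSession named spark; needed for toDF

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th")
    .setInputCols("sentence")
    .setOutputCol("token")

// Input columns follow the model's input labels: [sentence, token]
val pos = PerceptronModel.pretrained("pos_lst20", "th")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos))

val data = Seq("ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์").toDF("text")
val result = pipeline.fit(data).transform(data)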

import nlu

text = ["ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์"]
pos_df = nlu.load('th.pos').predict(text, output_level = "token")
pos_df

Results

+-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
|text                                                                                             |result                                                              |
+-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
|ส่วน ผล กระทบ จาก โครงการ ดังกล่าว จะ ดำเนินการ นอก เขต อุทยาน แห่ง ชาติ นอก พื้นที่ ป่า อนุรักษ์|[CC, NN, VV, PS, NN, AJ, AX, VV, PS, NN, NN, PS, NN, PS, NN, NN, VV]|
+-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
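
The `text` and `result` columns above line up position by position, so tokens and tags can be paired directly. Below is a small sketch in plain Python that uses the values from the row above; `tokens`, `tags`, `tagged`, and `nouns` are illustrative variable names, and in a real pipeline the two lists would presumably be read from the `token` and `pos` output columns instead of being typed in.

```python
# Values copied from the Results row above (segmented tokens and their tags).
tokens = "ส่วน ผล กระทบ จาก โครงการ ดังกล่าว จะ ดำเนินการ นอก เขต อุทยาน แห่ง ชาติ นอก พื้นที่ ป่า อนุรักษ์".split()
tags = ["CC", "NN", "VV", "PS", "NN", "AJ", "AX", "VV", "PS",
        "NN", "NN", "PS", "NN", "PS", "NN", "NN", "VV"]

# Each token must have exactly one tag.
if len(tokens) != len(tags):
    raise ValueError("token and tag counts differ")

# Pair every token with its tag, then keep only the nouns (NN).
tagged = list(zip(tokens, tags))
nouns = [token for token, tag in tagged if tag == "NN"]

for token, tag in tagged:
    print(f"{token}\t{tag}")
print("nouns:", nouns)
```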

Model Information

Model Name:	pos_lst20
Compatibility:	Spark NLP 2.7.0+
Edition:	Official
Input Labels:	[sentence, token]
Output Labels:	[pos]
Language:	th

Data Source

The model was trained on the LST20 Corpus from the National Electronics and Computer Technology Center (NECTEC).

Benchmarking

| pos_tag      | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| AJ           | 0.73      | 0.66   | 0.69     | 4403    |
| AV           | 0.71      | 0.61   | 0.66     | 6722    |
| AX           | 0.76      | 0.75   | 0.76     | 7556    |
| CC           | 0.77      | 0.77   | 0.77     | 17613   |
| CL           | 0.68      | 0.63   | 0.65     | 3739    |
| FX           | 0.78      | 0.76   | 0.77     | 6918    |
| IJ           | 0.00      | 0.00   | 0.00     | 4       |
| NG           | 0.82      | 0.80   | 0.81     | 1694    |
| NN           | 0.82      | 0.81   | 0.81     | 58540   |
| NU           | 0.75      | 0.71   | 0.73     | 6256    |
| PA           | 0.74      | 0.84   | 0.79     | 194     |
| PR           | 0.76      | 0.75   | 0.76     | 2139    |
| PS           | 0.75      | 0.72   | 0.73     | 10886   |
| PU           | 0.42      | 0.80   | 0.55     | 4769    |
| VV           | 0.79      | 0.78   | 0.78     | 42586   |
| XX           | 0.00      | 0.00   | 0.00     | 26      |
| accuracy     |           |        | 0.77     | 174045  |
| macro avg    | 0.64      | 0.65   | 0.64     | 174045  |
| weighted avg | 0.77      | 0.77   | 0.77     | 174045  |
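
The two summary rows can be reproduced from the per-tag rows: the macro average is the unweighted mean over the 16 tags, while the weighted average weights each tag by its support. The sketch below, in plain Python, uses the rounded f1-score and support values from the table, so it agrees with the reported figures only to about two decimals; `rows`, `macro_f1`, and `weighted_f1` are illustrative names.

```python
# (f1-score, support) per tag, copied from the benchmarking table.
rows = {
    "AJ": (0.69, 4403),  "AV": (0.66, 6722),  "AX": (0.76, 7556),
    "CC": (0.77, 17613), "CL": (0.65, 3739),  "FX": (0.77, 6918),
    "IJ": (0.00, 4),     "NG": (0.81, 1694),  "NN": (0.81, 58540),
    "NU": (0.73, 6256),  "PA": (0.79, 194),   "PR": (0.76, 2139),
    "PS": (0.73, 10886), "PU": (0.55, 4769),  "VV": (0.78, 42586),
    "XX": (0.00, 26),
}

total_support = sum(support for _, support in rows.values())

# Macro average: every tag counts equally, regardless of frequency.
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: each tag contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(total_support)          # 174045
print(round(macro_f1, 2))     # 0.64
print(round(weighted_f1, 2))  # 0.77
```

The gap between the two averages comes largely from the rare IJ and XX tags, whose 0.00 scores count fully in the macro average but barely affect the weighted one.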
