Description
This model annotates the part of speech of each token in Thai text. It uses the 16 tags of the LST20 tag set listed below, such as NN (noun) and VV (verb). Part-of-speech tags are useful for automatically extracting the grammatical structure of a piece of text.
Predicted Entities
Tags | Name | Description |
---|---|---|
AJ | Adjective | Attribute, modifier, or description of a noun |
AV | Adverb | Word that modifies or qualifies an adjective, verb, or another adverb |
AX | Auxiliary | Tense, aspect, mood, and voice |
CC | Connector | Conjunction and relative pronoun |
CL | Classifier | Class or measurement unit to which a noun or an action belongs |
FX | Prefix | Inflectional (nominalizer, adjectivizer, adverbializer, and courteous verbalizer), and derivational |
IJ | Interjection | Exclamation word |
NG | Negator | Word of negation |
NN | Noun | Person, place, thing, abstract concept, and proper name |
NU | Number | Quantity for counting and calculation |
PA | Particle | Politeness, intention, belief, question |
PR | Pronoun | Word used to refer to an element in the discourse |
PS | Preposition | Location, comparison, instrument, exemplification |
PU | Punctuation | Punctuation mark |
VV | Verb | Action, state, occurrence, and word that forms the predicate part |
XX | Others | Unknown category |
How to use
Use as part of an NLP pipeline, after sentence detection and word segmentation (tokenization).
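# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, WordSegmenterModel, PerceptronModel

# Assumes an active Spark session with Spark NLP loaded, e.g. spark = sparknlp.start()
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
# Thai has no spaces between words, so a word segmenter produces the tokens
word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th") \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
# Input columns follow the model's input labels: [sentence, token]
pos = PerceptronModel.pretrained("pos_lst20", "th") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")
pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    word_segmenter,
    pos
])
example = spark.createDataFrame([['ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์']], ["text"])
result = pipeline.fit(example).transform(example)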
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for toDF; assumes an active SparkSession named spark

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val sentence_detector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th")
  .setInputCols("sentence")
  .setOutputCol("token")
// Input columns follow the model's input labels: [sentence, token]
val pos = PerceptronModel.pretrained("pos_lst20", "th")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos))
val data = Seq("ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์"]
pos_df = nlu.load('th.pos').predict(text, output_level = "token")
pos_df
Results
+-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
|text |result |
+-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
|ส่วน ผล กระทบ จาก โครงการ ดังกล่าว จะ ดำเนินการ นอก เขต อุทยาน แห่ง ชาติ นอก พื้นที่ ป่า อนุรักษ์|[CC, NN, VV, PS, NN, AJ, AX, VV, PS, NN, NN, PS, NN, PS, NN, NN, VV]|
+-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
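The tag list in the result is positional: the n-th tag belongs to the n-th segmented token. As a minimal sketch in plain Python (no Spark required), the snippet below pairs the tokens and tags copied from the table above and looks up each tag's name from the Predicted Entities table. The helper `pair_tokens_with_tags` and the `TAG_NAMES` dictionary are illustrative names, not part of Spark NLP.

```python
# Tokens and tags copied verbatim from the Results table.
tokens = "ส่วน ผล กระทบ จาก โครงการ ดังกล่าว จะ ดำเนินการ นอก เขต อุทยาน แห่ง ชาติ นอก พื้นที่ ป่า อนุรักษ์".split(" ")
tags = ["CC", "NN", "VV", "PS", "NN", "AJ", "AX", "VV", "PS",
        "NN", "NN", "PS", "NN", "PS", "NN", "NN", "VV"]

# Tag names taken from the Predicted Entities table (hypothetical lookup dict).
TAG_NAMES = {
    "AJ": "Adjective", "AV": "Adverb", "AX": "Auxiliary", "CC": "Connector",
    "CL": "Classifier", "FX": "Prefix", "IJ": "Interjection", "NG": "Negator",
    "NN": "Noun", "NU": "Number", "PA": "Particle", "PR": "Pronoun",
    "PS": "Preposition", "PU": "Punctuation", "VV": "Verb", "XX": "Others",
}


def pair_tokens_with_tags(tokens, tags):
    """Return (token, tag) pairs, failing loudly if the sequences are misaligned."""
    if len(tokens) != len(tags):
        raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags")
    return list(zip(tokens, tags))


pairs = pair_tokens_with_tags(tokens, tags)

# Example downstream use: keep only the tokens tagged as nouns.
nouns = [token for token, tag in pairs if tag == "NN"]

for token, tag in pairs:
    print(f"{token}\t{tag}\t{TAG_NAMES[tag]}")
```

The length check is a deliberate choice: a mismatch between token and tag counts usually means the wrong columns were paired, and `zip` alone would silently truncate.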
Model Information
Model Name: | pos_lst20 |
Compatibility: | Spark NLP 2.7.0+ |
Edition: | Official |
Input Labels: | [sentence, token] |
Output Labels: | [pos] |
Language: | th |
Data Source
The model was trained on the LST20 Corpus from the National Electronics and Computer Technology Center (NECTEC).
Benchmarking
| pos_tag | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| AJ | 0.73 | 0.66 | 0.69 | 4403 |
| AV | 0.71 | 0.61 | 0.66 | 6722 |
| AX | 0.76 | 0.75 | 0.76 | 7556 |
| CC | 0.77 | 0.77 | 0.77 | 17613 |
| CL | 0.68 | 0.63 | 0.65 | 3739 |
| FX | 0.78 | 0.76 | 0.77 | 6918 |
| IJ | 0.00 | 0.00 | 0.00 | 4 |
| NG | 0.82 | 0.80 | 0.81 | 1694 |
| NN | 0.82 | 0.81 | 0.81 | 58540 |
| NU | 0.75 | 0.71 | 0.73 | 6256 |
| PA | 0.74 | 0.84 | 0.79 | 194 |
| PR | 0.76 | 0.75 | 0.76 | 2139 |
| PS | 0.75 | 0.72 | 0.73 | 10886 |
| PU | 0.42 | 0.80 | 0.55 | 4769 |
| VV | 0.79 | 0.78 | 0.78 | 42586 |
| XX | 0.00 | 0.00 | 0.00 | 26 |
| accuracy     |           |        | 0.77     | 174045  |
| macro avg | 0.64 | 0.65 | 0.64 | 174045 |
| weighted avg | 0.77 | 0.77 | 0.77 | 174045 |
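The two summary rows differ in how they weight the tags: the macro average is the unweighted mean over all 16 tags, while the weighted average weights each tag by its support. The sketch below recomputes both from the per-tag rows of the table. It assumes the usual definitions of these averages, and because the inputs are already rounded to two decimals, the results only approximately reproduce the reported figures. The function names are illustrative.

```python
# (tag, precision, recall, f1, support) copied from the per-tag benchmark rows.
rows = [
    ("AJ", 0.73, 0.66, 0.69, 4403),
    ("AV", 0.71, 0.61, 0.66, 6722),
    ("AX", 0.76, 0.75, 0.76, 7556),
    ("CC", 0.77, 0.77, 0.77, 17613),
    ("CL", 0.68, 0.63, 0.65, 3739),
    ("FX", 0.78, 0.76, 0.77, 6918),
    ("IJ", 0.00, 0.00, 0.00, 4),
    ("NG", 0.82, 0.80, 0.81, 1694),
    ("NN", 0.82, 0.81, 0.81, 58540),
    ("NU", 0.75, 0.71, 0.73, 6256),
    ("PA", 0.74, 0.84, 0.79, 194),
    ("PR", 0.76, 0.75, 0.76, 2139),
    ("PS", 0.75, 0.72, 0.73, 10886),
    ("PU", 0.42, 0.80, 0.55, 4769),
    ("VV", 0.79, 0.78, 0.78, 42586),
    ("XX", 0.00, 0.00, 0.00, 26),
]

PRECISION, RECALL, F1, SUPPORT = 1, 2, 3, 4  # column indices into each row


def macro_avg(rows, col):
    """Unweighted mean of a metric: every tag counts equally."""
    return sum(row[col] for row in rows) / len(rows)


def weighted_avg(rows, col):
    """Mean of a metric weighted by each tag's support (number of test tokens)."""
    total = sum(row[SUPPORT] for row in rows)
    return sum(row[col] * row[SUPPORT] for row in rows) / total


total_support = sum(row[SUPPORT] for row in rows)
print("total support:", total_support)
print("macro avg f1:", round(macro_avg(rows, F1), 2))
print("weighted avg f1:", round(weighted_avg(rows, F1), 2))
```

The gap between the two averages comes mostly from rare tags: IJ and XX have only 4 and 26 test tokens and score 0.00, which pulls the macro average down but barely affects the weighted one.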