Description
This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 13 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
Predicted Entities
ADJ, ADP, ADV, AUX, CONJ, DET, NOUN, NUM, PART, PRON, PROPN, PUNCT, SYM, VERB, and X.
Live Demo Open in Colab Download Copy S3 URI
How to use
Use as part of an nlp pipeline after tokenization.
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh")\
.setInputCols(["sentence"])\
.setOutputCol("token")
pos = PerceptronModel.pretrained("pos_ud_gsd_trad", "zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
word_segmenter,
posTagger
])
example = spark.createDataFrame([['然而,這樣的處理也衍生了一些問題。']], ["text"])
result = pipeline.fit(example).transform(example)
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh")
.setInputCols("sentence")
.setOutputCol("token")
val pos = PerceptronModel.pretrained("pos_ud_gsd_trad", "zh")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos))
val data = Seq("然而,這樣的處理也衍生了一些問題。").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["""然而,這樣的處理也衍生了一些問題。"""]
pos_df = nlu.load('zh.pos.ud_gsd_trad').predict(text, output_level = "token")
pos_df
Results
+------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
|text |result |
+------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
|然而 , 這樣 的 處理 也 衍生 了 一些 問題 。 |[ADV, PUNCT, PRON, PART, NOUN, ADV, VERB, PART, ADJ, NOUN, PUNCT] |
+------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
Model Information
| Model Name: | pos_ud_gsd_trad |
| Compatibility: | Spark NLP 2.7.0+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence, token] |
| Output Labels: | [pos] |
| Language: | zh |
Data Source
The model was trained on the Universal Dependencies for Traditional Chinese annotated and converted by Google.
Benchmarking
| | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| ADJ | 0.70 | 0.68 | 0.69 | 272 |
| ADP | 0.85 | 0.86 | 0.85 | 535 |
| ADV | 0.90 | 0.90 | 0.90 | 549 |
| AUX | 0.88 | 0.88 | 0.88 | 281 |
| CCONJ | 0.92 | 0.87 | 0.89 | 191 |
| DET | 0.93 | 0.93 | 0.93 | 138 |
| NOUN | 0.88 | 0.92 | 0.90 | 3312 |
| NUM | 0.98 | 0.99 | 0.98 | 653 |
| PART | 0.97 | 0.94 | 0.95 | 1359 |
| PRON | 0.97 | 0.97 | 0.97 | 168 |
| PROPN | 0.89 | 0.84 | 0.86 | 1006 |
| PUNCT | 1.00 | 1.00 | 1.00 | 1688 |
| SYM | 1.00 | 1.00 | 1.00 | 3 |
| VERB | 0.86 | 0.83 | 0.85 | 1769 |
| X | 1.00 | 0.88 | 0.93 | 88 |
| accuracy | | | 0.91 | 12012 |
| macro avg | 0.91 | 0.90 | 0.91 | 12012 |
| weighted avg | 0.91 | 0.91 | 0.91 | 12012 |