Description
This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
Open in Colab Download Copy S3 URI
How to use
...
pos = PerceptronModel.pretrained("pos_ud_edt", "et") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate(["Lisaks sellele, et ta on põhjamaa kuningas, on John Snow inglise arst ning narkoosi ja meditsiinilise hügieeni arendamise juht."])
...
val pos = PerceptronModel.pretrained("pos_ud_edt", "et")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("Lisaks sellele, et ta on põhjamaa kuningas, on John Snow inglise arst ning narkoosi ja meditsiinilise hügieeni arendamise juht.").toDF("text")
val result = pipeline.fit(data).transform(data)
Results
{'pos': [Annotation(pos, 0, 5, NOUN, {'word': 'Lisaks'}),
Annotation(pos, 7, 13, PRON, {'word': 'sellele'}),
Annotation(pos, 14, 14, PUNCT, {'word': ','}),
Annotation(pos, 16, 17, SCONJ, {'word': 'et'}),
Annotation(pos, 19, 20, PRON, {'word': 'ta'}),
Annotation(pos, 22, 23, AUX, {'word': 'on'}),
Annotation(pos, 25, 32, NOUN, {'word': 'põhjamaa'}),
Annotation(pos, 34, 41, NOUN, {'word': 'kuningas'}),
Annotation(pos, 42, 42, PUNCT, {'word': ','}),
Annotation(pos, 44, 45, AUX, {'word': 'on'}),
Annotation(pos, 47, 50, PROPN, {'word': 'John'}),
Annotation(pos, 52, 55, PROPN, {'word': 'Snow'}),
Annotation(pos, 57, 63, ADJ, {'word': 'inglise'}),
Annotation(pos, 65, 68, NOUN, {'word': 'arst'}),
Annotation(pos, 70, 73, CCONJ, {'word': 'ning'}),
Annotation(pos, 75, 82, NOUN, {'word': 'narkoosi'}),
Annotation(pos, 84, 85, CCONJ, {'word': 'ja'}),
Annotation(pos, 87, 100, NOUN, {'word': 'meditsiinilise'}),
Annotation(pos, 102, 109, NOUN, {'word': 'hügieeni'}),
Annotation(pos, 111, 120, NOUN, {'word': 'arendamise'}),
Annotation(pos, 122, 125, NOUN, {'word': 'juht'}),
Annotation(pos, 126, 126, PUNCT, {'word': '.'})]}
Model Information
Model Name: | pos_ud_edt |
Compatibility: | Spark NLP 2.7.0+ |
Edition: | Official |
Input Labels: | [tags, document] |
Output Labels: | [pos] |
Language: | et |
Data Source
The model is trained on data obtained from https://universaldependencies.org
Benchmarking
| | | precision | recall | f1-score | support |
|---:|:-------------|:------------|:---------|-----------:|----------:|
| 0 | ADJ | 0.86 | 0.82 | 0.84 | 3655 |
| 1 | ADP | 0.91 | 0.92 | 0.91 | 838 |
| 2 | ADV | 0.95 | 0.95 | 0.95 | 4553 |
| 3 | AUX | 0.94 | 0.98 | 0.96 | 2426 |
| 4 | CCONJ | 0.99 | 0.98 | 0.98 | 1820 |
| 5 | DET | 0.82 | 0.74 | 0.78 | 752 |
| 6 | INTJ | 0.92 | 0.68 | 0.78 | 50 |
| 7 | NOUN | 0.92 | 0.95 | 0.94 | 11352 |
| 8 | NUM | 0.96 | 0.90 | 0.93 | 756 |
| 9 | PRON | 0.93 | 0.94 | 0.93 | 2350 |
| 10 | PROPN | 0.96 | 0.92 | 0.94 | 2619 |
| 11 | PUNCT | 1.00 | 1.00 | 1 | 6989 |
| 12 | SCONJ | 0.96 | 0.99 | 0.98 | 1048 |
| 13 | SYM | 1.00 | 0.72 | 0.84 | 18 |
| 14 | VERB | 0.93 | 0.91 | 0.92 | 4846 |
| 15 | X | 0.56 | 0.15 | 0.23 | 68 |
| 16 | accuracy | | | 0.94 | 44140 |
| 17 | macro avg | 0.91 | 0.85 | 0.87 | 44140 |
| 18 | weighted avg | 0.94 | 0.94 | 0.94 | 44140 |
PREVIOUSUrdu Lemmatizer