Description
This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
Open in Colab Download Copy S3 URI
How to use
...
pos = PerceptronModel.pretrained("pos_ud_udtb", "ur") \
.setInputCols(["document", "token"]) \
.setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate(["شمال کا بادشاہ ہونے کے علاوہ ، جان سن ایک انگریزی معالج ہے۔"])
...
val pos = PerceptronModel.pretrained("pos_ud_udtb", "ur")
.setInputCols(Array("document", "token"))
.setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("شمال کا بادشاہ ہونے کے علاوہ ، جان سن ایک انگریزی معالج ہے۔").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["""شمال کا بادشاہ ہونے کے علاوہ ، جان سن ایک انگریزی معالج ہے۔"""]
pos_df = nlu.load('ur.pos.ud_udtb').predict(text)
pos_df
Results
{'pos': [Annotation(pos, 0, 3, NOUN, {'word': 'شمال'}),
Annotation(pos, 5, 6, ADP, {'word': 'کا'}),
Annotation(pos, 8, 13, NOUN, {'word': 'بادشاہ'}),
Annotation(pos, 15, 18, VERB, {'word': 'ہونے'}),
Annotation(pos, 20, 21, ADP, {'word': 'کے'}),
Annotation(pos, 23, 27, ADP, {'word': 'علاوہ'}),
Annotation(pos, 29, 29, PUNCT, {'word': '،'}),
Annotation(pos, 31, 33, PROPN, {'word': 'جان'}),
Annotation(pos, 35, 36, PROPN, {'word': 'سن'}),
Annotation(pos, 38, 40, NUM, {'word': 'ایک'}),
Annotation(pos, 42, 48, PROPN, {'word': 'انگریزی'}),
Annotation(pos, 50, 54, ADJ, {'word': 'معالج'}),
Annotation(pos, 56, 58, PUNCT, {'word': 'ہے۔'})]}
Model Information
Model Name: | pos_ud_udtb |
Compatibility: | Spark NLP 2.7.0+ |
Edition: | Official |
Input Labels: | [tags, document] |
Output Labels: | [pos] |
Language: | ur |
Data Source
The model is trained on data obtained from https://universaldependencies.org
Benchmarking
| | | precision | recall | f1-score | support |
|---:|:-------------|:------------|:---------|-----------:|----------:|
| 0 | ADJ | 0.84 | 0.85 | 0.84 | 1117 |
| 1 | ADP | 0.98 | 0.99 | 0.98 | 3122 |
| 2 | ADV | 0.83 | 0.65 | 0.73 | 125 |
| 3 | AUX | 0.97 | 0.96 | 0.96 | 937 |
| 4 | CCONJ | 0.96 | 1.00 | 0.98 | 338 |
| 5 | DET | 0.87 | 0.82 | 0.84 | 237 |
| 6 | NOUN | 0.89 | 0.92 | 0.9 | 3690 |
| 7 | NUM | 0.97 | 0.95 | 0.96 | 267 |
| 8 | PART | 0.96 | 0.88 | 0.91 | 337 |
| 9 | PRON | 0.96 | 0.94 | 0.95 | 499 |
| 10 | PROPN | 0.88 | 0.85 | 0.86 | 1975 |
| 11 | PUNCT | 1.00 | 1.00 | 1 | 682 |
| 12 | SCONJ | 0.97 | 0.99 | 0.98 | 248 |
| 13 | VERB | 0.95 | 0.95 | 0.95 | 1232 |
| 14 | accuracy | | | 0.93 | 14806 |
| 15 | macro avg | 0.93 | 0.91 | 0.92 | 14806 |
| 16 | weighted avg | 0.93 | 0.93 | 0.93 | 14806 |