Description
This model annotates each token in a Persian text with its part of speech. The tag set includes PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. Part-of-speech tags are useful for automatically extracting the grammatical structure of a piece of text.
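For reference, a total of 17 labels matches the size of the Universal Dependencies universal POS (UPOS) tag set. The sketch below lists those tags in Python. That this model emits exactly this set is an assumption, and `UPOS_TAGS` and `describe` are illustrative names, not part of Spark NLP.

```python
# The 17 universal POS tags defined by Universal Dependencies.
# Assumption: the model's labels correspond to this tag set.
UPOS_TAGS = {
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
}


def describe(tag):
    """Return a readable name for a UPOS tag, or a fallback for unknown labels."""
    return UPOS_TAGS.get(tag, "unknown tag")


print(describe("PRON"), "/", describe("CCONJ"))
```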
How to use
# Python
...
pos = PerceptronModel.pretrained("pos_ud_perdt", "fa") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate(["جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است."])
// Scala
...
val pos = PerceptronModel.pretrained("pos_ud_perdt", "fa")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است.").toDF("text")
val result = pipeline.fit(data).transform(data)
# NLU
import nlu
text = ["""جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است"""]
pos_df = nlu.load('fa.pos').predict(text)
pos_df
Results
{'pos': [Annotation(pos, 0, 2, NOUN, {'word': 'جان'}),
Annotation(pos, 4, 7, NOUN, {'word': 'اسنو'}),
Annotation(pos, 9, 11, ADJ, {'word': 'جدا'}),
Annotation(pos, 13, 14, ADP, {'word': 'از'}),
Annotation(pos, 16, 20, NOUN, {'word': 'سلطنت'}),
Annotation(pos, 22, 25, NOUN, {'word': 'شمال'}),
Annotation(pos, 27, 27, PUNCT, {'word': '،'}),
Annotation(pos, 29, 30, NUM, {'word': 'یک'}),
Annotation(pos, 32, 35, NOUN, {'word': 'پزشک'}),
Annotation(pos, 37, 43, ADJ, {'word': 'انگلیسی'}),
Annotation(pos, 45, 45, CCONJ, {'word': 'و'}),
Annotation(pos, 47, 50, NOUN, {'word': 'رهبر'}),
Annotation(pos, 52, 56, NOUN, {'word': 'توسعه'}),
Annotation(pos, 58, 63, VERB, {'word': 'بیهوشی'}),
Annotation(pos, 65, 65, CCONJ, {'word': 'و'}),
Annotation(pos, 67, 72, NOUN, {'word': 'بهداشت'}),
Annotation(pos, 74, 78, ADJ, {'word': 'پزشکی'}),
Annotation(pos, 80, 82, AUX, {'word': 'است'}),
Annotation(pos, 83, 83, PUNCT, {'word': '.'})]}
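The annotations above can be flattened into simple (word, tag) pairs for downstream use. The following is a minimal sketch, assuming each annotation exposes `result` (the tag) and `metadata` (a dict holding the `word`), as the printed output suggests. `StubAnnotation` is a hypothetical stand-in so the snippet runs without a Spark session, and `to_word_tag_pairs` is an illustrative helper, not a Spark NLP function.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a Spark NLP annotation, used only for illustration.
StubAnnotation = namedtuple(
    "StubAnnotation", ["annotator_type", "begin", "end", "result", "metadata"]
)


def to_word_tag_pairs(annotations):
    """Return (word, tag) tuples from a list of POS annotations."""
    return [(a.metadata["word"], a.result) for a in annotations]


# The first four annotations from the results above, rebuilt as stubs.
sample = {
    "pos": [
        StubAnnotation("pos", 0, 2, "NOUN", {"word": "جان"}),
        StubAnnotation("pos", 4, 7, "NOUN", {"word": "اسنو"}),
        StubAnnotation("pos", 9, 11, "ADJ", {"word": "جدا"}),
        StubAnnotation("pos", 13, 14, "ADP", {"word": "از"}),
    ]
}

pairs = to_word_tag_pairs(sample["pos"])
tag_counts = Counter(tag for _, tag in pairs)

for word, tag in pairs:
    print(word, tag)
```

With a real pipeline, the same helper would presumably be applied to `results[0]["pos"]`, since `fullAnnotate` returns one dictionary per input text.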
Model Information
| Model Name: | pos_ud_perdt |
| Compatibility: | Spark NLP 2.7.0+ |
| Edition: | Official |
| Input Labels: | [token, document] |
| Output Labels: | [pos] |
| Language: | fa |
Data Source
The model is trained on data obtained from https://universaldependencies.org
Benchmarking
| label | precision | recall | f1-score | support |
|:-------------|----------:|-------:|---------:|--------:|
| ADJ | 0.88 | 0.88 | 0.88 | 1647 |
| ADP | 0.99 | 0.99 | 0.99 | 3402 |
| ADV | 0.94 | 0.91 | 0.92 | 383 |
| AUX | 0.99 | 0.99 | 0.99 | 1000 |
| CCONJ | 1.00 | 1.00 | 1.00 | 1022 |
| DET | 0.94 | 0.96 | 0.95 | 490 |
| INTJ | 0.88 | 0.81 | 0.85 | 27 |
| NOUN | 0.95 | 0.96 | 0.95 | 8201 |
| NUM | 0.94 | 0.97 | 0.96 | 293 |
| None | 1.00 | 0.99 | 0.99 | 289 |
| PART | 1.00 | 0.86 | 0.92 | 28 |
| PRON | 0.98 | 0.97 | 0.98 | 1117 |
| PROPN | 0.84 | 0.78 | 0.81 | 1107 |
| PUNCT | 1.00 | 1.00 | 1.00 | 2134 |
| SCONJ | 0.98 | 0.98 | 0.98 | 630 |
| VERB | 0.99 | 0.99 | 0.99 | 2581 |
| accuracy | | | 0.96 | 24351 |
| macro avg | 0.96 | 0.94 | 0.95 | 24351 |
| weighted avg | 0.96 | 0.96 | 0.96 | 24351 |
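The two average rows differ in how they weight the labels: the macro average treats every label equally, while the weighted average scales each label by its support. The sketch below recomputes both f1 averages from the per-label rows of the table. The variable names are illustrative, and because the inputs are already rounded to two decimals, the results only approximate the reported 0.95 and 0.96.

```python
# Per-label (f1-score, support) pairs copied from the benchmark table.
per_label = {
    "ADJ": (0.88, 1647),
    "ADP": (0.99, 3402),
    "ADV": (0.92, 383),
    "AUX": (0.99, 1000),
    "CCONJ": (1.00, 1022),
    "DET": (0.95, 490),
    "INTJ": (0.85, 27),
    "NOUN": (0.95, 8201),
    "NUM": (0.96, 293),
    "None": (0.99, 289),
    "PART": (0.92, 28),
    "PRON": (0.98, 1117),
    "PROPN": (0.81, 1107),
    "PUNCT": (1.00, 2134),
    "SCONJ": (0.98, 630),
    "VERB": (0.99, 2581),
}

# Total number of evaluated tokens across all labels.
total_support = sum(support for _, support in per_label.values())

# Macro average: unweighted mean of the per-label f1 scores.
macro_f1 = sum(f1 for f1, _ in per_label.values()) / len(per_label)

# Weighted average: each label's f1 weighted by its share of the tokens.
weighted_f1 = sum(f1 * support for f1, support in per_label.values()) / total_support

print(f"support={total_support} macro={macro_f1:.4f} weighted={weighted_f1:.4f}")
```

The macro figure is pulled down by small, harder labels such as INTJ and PROPN, while the weighted figure is dominated by NOUN, which accounts for roughly a third of the tokens.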