Part of Speech for Hebrew

Description

This model annotates each token in a text with its part of speech. The tag set includes PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. Part-of-speech tags are useful for automatically extracting the grammatical structure of a text.


How to use

Use as part of an NLP pipeline, after tokenization.

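...
# document_assembler, sentence_detector and tokenizer come from the elided setup above
pos = PerceptronModel.pretrained("pos_ud_htb", "he") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
# The POS model is pretrained, so fit() only needs a DataFrame with a "text" column
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
# Triple quotes keep the gershayim (") inside the Hebrew text from ending the string
results = light_pipeline.fullAnnotate(["""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה"""])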

...
val pos = PerceptronModel.pretrained("pos_ud_htb", "he")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
// Triple quotes keep the gershayim (") inside the Hebrew text from ending the string literal
val data = Seq("""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה""").toDF("text")
val result = pipeline.fit(data).transform(data)
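
# Alternative: load the same model through the NLU library (Python)
import nlu

# Triple quotes keep the gershayim (") inside the Hebrew text from ending the string
text = ["""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה"""]
pos_df = nlu.load('he.pos.ud_htb').predict(text, output_level='token')
pos_df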

Results

{'pos': [Annotation(pos, 0, 0, ADP, {'word': 'ב'}),
Annotation(pos, 1, 1, PUNCT, {'word': '-'}),
Annotation(pos, 3, 4, NUM, {'word': '25'}),
Annotation(pos, 6, 12, VERB, {'word': 'לאוגוסט'}),
Annotation(pos, 14, 16, None, {'word': 'עצר'}),
Annotation(pos, 18, 22, VERB, {'word': 'השב"כ'}),
Annotation(pos, 24, 25, ADP, {'word': 'את'}),
Annotation(pos, 27, 31, PROPN, {'word': 'מוחמד'}),
Annotation(pos, 33, 42, PROPN, {'word': "אבו-ג'וייד"}),
Annotation(pos, 44, 44, PUNCT, {'word': ','}),
Annotation(pos, 46, 49, NOUN, {'word': 'אזרח'}),
Annotation(pos, 51, 55, ADJ, {'word': 'ירדני'}),
Annotation(pos, 57, 57, PUNCT, {'word': ','}),
Annotation(pos, 59, 63, VERB, {'word': 'שגויס'}),
Annotation(pos, 65, 70, ADP, {'word': 'לארגון'}),
Annotation(pos, 72, 76, NOUN, {'word': 'הפת"ח'}),
Annotation(pos, 78, 83, PROPN, {'word': 'והופעל'}),
Annotation(pos, 85, 86, ADP, {'word': 'על'}),
Annotation(pos, 88, 90, NOUN, {'word': 'ידי'}),
Annotation(pos, 92, 99, PROPN, {'word': 'חיזבאללה'})]}
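
To work with this output, it usually helps to flatten it into (word, tag) pairs. The following is a minimal sketch, not part of the original model card. It assumes each annotation exposes `begin`, `end`, `result`, and `metadata` in the order shown above, and it uses a small `namedtuple` as a hypothetical stand-in for the Spark NLP `Annotation` class so that it runs without a Spark session. The helper names `token_tag_pairs` and `surface` are illustrative only.

```python
from collections import namedtuple

# Hypothetical stand-in for the Spark NLP Annotation class, so the sketch
# runs without Spark. Field order mirrors the printed output above.
Annotation = namedtuple("Annotation", "annotator_type begin end result metadata")


def token_tag_pairs(annotated):
    """Flatten one fullAnnotate-style dict into a list of (word, tag) pairs."""
    return [(ann.metadata["word"], ann.result) for ann in annotated["pos"]]


def surface(text, ann):
    """Slice the token out of the text; begin and end are inclusive offsets."""
    return text[ann.begin:ann.end + 1]


# First three annotations from the results above, on the matching text prefix.
text = "ב- 25"
sample = {
    "pos": [
        Annotation("pos", 0, 0, "ADP", {"word": "ב"}),
        Annotation("pos", 1, 1, "PUNCT", {"word": "-"}),
        Annotation("pos", 3, 4, "NUM", {"word": "25"}),
    ]
}

pairs = token_tag_pairs(sample)
spans = [surface(text, ann) for ann in sample["pos"]]
print(pairs)
print(spans)
```

With the real pipeline, the same helper would presumably be applied to each element of `results`, since `fullAnnotate` on a list returns one such dict per input text.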

Model Information

Model Name: pos_ud_htb
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [tags, document]
Output Labels: [pos]
Language: he

Data Source

The model is trained on data obtained from https://universaldependencies.org

Benchmarking

|    |              | precision   | recall   |   f1-score |   support |
|---:|:-------------|:------------|:---------|-----------:|----------:|
|  0 | ADJ          | 0.83        | 0.83     |       0.83 |       676 |
|  1 | ADP          | 0.99        | 0.99     |       0.99 |      1889 |
|  2 | ADV          | 0.93        | 0.89     |       0.91 |       408 |
|  3 | AUX          | 0.90        | 0.90     |       0.90 |       229 |
|  4 | CCONJ        | 0.97        | 0.99     |       0.98 |       434 |
|  5 | DET          | 0.97        | 0.99     |       0.98 |      1390 |
|  6 | NOUN         | 0.91        | 0.94     |       0.93 |      3056 |
|  7 | NUM          | 0.97        | 0.96     |       0.97 |       285 |
|  9 | PRON         | 0.97        | 0.99     |       0.98 |       443 |
| 10 | PROPN        | 0.82        | 0.72     |       0.77 |       573 |
| 11 | PUNCT        | 1.00        | 1.00     |       1.00 |      1381 |
| 12 | SCONJ        | 0.99        | 0.90     |       0.94 |       411 |
| 13 | VERB         | 0.87        | 0.85     |       0.86 |      1063 |
| 14 | X            | 1.00        | 0.17     |       0.29 |         6 |
| 15 | accuracy     |             |          |       0.95 |     15089 |
| 16 | macro avg    | 0.94        | 0.87     |       0.89 |     15089 |
| 17 | weighted avg | 0.95        | 0.95     |       0.95 |     15089 |
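
For readers who want to check the table, the f1-score column is the harmonic mean of precision and recall. The sketch below is an illustration rather than the evaluation code used for this model; `f1_score` is a hypothetical helper name. It recomputes the PROPN and X rows from their listed precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the PROPN and X rows of the table.
propn_f1 = f1_score(0.82, 0.72)
x_f1 = f1_score(1.00, 0.17)

print(round(propn_f1, 2))  # 0.77
print(round(x_f1, 2))      # 0.29
```

Because the precision and recall in the table are already rounded to two decimals, a value recomputed this way can differ from the listed f1-score in the last digit for some rows.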