Part of Speech for Polish

Description

A part-of-speech classifier predicts a grammatical label for every token in the input text. This model is implemented with an averaged perceptron architecture.
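To make the averaged perceptron idea concrete, here is a minimal pure-Python sketch. It is not Spark NLP's implementation: the `features` function, the `AveragedPerceptron` class, and the training loop are hypothetical and deliberately tiny, and the only training data is the example sentence from this page with the tags shown under Results.

```python
def features(tokens, i):
    """Hypothetical, deliberately small feature set for the token at position i."""
    word = tokens[i].lower()
    prev_word = tokens[i - 1].lower() if i > 0 else "<start>"
    return ["bias", "word=" + word, "suffix=" + word[-2:], "prev=" + prev_word]


class AveragedPerceptron:
    """Multiclass perceptron whose final weights are averaged over all training steps."""

    def __init__(self, labels):
        self.labels = sorted(labels)
        self.weights = {}  # (feature, label) -> current weight
        self.totals = {}   # (feature, label) -> weight summed over past steps
        self.stamps = {}   # (feature, label) -> step at which the weight last changed
        self.step = 0

    def score(self, feats, label):
        return sum(self.weights.get((f, label), 0.0) for f in feats)

    def predict(self, feats):
        # Ties are broken by label order, which keeps the sketch deterministic.
        return max(self.labels, key=lambda label: self.score(feats, label))

    def update(self, feats, gold, guess):
        self.step += 1
        if gold == guess:
            return
        for f in feats:
            for label, delta in ((gold, 1.0), (guess, -1.0)):
                key = (f, label)
                old = self.weights.get(key, 0.0)
                # Credit the old weight for every step it stayed unchanged.
                elapsed = self.step - self.stamps.get(key, 0)
                self.totals[key] = self.totals.get(key, 0.0) + elapsed * old
                self.stamps[key] = self.step
                self.weights[key] = old + delta

    def average(self):
        # Replace each weight with its mean value over all training steps.
        if self.step == 0:
            return
        for key, weight in self.weights.items():
            total = self.totals[key] + (self.step - self.stamps[key]) * weight
            self.weights[key] = total / self.step


def train(sentences, epochs=5):
    labels = {tag for _, sentence_tags in sentences for tag in sentence_tags}
    model = AveragedPerceptron(labels)
    for _ in range(epochs):
        for sentence_tokens, sentence_tags in sentences:
            for i, gold in enumerate(sentence_tags):
                feats = features(sentence_tokens, i)
                model.update(feats, gold, model.predict(feats))
    model.average()
    return model


# Toy data: the example sentence from this page and its tags from the Results section.
tokens = "Zarobki wszystkich nauczycieli będą rosły co rok .".split()
tags = ["NOUN", "DET", "NOUN", "AUX", "VERB", "ADP", "NOUN", "PUNCT"]

model = train([(tokens, tags)])
predicted = [model.predict(features(tokens, i)) for i in range(len(tokens))]
print(list(zip(tokens, predicted)))
```

Averaging the weights over all training steps, rather than keeping only the final values, is what distinguishes this from a plain perceptron. It tends to reduce the influence of the last few updates. A tagger trained on a single sentence only memorizes it, so the sketch illustrates the update rule, not the model's accuracy.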

Predicted Entities

ADJ
ADP
ADV
AUX
CCONJ
DET
NOUN
NUM
PART
PRON
PROPN
PUNCT
VERB


How to use

# Python: assumes an active Spark NLP session, e.g. spark = sparknlp.start()
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, PerceptronModel

document_assembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentence_detector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

# Input columns follow the model's input labels: [sentence, token]
pos = PerceptronModel.pretrained("pos_lfg", "pl")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("pos")

pipeline = Pipeline(stages=[
  document_assembler,
  sentence_detector,
  tokenizer,
  pos
])

example = spark.createDataFrame([['Zarobki wszystkich nauczycieli będą rosły co rok .']], ["text"])
result = pipeline.fit(example).transform(example)

// Scala: assumes an active SparkSession named spark
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for toDF

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Input columns follow the model's input labels: [sentence, token]
val pos = PerceptronModel.pretrained("pos_lfg", "pl")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("Zarobki wszystkich nauczycieli będą rosły co rok .").toDF("text")
val result = pipeline.fit(data).transform(data)

import nlu

text = ["Zarobki wszystkich nauczycieli będą rosły co rok ."]
token_df = nlu.load('pl.pos.lfg').predict(text)
token_df

Results

+--------------------------------------------------+----------------------------------------------+
|text                                              |result                                        |
+--------------------------------------------------+----------------------------------------------+
|Zarobki wszystkich nauczycieli będą rosły co rok .|[NOUN, DET, NOUN, AUX, VERB, ADP, NOUN, PUNCT]|
+--------------------------------------------------+----------------------------------------------+
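Each tag in the `result` column corresponds by position to one token of the input. The sketch below pairs them up in plain Python. It assumes that splitting on whitespace reproduces the tokenizer's output, which holds for this example only because the sentence is already space-separated, including the final period.

```python
# Values copied from the results table above.
text = "Zarobki wszystkich nauczycieli będą rosły co rok ."
tags = ["NOUN", "DET", "NOUN", "AUX", "VERB", "ADP", "NOUN", "PUNCT"]

# Assumption: whitespace splitting matches the tokenizer for this pre-tokenized sentence.
tokens = text.split()
pairs = list(zip(tokens, tags))

for token, tag in pairs:
    print(f"{token}\t{tag}")
```

For arbitrary text, read the tokens from the pipeline's `token` column instead of splitting the string, since the tokenizer separates punctuation that whitespace splitting would leave attached.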

Model Information

Model Name:	pos_lfg
Compatibility:	Spark NLP 2.7.5+
License:	Open Source
Edition:	Official
Input Labels:	[sentence, token]
Output Labels:	[pos]
Language:	pl

Data Source

The model was trained on the Universal Dependencies data set.

Benchmarking

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| ADJ          | 0.93      | 0.90   | 0.92     | 830     |
| ADP          | 0.98      | 0.99   | 0.99     | 1097    |
| ADV          | 0.91      | 0.94   | 0.93     | 589     |
| AUX          | 0.93      | 0.95   | 0.94     | 429     |
| CCONJ        | 0.98      | 0.99   | 0.98     | 354     |
| DET          | 0.94      | 0.91   | 0.93     | 324     |
| INTJ         | 0.67      | 0.33   | 0.44     | 6       |
| NOUN         | 0.93      | 0.95   | 0.94     | 2457    |
| NUM          | 0.92      | 0.94   | 0.93     | 90      |
| PART         | 0.99      | 0.95   | 0.97     | 597     |
| PRON         | 0.98      | 0.97   | 0.97     | 986     |
| PROPN        | 0.92      | 0.87   | 0.89     | 470     |
| PUNCT        | 1.00      | 1.00   | 1.00     | 2555    |
| SCONJ        | 0.97      | 0.99   | 0.98     | 141     |
| VERB         | 0.96      | 0.96   | 0.96     | 2187    |
| accuracy     |           |        | 0.96     | 13112   |
| macro avg    | 0.93      | 0.91   | 0.92     | 13112   |
| weighted avg | 0.96      | 0.96   | 0.96     | 13112   |
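The summary rows can be recomputed from the per-class rows as a sanity check. The sketch below assumes the usual definitions: the macro average is the unweighted mean over the 15 classes, and the weighted average weights each class by its support. Because the inputs are already rounded to two decimals, agreement with the table is approximate and checked only to two decimals.

```python
# (label, precision, recall, f1, support), copied from the per-class rows of the table.
rows = [
    ("ADJ", 0.93, 0.90, 0.92, 830),
    ("ADP", 0.98, 0.99, 0.99, 1097),
    ("ADV", 0.91, 0.94, 0.93, 589),
    ("AUX", 0.93, 0.95, 0.94, 429),
    ("CCONJ", 0.98, 0.99, 0.98, 354),
    ("DET", 0.94, 0.91, 0.93, 324),
    ("INTJ", 0.67, 0.33, 0.44, 6),
    ("NOUN", 0.93, 0.95, 0.94, 2457),
    ("NUM", 0.92, 0.94, 0.93, 90),
    ("PART", 0.99, 0.95, 0.97, 597),
    ("PRON", 0.98, 0.97, 0.97, 986),
    ("PROPN", 0.92, 0.87, 0.89, 470),
    ("PUNCT", 1.00, 1.00, 1.00, 2555),
    ("SCONJ", 0.97, 0.99, 0.98, 141),
    ("VERB", 0.96, 0.96, 0.96, 2187),
]

total_support = sum(row[4] for row in rows)


def macro_avg(col):
    """Unweighted mean of one metric column across classes."""
    return sum(row[col] for row in rows) / len(rows)


def weighted_avg(col):
    """Mean of one metric column, weighted by each class's support."""
    return sum(row[col] * row[4] for row in rows) / total_support


macro = [round(macro_avg(col), 2) for col in (1, 2, 3)]
weighted = [round(weighted_avg(col), 2) for col in (1, 2, 3)]

print("support:", total_support)   # 13112
print("macro avg:", macro)         # [0.93, 0.91, 0.92]
print("weighted avg:", weighted)   # [0.96, 0.96, 0.96]
```

The gap between the macro and weighted averages comes mostly from INTJ, which has only 6 test examples and an f1-score of 0.44: it counts as much as any other class in the macro average but barely affects the weighted one.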
