Description
This model annotates each token in a text with its part of speech. The tagset includes NN (Noun), CC (Conjuncts, Coordinating and Subordinating), and 24 other tags. The part-of-speech model is useful for automatically extracting the grammatical structure of a piece of text.
Predicted Entities
BM (Not Documented), CC (Conjuncts, Coordinating and Subordinating), CL (Clitics), DEM (Demonstratives), INJ (Interjection), INTF (Intensifier), JJ (Adjective), NEG (Negative), NN (Noun), NNC (Compound Noun), NNP (Proper Noun), NST (Preposition of Direction), PPR (Postposition), PRP (Pronoun), PSP (Postposition), QC (Cardinal Number), QF (Quantifier), QO (Ordinal Number), RB (Adverb), RDP (Not Documented), RP (Particle), SYM (Special Symbol), UT (Not Documented), VAUX (Auxiliary Verb), VM (Verb), WQ (Wh-Qualifier)
How to use
Use this model as part of an NLP pipeline after tokenization.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, PerceptronModel

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_msri", "bn") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    pos
])

example = spark.createDataFrame([["বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷"]], ["text"])
result = pipeline.fit(example).transform(example)
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_msri", "bn")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷"]
pos_df = nlu.load('bn.pos').predict(text, output_level = "token")
pos_df
Results
+------------------------------------------------------+----------------------------------------+
|text |result |
+------------------------------------------------------+----------------------------------------+
|বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷|[NN, NNP, NN, NN, VM, SYM, NN, SYM, SYM]|
+------------------------------------------------------+----------------------------------------+
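The `result` array above holds one tag per token, in token order. As a minimal sketch of how the arrays line up, the pairing for this sentence can be reproduced with a plain whitespace split (an assumption that happens to match Spark NLP's tokenization for this example; other inputs may tokenize differently):

```python
# Pair each token of the example sentence with the tag predicted above.
# Tags are copied from the Results row; whitespace splitting is an
# assumption that matches the 9 returned tags for this sentence.
sentence = "বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷"
tags = ["NN", "NNP", "NN", "NN", "VM", "SYM", "NN", "SYM", "SYM"]

token_tags = list(zip(sentence.split(), tags))
for token, tag in token_tags:
    print(f"{token}\t{tag}")
```

For example, the fifth pair is the verb: বলে is tagged VM.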
Model Information
| Model Name:    | pos_msri          |
|----------------|-------------------|
| Compatibility: | Spark NLP 2.7.0+  |
| License:       | Open Source       |
| Edition:       | Official          |
| Input Labels:  | [sentence, token] |
| Output Labels: | [pos]             |
| Language:      | bn                |
Data Source
The model was trained on the Indian Language POS-Tagged Corpus from NLTK, collected by A. Kumaran (Microsoft Research, India).
Benchmarking
| | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| BM | 1.00 | 1.00 | 1.00 | 1 |
| CC | 0.99 | 0.99 | 0.99 | 390 |
| CL | 1.00 | 1.00 | 1.00 | 2 |
| DEM | 0.98 | 0.99 | 0.98 | 139 |
| INJ | 0.92 | 0.85 | 0.88 | 13 |
| INTF | 1.00 | 1.00 | 1.00 | 55 |
| JJ | 0.99 | 0.99 | 0.99 | 688 |
| NEG | 0.99 | 0.98 | 0.99 | 135 |
| NN | 0.99 | 0.99 | 0.99 | 2996 |
| NNC | 1.00 | 1.00 | 1.00 | 4 |
| NNP | 0.97 | 0.98 | 0.97 | 528 |
| NST | 1.00 | 1.00 | 1.00 | 156 |
| PPR | 1.00 | 1.00 | 1.00 | 1 |
| PRP | 0.98 | 0.98 | 0.98 | 685 |
| PSP | 0.99 | 0.99 | 0.99 | 250 |
| QC | 0.99 | 0.99 | 0.99 | 193 |
| QF | 0.98 | 0.98 | 0.98 | 187 |
| QO | 1.00 | 1.00 | 1.00 | 22 |
| RB | 0.99 | 0.99 | 0.99 | 187 |
| RDP | 1.00 | 0.98 | 0.99 | 44 |
| RP | 0.99 | 0.96 | 0.97 | 79 |
| SYM | 0.97 | 0.98 | 0.98 | 1413 |
| UNK | 1.00 | 1.00 | 1.00 | 1 |
| UT | 1.00 | 1.00 | 1.00 | 18 |
| VAUX | 0.97 | 0.97 | 0.97 | 400 |
| VM | 0.99 | 0.98 | 0.98 | 1393 |
| WQ | 1.00 | 0.99 | 0.99 | 71 |
| XC | 0.98 | 0.97 | 0.97 | 219 |
| accuracy | | | 0.98 | 10270 |
| macro avg | 0.99 | 0.98 | 0.99 | 10270 |
| weighted avg | 0.98 | 0.98 | 0.98 | 10270 |
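As a sanity check, the weighted-avg row follows from the per-class F1 scores weighted by support. A minimal sketch, with the (F1, support) pairs copied from the table (the table's values are rounded to two decimals, so recomputed averages can differ in the last digit):

```python
# (f1, support) per tag, copied from the benchmark table above.
scores = {
    "BM": (1.00, 1),     "CC": (0.99, 390),   "CL": (1.00, 2),     "DEM": (0.98, 139),
    "INJ": (0.88, 13),   "INTF": (1.00, 55),  "JJ": (0.99, 688),   "NEG": (0.99, 135),
    "NN": (0.99, 2996),  "NNC": (1.00, 4),    "NNP": (0.97, 528),  "NST": (1.00, 156),
    "PPR": (1.00, 1),    "PRP": (0.98, 685),  "PSP": (0.99, 250),  "QC": (0.99, 193),
    "QF": (0.98, 187),   "QO": (1.00, 22),    "RB": (0.99, 187),   "RDP": (0.99, 44),
    "RP": (0.97, 79),    "SYM": (0.98, 1413), "UNK": (1.00, 1),    "UT": (1.00, 18),
    "VAUX": (0.97, 400), "VM": (0.98, 1393),  "WQ": (0.99, 71),    "XC": (0.97, 219),
}

# Weighted average: each class's F1 weighted by how many tokens it covers.
total = sum(n for _, n in scores.values())
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / total

print(total)                  # 10270
print(round(weighted_f1, 2))  # 0.98, matching the weighted-avg row
```

The high-support classes (NN, SYM, VM) dominate this average, which is why the weighted score tracks their F1 so closely.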