Part of Speech for Bengali

Description

A part-of-speech (POS) classifier predicts a grammatical label for every token in the input text. This model is implemented with an averaged perceptron architecture.
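For readers unfamiliar with the term, the idea behind an averaged perceptron can be sketched in a few lines of plain Python. This is a toy illustration only, not the Spark NLP implementation: the class name `AveragedPerceptron`, the `features` helper, and the three-word training set are all hypothetical and exist purely to show the update-and-average mechanics.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Toy multi-class averaged perceptron (illustrative, not Spark NLP's code)."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = defaultdict(float)  # (feature, label) -> current weight
        self._totals = defaultdict(float)  # (feature, label) -> weight summed over steps
        self._steps = 0

    def predict(self, features, weights=None):
        # Score each label as the sum of its feature weights; ties go to the
        # label listed first.
        w = self.weights if weights is None else weights
        return max(self.labels,
                   key=lambda lab: sum(w.get((f, lab), 0.0) for f in features))

    def update(self, features, gold):
        guess = self.predict(features)
        if guess != gold:
            # Standard perceptron rule: reward the gold label, penalize the guess.
            for f in features:
                self.weights[(f, gold)] += 1.0
                self.weights[(f, guess)] -= 1.0
        # Accumulate after every example so the average covers all steps.
        for key, value in self.weights.items():
            self._totals[key] += value
        self._steps += 1

    def averaged_weights(self):
        return {key: total / self._steps for key, total in self._totals.items()}


def features(word):
    # Hypothetical, deliberately tiny feature set.
    return ["bias", "word=" + word]


# Hypothetical training data: three token/tag pairs.
train = [("থেকে", "PSP"), ("!", "SYM"), ("ল্যাবস", "NN")]

model = AveragedPerceptron(labels=["NN", "PSP", "SYM"])
for _ in range(5):  # a few passes over the data
    for word, tag in train:
        model.update(features(word), tag)

averaged = model.averaged_weights()
for word, _ in train:
    print(word, model.predict(features(word), averaged))
```

Averaging the weights over all training steps, rather than keeping only the final values, is what distinguishes this from a plain perceptron; it tends to make the tagger less sensitive to the last few examples seen.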

Predicted Entities

  • NN
  • SYM
  • NNP
  • VM
  • INTF
  • JJ
  • QF
  • CC
  • NST
  • PSP
  • QC
  • DEM
  • RDP
  • PRP
  • NEG
  • WQ
  • RB
  • VAUX
  • UT
  • XC
  • RP
  • QO
  • BM
  • NNC
  • PPR
  • INJ
  • CL
  • UNK
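
Downstream code sometimes needs to check that the tags it receives belong to this label set. A minimal sketch in plain Python follows; the names `POS_MSRI_TAGS` and `unexpected_tags` are hypothetical helpers, not part of Spark NLP, and the tag set is simply copied from the list above.

```python
# Label set copied from the "Predicted Entities" list.
POS_MSRI_TAGS = frozenset({
    "NN", "SYM", "NNP", "VM", "INTF", "JJ", "QF", "CC", "NST", "PSP",
    "QC", "DEM", "RDP", "PRP", "NEG", "WQ", "RB", "VAUX", "UT", "XC",
    "RP", "QO", "BM", "NNC", "PPR", "INJ", "CL", "UNK",
})


def unexpected_tags(tags):
    """Return tags (in first-seen order) that are not in the model's label set."""
    seen = []
    for tag in tags:
        if tag not in POS_MSRI_TAGS and tag not in seen:
            seen.append(tag)
    return seen


# Tag sequence taken from the Results section of this page.
predicted = ["NN", "NN", "NN", "PSP", "JJ", "SYM"]
print(unexpected_tags(predicted))

# A tag from a different tagset is flagged.
print(unexpected_tags(["NN", "VERB", "PSP"]))
```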

How to use


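import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, PerceptronModel
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Wrap the raw text column as a Spark NLP document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Load the pretrained Bengali POS tagger
posTagger = PerceptronModel.pretrained("pos_msri", "bn") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[document_assembler, tokenizer, posTagger])

example = spark.createDataFrame([['জন স্নো ল্যাবস থেকে হ্যালো! ']], ["text"])

result = pipeline.fit(example).transform(example)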



import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// The tagger reads the "token" column, so the pipeline needs a Tokenizer stage
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_msri", "bn")
  .setInputCols(Array("document", "token"))
  .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, pos))

val data = Seq("জন স্নো ল্যাবস থেকে হ্যালো! ").toDF("text")
val result = pipeline.fit(data).transform(data)


import nlu
text = ["জন স্নো ল্যাবস থেকে হ্যালো! "]
token_df = nlu.load('bn.pos').predict(text)
token_df

Results

   token    pos
0  জন       NN
1  স্নো      NN
2  ল্যাবস    NN
3  থেকে     PSP
4  হ্যালো    JJ
5  !        SYM
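
A table like the one above can be produced by pairing each token with its tag. The sketch below uses plain Python lists so that it runs without Spark; `pair_tokens_with_tags` is a hypothetical helper, and the commented lines show one way the two lists might be pulled from the `result` DataFrame, assuming the `token` and `pos` output columns used in the pipeline.

```python
def pair_tokens_with_tags(tokens, tags):
    """Zip parallel token and tag lists, refusing mismatched lengths."""
    if len(tokens) != len(tags):
        raise ValueError(
            f"expected one tag per token, got {len(tokens)} tokens and {len(tags)} tags"
        )
    return list(zip(tokens, tags))


# With Spark available, the two lists could come from something like:
#   row = result.select("token.result", "pos.result").first()
#   tokens, tags = row[0], row[1]
# Here the values are copied from the Results table instead.
tokens = ["জন", "স্নো", "ল্যাবস", "থেকে", "হ্যালো", "!"]
tags = ["NN", "NN", "NN", "PSP", "JJ", "SYM"]

pairs = pair_tokens_with_tags(tokens, tags)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

The length check is a design choice for this sketch: the tagger is expected to emit one tag per token, so a mismatch most likely means the two columns came from different rows or pipelines.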

Model Information

Model Name: pos_msri
Compatibility: Spark NLP 3.0.0+
License: Open Source
Edition: Official
Input Labels: [document, token]
Output Labels: [pos]
Language: bn