Description
This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
Live Demo Open in Colab Download Copy S3 URI
How to use
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma", "bh") \
.setInputCols(["token"]) \
.setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
results = light_pipeline.fullAnnotate(["एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।"])
val document_assembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma", "bh")
.setInputCols("token")
.setOutputCol("lemma")
val nlp_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))
val data = Seq("एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।"]
lemma_df = nlu.load('bh.lemma').predict(text, output_level = "document")
lemma_df.lemma.values[0]
Results
{'lemma': [Annotation(token, 0, 1, एह, {'sentence': '0'}),
Annotation(token, 3, 7, आयोजन, {'sentence': '0'}),
Annotation(token, 9, 11, में, {'sentence': '0'}),
Annotation(token, 13, 17, विश्व, {'sentence': '0'}),
Annotation(token, 19, 25, भोजपुरी, {'sentence': '0'}),
Annotation(token, 27, 33, सम्मेलन, {'sentence': '0'}),
Annotation(token, 35, 35, COMMA, {'sentence': '0'}),
Annotation(token, 37, 45, पूर्वांचल, {'sentence': '0'}),
Annotation(token, 47, 50, एकता, {'sentence': '0'}),
Annotation(token, 52, 54, मंच, {'sentence': '0'}),
Annotation(token, 56, 56, COMMA, {'sentence': '0'}),
Annotation(token, 58, 60, वीर, {'sentence': '0'}),
Annotation(token, 62, 66, कुँवर, {'sentence': '0'}),
Annotation(token, 68, 71, सिंह, {'sentence': '0'}),
Annotation(token, 73, 81, फाउन्डेशन, {'sentence': '0'}),
Annotation(token, 83, 83, COMMA, {'sentence': '0'}),
Annotation(token, 85, 93, पूर्वांचल, {'sentence': '0'}),
Annotation(token, 95, 101, भोजपुरी, {'sentence': '0'}),
Annotation(token, 103, 108, महासभा, {'sentence': '0'}),
Annotation(token, 110, 110, COMMA, {'sentence': '0'}),
Annotation(token, 112, 114, अउर, {'sentence': '0'}),
Annotation(token, 116, 119, हर्फ, {'sentence': '0'}),
Annotation(token, 121, 121, -, {'sentence': '0'}),
Annotation(token, 123, 128, मीडिया, {'sentence': '0'}),
Annotation(token, 130, 131, को, {'sentence': '0'}),
Annotation(token, 133, 140, सहभागिता, {'sentence': '0'}),
Annotation(token, 142, 143, बा, {'sentence': '0'}),
Annotation(token, 145, 145, ।, {'sentence': '0'})]}
Model Information
Model Name: | lemma |
Compatibility: | Spark NLP 2.7.0+ |
Edition: | Official |
Input Labels: | [document] |
Output Labels: | [token] |
Language: | bh |
Data Source
The model was trained on the Universal Dependencies data set version 2.7.
Reference:
- Ojha, A. K., & Zeman, D. (2020). Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri. Proceedings of the WILDRE5{–} 5th Workshop on Indian Language Data: Resources and Evaluation.