sparknlp.annotator.lemmatizer

Contains classes for the Lemmatizer.

Module Contents

Classes

Lemmatizer
    Class to find lemmas out of words with the objective of returning a base dictionary word.
LemmatizerModel
    Instantiated Model of the Lemmatizer.
- class Lemmatizer

Class to find lemmas out of words with the objective of returning a base dictionary word. Retrieves the significant part of a word.

A dictionary of predefined lemmas must be provided with setDictionary(). For instantiated/pretrained models, see LemmatizerModel.

For available pretrained models please see the Models Hub. For extended examples of usage, see the Examples.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
  - dictionary
    Lemmatizer external dictionary.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

In this example, the lemma dictionary lemmas_small.txt has the form of:

...
pick -> pick picks picking picked
peck -> peck pecking pecked pecks
pickle -> pickle pickles pickled pickling
pepper -> pepper peppers peppered peppering
...

where each key is delimited by -> and values are delimited by \t.

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma") \
...     .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       sentenceDetector,
...       tokenizer,
...       lemmatizer
...     ])
>>> data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
...     .toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("lemma.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
- setFormCol(value)

Column that corresponds to CoNLLU(formCol=) output.

- Parameters:
  - value : str
    Name of the column for the Array of Form tokens
- setLemmaCol(value)

Column that corresponds to CoNLLU(lemmaCol=) output.

- Parameters:
  - value : str
    Name of the column for the Array of Lemma tokens
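Together, setFormCol() and setLemmaCol() are used when training a Lemmatizer from a CoNLL-U corpus rather than from a plain-text dictionary. A minimal sketch, assuming a CoNLL-U training file at the hypothetical path "en.train.conllu" and using the CoNLLU reader from sparknlp.training:

>>> from sparknlp.training import CoNLLU
>>> # The reader produces a DataFrame that includes "form" and "lemma" columns
>>> conllu_data = CoNLLU().readDataset(spark, "en.train.conllu")
>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma") \
...     .setFormCol("form") \
...     .setLemmaCol("lemma")
>>> model = lemmatizer.fit(conllu_data)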
- setDictionary(path, key_delimiter, value_delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})

Sets the external dictionary for the lemmatizer.

- Parameters:
  - path : str
    Path to the source files
  - key_delimiter : str
    Delimiter for the key
  - value_delimiter : str
    Delimiter for the values
  - read_as : str, optional
    How to read the file, by default ReadAs.TEXT
  - options : dict, optional
    Options to read the resource, by default {"format": "text"}
Examples
Here each key in the file is delimited by "->" and values are delimited by \t:

...
pick -> pick picks picking picked
peck -> peck pecking pecked pecks
pickle -> pickle pickles pickled pickling
pepper -> pepper peppers peppered peppering
...

This file can then be parsed with:

>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma") \
...     .setDictionary("lemmas_small.txt", "->", "\t")
- class LemmatizerModel(classname='com.johnsnowlabs.nlp.annotators.LemmatizerModel', java_model=None)

Instantiated Model of the Lemmatizer.

This is the instantiated model of the Lemmatizer. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> lemmatizer = LemmatizerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")
For available pretrained models please see the Models Hub.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- None
Examples
The lemmatizer from the example of the Lemmatizer can be replaced with:

>>> lemmatizer = LemmatizerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")
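Put together, the pretrained model drops into the same pipeline shape as the trainable annotator. A minimal sketch, assuming a running Spark session with Spark NLP loaded (e.g. via sparknlp.start()):

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> lemmatizer = LemmatizerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, lemmatizer])
>>> data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)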
- static pretrained(name='lemma_antbnc', lang='en', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
  - name : str, optional
    Name of the pretrained model, by default "lemma_antbnc"
  - lang : str, optional
    Language of the pretrained model, by default "en"
  - remote_loc : str, optional
    Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
  - LemmatizerModel
    The restored model
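For example, the default English model can also be requested explicitly by name and language:

>>> lemmatizer = LemmatizerModel.pretrained("lemma_antbnc", "en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")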