sparknlp.annotator.lemmatizer#
Contains classes for the Lemmatizer.
Module Contents#
Classes#
Lemmatizer
Class to find lemmas out of words with the objective of returning a base dictionary word.
LemmatizerModel
Instantiated Model of the Lemmatizer.
- class Lemmatizer[source]#
Class to find lemmas out of words with the objective of returning a base dictionary word.
Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary(). For instantiated/pretrained models, see LemmatizerModel.
For available pretrained models, please see the Models Hub. For extended examples of usage, see the Examples.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- dictionary
External dictionary of lemmas, provided with setDictionary().
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
In this example, the lemma dictionary lemmas_small.txt has the form of:

...
pick -> pick	picks	picking	picked
peck -> peck	pecking	pecked	pecks
pickle -> pickle	pickles	pickled	pickling
pepper -> pepper	peppers	peppered	peppering
...

where each key is delimited by -> and values are delimited by \t.

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma") \
...     .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         sentenceDetector,
...         tokenizer,
...         lemmatizer
...     ])
>>> data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
...     .toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("lemma.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
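Each element of the lemma output column is a full annotation, not just a string. As a sketch using standard Spark SQL on the result above, individual annotations and their metadata can be inspected with:

>>> result.selectExpr("explode(lemma) AS annotation") \
...     .selectExpr("annotation.result", "annotation.metadata") \
...     .show(truncate=False)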
- setFormCol(value)[source]#
Column that corresponds to CoNLLU(formCol=) output
- Parameters:
- value : str
Name of column for Array of Form tokens
- setLemmaCol(value)[source]#
Column that corresponds to CoNLLU(lemmaCol=) output (see the training sketch after this parameter list)
- Parameters:
- value : str
Name of column for Array of Lemma tokens
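setFormCol() and setLemmaCol() are only relevant when fitting the Lemmatizer on a CoNLL-U training set instead of a plain dictionary. A minimal sketch, assuming the sparknlp.training.CoNLLU reader (which exposes form and lemma annotation columns) and a hypothetical local file train.conllu:

>>> from sparknlp.training import CoNLLU
>>> conllu_data = CoNLLU(False).readDataset(spark, "train.conllu")  # hypothetical path
>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["form"]) \
...     .setOutputCol("lemma") \
...     .setFormCol("form") \
...     .setLemmaCol("lemma")
>>> model = lemmatizer.fit(conllu_data)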
- setDictionary(path, key_delimiter, value_delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets the external dictionary for the lemmatizer.
- Parameters:
- path : str
Path to the source files
- key_delimiter : str
Delimiter for the key
- value_delimiter : str
Delimiter for the values
- read_as : str, optional
How to read the file, by default ReadAs.TEXT
- options : dict, optional
Options to read the resource, by default {"format": "text"}
Examples
Here, each key in the file is delimited by "->" and values are delimited by \t:

...
pick -> pick	picks	picking	picked
peck -> peck	pecking	pecked	pecks
pickle -> pickle	pickles	pickled	pickling
pepper -> pepper	peppers	peppered	peppering
...
This file can then be parsed with:

>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma") \
...     .setDictionary("lemmas_small.txt", "->", "\t")
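To make the snippet self-contained, the dictionary file itself could be written first; a sketch, where the file name and entries simply mirror the sample above with tab-separated values:

>>> entries = [
...     "pick -> pick\tpicks\tpicking\tpicked",
...     "peck -> peck\tpecking\tpecked\tpecks",
...     "pickle -> pickle\tpickles\tpickled\tpickling",
...     "pepper -> pepper\tpeppers\tpeppered\tpeppering",
... ]
>>> with open("lemmas_small.txt", "w") as f:
...     _ = f.write("\n".join(entries))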
- class LemmatizerModel(classname='com.johnsnowlabs.nlp.annotators.LemmatizerModel', java_model=None)[source]#
Instantiated Model of the Lemmatizer.
This is the instantiated model of the Lemmatizer. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> lemmatizer = LemmatizerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")
For available pretrained models, please see the Models Hub.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- None
Examples
The lemmatizer from the example of the Lemmatizer can be replaced with:

>>> lemmatizer = LemmatizerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")
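A complete pipeline sketch around the pretrained model (the download requires network access; column names mirror the Lemmatizer example above):

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> lemmatizer = LemmatizerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, lemmatizer])
>>> data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("lemma.result").show(truncate=False)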
- static pretrained(name='lemma_antbnc', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "lemma_antbnc"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- LemmatizerModel
The restored model
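For example, to request the default English model explicitly by name and language:

>>> lemmatizer = LemmatizerModel.pretrained("lemma_antbnc", "en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")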