sparknlp.annotator.lemmatizer#

Contains classes for the Lemmatizer.

Module Contents#

Classes#

Lemmatizer

Class to find lemmas out of words with the objective of returning a base dictionary word.

LemmatizerModel

Instantiated Model of the Lemmatizer.

class Lemmatizer[source]#

Class to find lemmas out of words with the objective of returning a base dictionary word.

Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary().

For instantiated/pretrained models, see LemmatizerModel.

For available pretrained models please see the Models Hub. For extended examples of usage, see the Examples.

Input Annotation types

Output Annotation type

TOKEN

TOKEN

Parameters:
dictionary

Lemmatizer external dictionary.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

In this example, the lemma dictionary lemmas_small.txt has the following form:

...
pick    ->      pick    picks   picking picked
peck    ->      peck    pecking pecked  pecks
pickle  ->      pickle  pickles pickled pickling
pepper  ->      pepper  peppers peppered        peppering
...

where each key is delimited by -> and values are delimited by \t

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma") \
...     .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       sentenceDetector,
...       tokenizer,
...       lemmatizer
...     ])
>>> data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
...     .toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("lemma.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
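
For quick annotation of ad-hoc strings without building a DataFrame, the fitted pipeline can also be wrapped in a LightPipeline. This is a minimal sketch reusing the pipeline and data defined above; the expected lemmas match the result column shown:

>>> from sparknlp.base import LightPipeline
>>> light = LightPipeline(pipeline.fit(data))
>>> light.annotate("Peter Pipers employees are picking pecks of pickled peppers.")["lemma"]
['Peter', 'Pipers', 'employees', 'are', 'pick', 'peck', 'of', 'pickle', 'pepper', '.']
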
setFormCol(value)[source]#

Column that corresponds to CoNLLU(formCol=) output.

Parameters:
value : str

Name of column for Array of Form tokens

setLemmaCol(value)[source]#

Column that corresponds to CoNLLU(lemmaCol=) output.

Parameters:
value : str

Name of column for Array of Lemma tokens
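
setFormCol() and setLemmaCol() are only relevant when the Lemmatizer is trained from a CoNLL-U corpus instead of an external dictionary. A minimal sketch, assuming setDictionary() is not called and a local CoNLL-U file exists (the path conllu/train.conllu is hypothetical):

>>> from sparknlp.training import CoNLLU
>>> trainDataset = CoNLLU().readDataset(spark, "conllu/train.conllu")
>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["form"]) \
...     .setOutputCol("lemma") \
...     .setFormCol("form") \
...     .setLemmaCol("lemma")
>>> model = lemmatizer.fit(trainDataset)

Here "form" and "lemma" name the token and lemma annotation columns produced by the CoNLLU reader.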

setDictionary(path, key_delimiter, value_delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#

Sets the external dictionary for the lemmatizer.

Parameters:
path : str

Path to the source files

key_delimiter : str

Delimiter for the key

value_delimiter : str

Delimiter for the values

read_as : str, optional

How to read the file, by default ReadAs.TEXT

options : dict, optional

Options to read the resource, by default {"format": "text"}

Examples

Here, each key in the file is delimited by "->" and values are delimited by \t:

...
pick        ->      pick    picks   picking picked
peck        ->      peck    pecking pecked  pecks
pickle      ->      pickle  pickles pickled pickling
pepper      ->      pepper  peppers peppered        peppering
...

This file can then be parsed with

>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma") \
...     .setDictionary("lemmas_small.txt", "->", "\t")
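
The dictionary is only read when the annotator is fitted, which yields a LemmatizerModel. As a minimal sketch, where tokenized_data stands for a hypothetical DataFrame that already contains the "token" column:

>>> model = lemmatizer.fit(tokenized_data)
>>> model.transform(tokenized_data).selectExpr("lemma.result").show(truncate=False)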
class LemmatizerModel(classname='com.johnsnowlabs.nlp.annotators.LemmatizerModel', java_model=None)[source]#

Instantiated Model of the Lemmatizer.

This is the instantiated model of the Lemmatizer. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with the static pretrained() method:

>>> lemmatizer = LemmatizerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")

For available pretrained models please see the Models Hub.

Input Annotation types

Output Annotation type

TOKEN

TOKEN

Parameters:
None

Examples

The lemmatizer from the Lemmatizer example above can be replaced with:

>>> lemmatizer = LemmatizerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")
static pretrained(name='lemma_antbnc', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default "lemma_antbnc"

lang : str, optional

Language of the pretrained model, by default "en"

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
LemmatizerModel

The restored model
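
The default name and language can also be passed explicitly, which is equivalent to calling pretrained() with no arguments:

>>> lemmatizer = LemmatizerModel.pretrained("lemma_antbnc", "en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")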