sparknlp.annotator.lemmatizer#

Contains classes for the Lemmatizer.

Module Contents#

Classes#

Lemmatizer

Class to find lemmas out of words with the objective of returning a base dictionary word.

LemmatizerModel

Instantiated Model of the Lemmatizer.

class Lemmatizer[source]#

Class to find lemmas out of words with the objective of returning a base dictionary word.

Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary().

For instantiated/pretrained models, see LemmatizerModel.

For available pretrained models please see the Models Hub. For extended examples of usage, see the Examples.

Input Annotation types

Output Annotation type

TOKEN

TOKEN

Parameters:
dictionary

Lemmatizer external dictionary.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

In this example, the lemma dictionary lemmas_small.txt has the following form:

...
pick    ->      pick    picks   picking picked
peck    ->      peck    pecking pecked  pecks
pickle  ->      pickle  pickles pickled pickling
pepper  ->      pepper  peppers peppered        peppering
...

where each key is delimited by -> and values are delimited by \t

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma") \
...     .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       sentenceDetector,
...       tokenizer,
...       lemmatizer
...     ])
>>> data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
...     .toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("lemma.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
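
For quick annotation of ad-hoc strings without building a DataFrame, the fitted pipeline can also be wrapped in a LightPipeline. This is a minimal sketch reusing the pipeline and data defined above; the expected lemmas match the result column shown:

>>> from sparknlp.base import LightPipeline
>>> light = LightPipeline(pipeline.fit(data))
>>> light.annotate("Peter Pipers employees are picking pecks of pickled peppers.")["lemma"]
['Peter', 'Pipers', 'employees', 'are', 'pick', 'peck', 'of', 'pickle', 'pepper', '.']
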
setFormCol(value)[source]#

Column that corresponds to CoNLLU(formCol=) output.

Parameters:
value : str

Name of column for Array of Form tokens

setLemmaCol(value)[source]#

Column that corresponds to CoNLLU(lemmaCol=) output.

Parameters:
value : str

Name of column for Array of Lemma tokens
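
setFormCol() and setLemmaCol() are only relevant when the Lemmatizer is trained from a CoNLL-U corpus instead of an external dictionary. A minimal sketch, assuming setDictionary() is not called and a local CoNLL-U file exists (the path conllu/train.conllu is hypothetical):

>>> from sparknlp.training import CoNLLU
>>> trainDataset = CoNLLU().readDataset(spark, "conllu/train.conllu")
>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["form"]) \
...     .setOutputCol("lemma") \
...     .setFormCol("form") \
...     .setLemmaCol("lemma")
>>> model = lemmatizer.fit(trainDataset)

Here "form" and "lemma" name the token and lemma annotation columns produced by the CoNLLU reader.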

setDictionary(path, key_delimiter, value_delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#

Sets the external dictionary for the lemmatizer.

Parameters:
path : str

Path to the source files

key_delimiter : str

Delimiter for the key

value_delimiter : str

Delimiter for the values

read_as : str, optional

How to read the file, by default ReadAs.TEXT

options : dict, optional

Options to read the resource, by default {"format": "text"}

Examples

Here, each key in the file is delimited by "->" and values are delimited by \t:

...
pick        ->      pick    picks   picking picked
peck        ->      peck    pecking pecked  pecks
pickle      ->      pickle  pickles pickled pickling
pepper      ->      pepper  peppers peppered        peppering
...

This file can then be parsed with

>>> lemmatizer = Lemmatizer() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma") \
...     .setDictionary("lemmas_small.txt", "->", "\t")
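
The dictionary is only read when the annotator is fitted, which yields a LemmatizerModel. As a minimal sketch, where tokenized_data stands for a hypothetical DataFrame that already contains the "token" column:

>>> model = lemmatizer.fit(tokenized_data)
>>> model.transform(tokenized_data).selectExpr("lemma.result").show(truncate=False)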
class LemmatizerModel(classname='com.johnsnowlabs.nlp.annotators.LemmatizerModel', java_model=None)[source]#

Instantiated Model of the Lemmatizer.

This is the instantiated model of the Lemmatizer. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with the static pretrained() method:

>>> lemmatizer = LemmatizerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")

For available pretrained models please see the Models Hub.

Input Annotation types

Output Annotation type

TOKEN

TOKEN

Parameters:
None

Examples

The lemmatizer from the Lemmatizer example above can be replaced with:

>>> lemmatizer = LemmatizerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")
static pretrained(name='lemma_antbnc', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default "lemma_antbnc"

lang : str, optional

Language of the pretrained model, by default "en"

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
LemmatizerModel

The restored model
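
The default name and language can also be passed explicitly, which is equivalent to calling pretrained() with no arguments:

>>> lemmatizer = LemmatizerModel.pretrained("lemma_antbnc", "en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("lemma")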