sparknlp.annotator.spell_check.symmetric_delete

Contains classes for SymmetricDelete.

Module Contents

Classes

SymmetricDeleteApproach

Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.

SymmetricDeleteModel

Symmetric Delete spelling correction algorithm.

class SymmetricDeleteApproach

Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
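The reduction comes from meeting in the middle: delete-only variants are precomputed for every dictionary word, and at lookup time only delete variants of the input are generated, so transposes, replaces, and inserts are never enumerated. As a rough illustration of the candidate-generation idea (plain Python, not Spark NLP's internal implementation; delete_variants is a hypothetical helper):

>>> def delete_variants(word, max_edit_distance=2):
...     """All strings reachable from word by deleting up to max_edit_distance characters."""
...     variants, frontier = {word}, {word}
...     for _ in range(max_edit_distance):
...         # Apply one more deletion to every string found so far
...         frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
...         variants |= frontier
...     return variants
>>> sorted(delete_variants("cat", 1))
['at', 'ca', 'cat', 'ct']

A candidate match arises whenever the delete-variant sets of a misspelling and a dictionary word intersect (the true Damerau-Levenshtein distance is then verified), and each intersection test is a constant-time hash lookup per variant.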

A dictionary of correct spellings must be provided with setDictionary() in the form of a text file, where each word is parsed by a regex pattern.

For instantiated/pretrained models, see SymmetricDeleteModel.

Input Annotation types: TOKEN

Output Annotation type: TOKEN

Parameters:

dictionary
    Folder or file with text that teaches about the language

maxEditDistance
    Maximum edit distance used to derive candidate strings from a word, by default 3

frequencyThreshold
    Minimum frequency of words to be considered from training, by default 0

deletesThreshold
    Minimum frequency of corrections a word needs to have to be considered from training, by default 0

See also

NorvigSweetingApproach

for an alternative approach to spell checking

ContextSpellCheckerApproach

for a DL based approach

References

Inspired by SymSpell.

Examples

In this example, the dictionary "words.txt" has the form of:

...
gummy
gummic
gummier
gummiest
gummiferous
...

This dictionary is then set to be the basis of the spell checker.

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("src/test/resources/spell/words.txt")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> pipelineModel = pipeline.fit(trainingData)
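The fitted pipeline can then be applied to new text; a minimal continuation, assuming trainingData is a DataFrame with a text column and "gummie" stands in for any misspelling of a dictionary word:

>>> data = spark.createDataFrame([["gummie"]]).toDF("text")
>>> result = pipelineModel.transform(data)

The corrections can then be inspected with result.select("spell.result").show(truncate=False).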
setDictionary(path, token_pattern='\\S+', read_as=ReadAs.TEXT, options={'format': 'text'})

Sets folder or file with text that teaches about the language.

Parameters:

path : str
    Path to the resource

token_pattern : str, optional
    Regex pattern to extract tokens, by default "\S+"

read_as : str, optional
    How to read the resource, by default ReadAs.TEXT

options : dict, optional
    Options for reading the resource, by default {"format": "text"}
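A hypothetical call that spells out the defaults explicitly (ReadAs is importable from sparknlp.common; the dictionary path is a placeholder):

>>> from sparknlp.common import ReadAs
>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("words.txt", token_pattern="\\S+", read_as=ReadAs.TEXT, options={"format": "text"})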

setMaxEditDistance(v)

Sets the maximum edit distance used to derive candidate strings from a word, by default 3.

Parameters:

v : int
    Maximum edit distance used to derive candidate strings from a word

setFrequencyThreshold(v)

Sets minimum frequency of words to be considered from training, by default 0.

Parameters:

v : int
    Minimum frequency of words to be considered from training

setDeletesThreshold(v)

Sets minimum frequency of corrections a word needs to have to be considered from training, by default 0.

Parameters:

v : int
    Minimum frequency of corrections a word needs to have to be considered from training
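The thresholds are usually set together on one annotator; a sketch restating the defaults explicitly (the dictionary path is a placeholder):

>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("words.txt") \
...     .setMaxEditDistance(3) \
...     .setFrequencyThreshold(0) \
...     .setDeletesThreshold(0)

Raising frequencyThreshold or deletesThreshold prunes rare words and rare corrections from the trained model, trading recall for a smaller lookup table.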

class SymmetricDeleteModel(classname='com.johnsnowlabs.nlp.annotators.spell.symmetric.SymmetricDeleteModel', java_model=None)

Symmetric Delete spelling correction algorithm.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

Pretrained models can be loaded with the static pretrained() method:

>>> spell = SymmetricDeleteModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")

The default model is "spellcheck_sd" if no name is provided. For available pretrained models, please see the Models Hub.

Input Annotation types: TOKEN

Output Annotation type: TOKEN

Parameters:

None

See also

NorvigSweetingModel

for an alternative approach to spell checking

ContextSpellCheckerModel

for a DL based approach

References

Inspired by SymSpell.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = SymmetricDeleteModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["spmetimes i wrrite wordz erong."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("spell.result").show(truncate=False)
+--------------------------------------+
|result                                |
+--------------------------------------+
|[sometimes, i, write, words, wrong, .]|
+--------------------------------------+
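Each element of the spell column is a full annotation struct, so fields beyond the corrected string are available through ordinary DataFrame operations, for example:

>>> result.selectExpr("explode(spell) as annotation") \
...     .selectExpr("annotation.result", "annotation.metadata") \
...     .show(truncate=False)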
static pretrained(name='spellcheck_sd', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:

name : str, optional
    Name of the pretrained model, by default "spellcheck_sd"

lang : str, optional
    Language of the pretrained model, by default "en"

remote_loc : str, optional
    Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:

SymmetricDeleteModel
    The restored model
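For example, requesting the default model explicitly by name and language:

>>> spell = SymmetricDeleteModel.pretrained("spellcheck_sd", "en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")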