sparknlp.annotator.spell_check.symmetric_delete

Contains classes for SymmetricDelete.

Module Contents

Classes

SymmetricDeleteApproach

Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.

SymmetricDeleteModel

Symmetric Delete spelling correction algorithm.

class SymmetricDeleteApproach

Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
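The reduction comes from meeting in the middle: delete-only variants are precomputed for every dictionary word, and at lookup time only delete variants of the input are generated, so transposes, replaces, and inserts are never enumerated. As a rough illustration of the candidate-generation idea (plain Python, not Spark NLP's internal implementation; delete_variants is a hypothetical helper):

>>> def delete_variants(word, max_edit_distance=2):
...     """All strings reachable from word by deleting up to max_edit_distance characters."""
...     variants, frontier = {word}, {word}
...     for _ in range(max_edit_distance):
...         # Apply one more deletion to every string found so far
...         frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
...         variants |= frontier
...     return variants
>>> sorted(delete_variants("cat", 1))
['at', 'ca', 'cat', 'ct']

A candidate match arises whenever the delete-variant sets of a misspelling and a dictionary word intersect (the true Damerau-Levenshtein distance is then verified), and each intersection test is a constant-time hash lookup per variant.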

A dictionary of correct spellings must be provided with setDictionary() in the form of a text file, where each word is parsed by a regex pattern.

For instantiated/pretrained models, see SymmetricDeleteModel.

Input Annotation types: TOKEN

Output Annotation type: TOKEN

Parameters:

dictionary
    Folder or file with text that teaches about the language

maxEditDistance
    Maximum edit distance used to derive candidate strings from a word, by default 3

frequencyThreshold
    Minimum frequency of words to be considered from training, by default 0

deletesThreshold
    Minimum frequency of corrections a word needs to have to be considered from training, by default 0

See also

NorvigSweetingApproach

for an alternative approach to spell checking

ContextSpellCheckerApproach

for a DL based approach

References

Inspired by SymSpell.

Examples

In this example, the dictionary "words.txt" has the form of:

...
gummy
gummic
gummier
gummiest
gummiferous
...

This dictionary is then set to be the basis of the spell checker.

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("src/test/resources/spell/words.txt")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> pipelineModel = pipeline.fit(trainingData)
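The fitted pipeline can then be applied to new text; a minimal continuation, assuming trainingData is a DataFrame with a text column and "gummie" stands in for any misspelling of a dictionary word:

>>> data = spark.createDataFrame([["gummie"]]).toDF("text")
>>> result = pipelineModel.transform(data)

The corrections can then be inspected with result.select("spell.result").show(truncate=False).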
setDictionary(path, token_pattern='\\S+', read_as=ReadAs.TEXT, options={'format': 'text'})

Sets folder or file with text that teaches about the language.

Parameters:

path : str
    Path to the resource

token_pattern : str, optional
    Regex pattern to extract tokens, by default "\S+"

read_as : str, optional
    How to read the resource, by default ReadAs.TEXT

options : dict, optional
    Options for reading the resource, by default {"format": "text"}
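A hypothetical call that spells out the defaults explicitly (ReadAs is importable from sparknlp.common; the dictionary path is a placeholder):

>>> from sparknlp.common import ReadAs
>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("words.txt", token_pattern="\\S+", read_as=ReadAs.TEXT, options={"format": "text"})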

setMaxEditDistance(v)

Sets the maximum edit distance used to derive candidate strings from a word, by default 3.

Parameters:

v : int
    Maximum edit distance used to derive candidate strings from a word

setFrequencyThreshold(v)

Sets minimum frequency of words to be considered from training, by default 0.

Parameters:

v : int
    Minimum frequency of words to be considered from training

setDeletesThreshold(v)

Sets minimum frequency of corrections a word needs to have to be considered from training, by default 0.

Parameters:

v : int
    Minimum frequency of corrections a word needs to have to be considered from training
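The thresholds are usually set together on one annotator; a sketch restating the defaults explicitly (the dictionary path is a placeholder):

>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("words.txt") \
...     .setMaxEditDistance(3) \
...     .setFrequencyThreshold(0) \
...     .setDeletesThreshold(0)

Raising frequencyThreshold or deletesThreshold prunes rare words and rare corrections from the trained model, trading recall for a smaller lookup table.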

class SymmetricDeleteModel(classname='com.johnsnowlabs.nlp.annotators.spell.symmetric.SymmetricDeleteModel', java_model=None)

Symmetric Delete spelling correction algorithm.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

Pretrained models can be loaded with the static pretrained() method:

>>> spell = SymmetricDeleteModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")

The default model is "spellcheck_sd" if no name is provided. For available pretrained models, please see the Models Hub.

Input Annotation types: TOKEN

Output Annotation type: TOKEN

Parameters:

None

See also

NorvigSweetingModel

for an alternative approach to spell checking

ContextSpellCheckerModel

for a DL based approach

References

Inspired by SymSpell.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = SymmetricDeleteModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["spmetimes i wrrite wordz erong."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("spell.result").show(truncate=False)
+--------------------------------------+
|result                                |
+--------------------------------------+
|[sometimes, i, write, words, wrong, .]|
+--------------------------------------+
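Each element of the spell column is a full annotation struct, so fields beyond the corrected string are available through ordinary DataFrame operations, for example:

>>> result.selectExpr("explode(spell) as annotation") \
...     .selectExpr("annotation.result", "annotation.metadata") \
...     .show(truncate=False)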
static pretrained(name='spellcheck_sd', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:

name : str, optional
    Name of the pretrained model, by default "spellcheck_sd"

lang : str, optional
    Language of the pretrained model, by default "en"

remote_loc : str, optional
    Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:

SymmetricDeleteModel
    The restored model
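For example, requesting the default model explicitly by name and language:

>>> spell = SymmetricDeleteModel.pretrained("spellcheck_sd", "en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")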