sparknlp.annotator.stop_words_cleaner

Contains classes for the StopWordsCleaner.

Module Contents

Classes

StopWordsCleaner

This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences.

class StopWordsCleaner(classname='com.johnsnowlabs.nlp.annotators.StopWordsCleaner', java_model=None)

This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences.

By default, it uses stop words from MLlib's StopWordsRemover. Stop words can also be defined explicitly with setStopWords() or loaded from a pretrained model with pretrained().
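For example, a custom stop word list can be set directly (a minimal sketch; the chosen words are only illustrative):

>>> stopWords = StopWordsCleaner() \
...     .setInputCols(["token"]) \
...     .setOutputCol("cleanTokens") \
...     .setStopWords(["this", "is", "and"])

Alternatively, a pretrained model can be loaded: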

>>> stopWords = StopWordsCleaner.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("cleanTokens")

This will load the default pretrained model "stopwords_en".

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples.

Input Annotation types: TOKEN

Output Annotation type: TOKEN

Parameters:
stopWords

The words to be filtered out, by default English stop words from Spark ML

caseSensitive

Whether to consider case, by default False

locale

Locale of the input. Ignored when caseSensitive is True; by default the locale of the JVM

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> stopWords = StopWordsCleaner() \
...     .setInputCols(["token"]) \
...     .setOutputCol("cleanTokens") \
...     .setCaseSensitive(False)
>>> pipeline = Pipeline().setStages([
...       documentAssembler,
...       sentenceDetector,
...       tokenizer,
...       stopWords
...     ])
>>> data = spark.createDataFrame([
...     ["This is my first sentence. This is my second."],
...     ["This is my third sentence. This is my forth."]
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("cleanTokens.result").show(truncate=False)
+-------------------------------+
|result                         |
+-------------------------------+
|[first, sentence, ., second, .]|
|[third, sentence, ., fourth, .]|
+-------------------------------+
setStopWords(value)

Sets the words to be filtered out, by default English stop words from Spark ML.

Parameters:
value : List[str]

The words to be filtered out

setCaseSensitive(value)

Sets whether to do a case-sensitive comparison, by default False.

Parameters:
value : bool

Whether to do a case-sensitive comparison

setLocale(value)

Sets the locale of the input. Ignored when caseSensitive is True; by default the locale of the JVM.

Parameters:
value : str

Locale of the input
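A minimal sketch of setting the locale together with case-insensitive matching (the locale string is an assumption; any locale available to the JVM should be accepted):

>>> stopWords = StopWordsCleaner() \
...     .setInputCols(["token"]) \
...     .setOutputCol("cleanTokens") \
...     .setCaseSensitive(False) \
...     .setLocale("de_DE")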

loadDefaultStopWords(language='english')

Loads the default stop words for the given language.

Supported languages: danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, swedish, turkish

Parameters:
language : str, optional

Language of the stop words to load, by default "english"
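A minimal sketch of combining loadDefaultStopWords() with setStopWords(), assuming the call returns a plain list of words and can be invoked on the class:

>>> germanStopWords = StopWordsCleaner.loadDefaultStopWords("german")
>>> stopWords = StopWordsCleaner() \
...     .setInputCols(["token"]) \
...     .setOutputCol("cleanTokens") \
...     .setStopWords(germanStopWords)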

static pretrained(name='stopwords_en', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default "stopwords_en"

lang : str, optional

Language of the pretrained model, by default "en"

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
StopWordsCleaner

The restored model
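A minimal sketch of loading a pretrained model by name and language (the model name and language code are only illustrative; see the Models Hub for available models):

>>> stopWords = StopWordsCleaner.pretrained("stopwords_iso", "de") \
...     .setInputCols(["token"]) \
...     .setOutputCol("cleanTokens")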