sparknlp.annotator.n_gram_generator#

Contains classes for the NGramGenerator.

Module Contents#

Classes#

NGramGenerator

A feature transformer that converts the input array of strings

class NGramGenerator[source]#

A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK).

Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.

When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
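The semantics described above can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's actual implementation; the helper name `ngrams` is hypothetical:

```python
def ngrams(tokens, n=2, delimiter=" "):
    """Illustrative sketch of NGramGenerator semantics (not the real implementation).

    - None (null) values in the input are ignored.
    - An empty input yields an empty list.
    - If fewer than n tokens remain, no n-grams are produced.
    """
    tokens = [t for t in tokens if t is not None]  # null values are ignored
    # each n-gram is the delimiter-joined run of n consecutive tokens
    return [delimiter.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

ngrams(["This", "is", "my", "sentence", "."])
# -> ['This is', 'is my', 'my sentence', 'sentence .']
ngrams(["one"], n=2)   # fewer tokens than n -> []
ngrams([], n=2)        # empty input -> []
```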

For more extended examples, see the Examples section.

Input Annotation types: TOKEN

Output Annotation type: CHUNK

Parameters:
n

Number of elements per n-gram (>= 1), by default 2

enableCumulative

Whether to calculate just the actual n-grams or, cumulatively, all n-grams from length 1 up to n, by default False

delimiter

Character used to join the tokens, by default " "

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> nGrams = NGramGenerator() \
...     .setInputCols(["token"]) \
...     .setOutputCol("ngrams") \
...     .setN(2)
>>> pipeline = Pipeline().setStages([
...       documentAssembler,
...       sentence,
...       tokenizer,
...       nGrams
...     ])
>>> data = spark.createDataFrame([["This is my sentence."]]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(ngrams) as result").show(truncate=False)
+------------------------------------------------------------+
|result                                                      |
+------------------------------------------------------------+
|[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []]     |
|[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []]       |
|[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
|[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
+------------------------------------------------------------+
setN(value)[source]#

Sets the number of elements per n-gram (>= 1), by default 2.

Parameters:
value : int

Number of elements per n-gram (>= 1)

setEnableCumulative(value)[source]#

Sets whether to calculate just the actual n-grams or, cumulatively, all n-grams from length 1 up to n, by default False.

Parameters:
value : bool

Whether to calculate cumulative n-grams as well
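For intuition, the cumulative behavior can be sketched in plain Python. This is an assumed reading of the semantics (all n-gram lengths from 1 up to n), not the library's code; `cumulative_ngrams` is a hypothetical helper:

```python
def cumulative_ngrams(tokens, n=2, delimiter=" "):
    """Sketch of assumed enableCumulative semantics: emit n-grams of every
    length from 1 up to n, not just length n (not the real implementation)."""
    tokens = [t for t in tokens if t is not None]  # null values are ignored
    out = []
    for size in range(1, n + 1):
        out.extend(delimiter.join(tokens[i:i + size])
                   for i in range(len(tokens) - size + 1))
    return out

cumulative_ngrams(["This", "is", "my"], n=2)
# -> ['This', 'is', 'my', 'This is', 'is my']
```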

setDelimiter(value)[source]#

Sets the character used to join the tokens.

Parameters:
value : str

Character used to join the tokens

Raises:
Exception

If the delimiter does not have length == 1
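The single-character constraint can be sketched as follows. This only illustrates the documented check, not the library's code; `set_delimiter` is a hypothetical stand-in:

```python
def set_delimiter(value):
    """Sketch of the documented constraint: the delimiter must be exactly
    one character, otherwise an Exception is raised (illustrative only)."""
    if len(value) != 1:
        raise Exception("Delimiter should have length == 1")
    return value

set_delimiter("-")     # fine: a single character
# set_delimiter("--")  # would raise: Delimiter should have length == 1
```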