sparknlp.annotator.n_gram_generator
Contains classes for the NGramGenerator.
Module Contents
Classes
NGramGenerator | A feature transformer that converts the input array of strings (TOKEN) into an array of n-grams (CHUNK).
- class NGramGenerator
A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK).
Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
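The rules above can be illustrated with a minimal plain-Python sketch. This is not the library's implementation: `ngram_sketch` is a hypothetical helper that works on a plain list of strings rather than on Spark NLP annotations, and it only mirrors the documented behaviour (nulls ignored, space-separated output, nothing returned when the input is shorter than n).

```python
from typing import List, Optional


def ngram_sketch(tokens: List[Optional[str]], n: int = 2) -> List[str]:
    """Illustrate the documented n-gram rules (hypothetical helper, not library code)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # Null (None) values in the input are ignored.
    words = [t for t in tokens if t is not None]
    # Slide a window of size n over the words; the range is empty
    # when there are fewer than n words, so no n-grams are produced.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngram_sketch(["This", "is", "my", "sentence", "."]))
# ['This is', 'is my', 'my sentence', 'sentence .']
print(ngram_sketch([]))        # []
print(ngram_sketch(["one"]))   # [] because the input is shorter than n=2
```

The first call produces the same four strings that appear as chunk results in the pipeline example in the Examples section.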
For more extended examples see the Examples.
Input Annotation types: TOKEN
Output Annotation type: CHUNK
- Parameters:
- n
Number of elements per n-gram (>=1), by default 2
- enableCumulative
Whether to generate all n-grams from 1 through n instead of only the n-grams of exactly n elements, by default False
- delimiter
Character used to join the tokens, by default " " (a single space)
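How the three parameters interact can be sketched in plain Python. The function below is a hypothetical stand-in, not the annotator itself: its name and keyword arguments are made up for illustration, and the output order (grouped by n-gram size) is an assumption rather than something the annotator is documented to guarantee.

```python
from typing import List


def ngrams_with_options(words: List[str], n: int = 2,
                        enable_cumulative: bool = False,
                        delimiter: str = " ") -> List[str]:
    """Hypothetical sketch of the n, enableCumulative and delimiter parameters."""
    # Emit only size n, or every size from 1 through n when cumulative.
    sizes = range(1, n + 1) if enable_cumulative else [n]
    result = []
    for size in sizes:
        # Sliding window of the current size over the words.
        for start in range(len(words) - size + 1):
            result.append(delimiter.join(words[start:start + size]))
    return result


words = ["my", "new", "sentence"]
print(ngrams_with_options(words, n=2))
# ['my new', 'new sentence']
print(ngrams_with_options(words, n=2, enable_cumulative=True, delimiter="_"))
# ['my', 'new', 'sentence', 'my_new', 'new_sentence']
```

On the annotator itself these options are set through `setN`, `setEnableCumulative` and `setDelimiter`.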
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()  # start a Spark session; skip if one already exists
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> nGrams = NGramGenerator() \
...     .setInputCols(["token"]) \
...     .setOutputCol("ngrams") \
...     .setN(2)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     nGrams
... ])
>>> data = spark.createDataFrame([["This is my sentence."]]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(ngrams) as result").show(truncate=False)
+------------------------------------------------------------+
|result                                                      |
+------------------------------------------------------------+
|[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []]     |
|[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []]       |
|[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
|[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
+------------------------------------------------------------+
- setN(value)
Sets the number of elements per n-gram (>=1), by default 2.
- Parameters:
- value : int
Number of elements per n-gram (>=1), by default 2