sparknlp.annotator.chunker#

Contains classes for the Chunker.

Module Contents#

Classes#

Chunker
    This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document.

class Chunker[source]#

This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document. The extracted part-of-speech tags are mapped onto the sentence, which can then be parsed by regular expressions. The part-of-speech tags are wrapped in angle brackets <> to be easily distinguishable in the text itself.

This example sentence will result in the form:

"Peter Pipers employees are picking pecks of pickled peppers."
"<NNP><NNP><NNS><VBP><VBG><NNS><IN><JJ><NNS><.>"

To extract matches for these tags, regexParsers needs to be set, for example:

>>> chunker = Chunker() \
...    .setInputCols(["sentence", "pos"]) \
...    .setOutputCol("chunk") \
...    .setRegexParsers(["<NNP>+", "<NNS>+"])

When defining the regular expressions, tags enclosed in angle brackets are treated as groups, so here "<NNP>+" specifically means one or more proper nouns in succession.
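
Since the bracketed tags behave like ordinary regex groups, a parser can also combine different tags and quantifiers in a single pattern. As a minimal sketch (the variable and output column names here are illustrative, not part of the library docs), a parser matching an adjective followed by one or more plural nouns would capture "pickled peppers" from the example above:

>>> adjNounChunker = Chunker() \
...    .setInputCols(["sentence", "pos"]) \
...    .setOutputCol("adj_noun_chunk") \
...    .setRegexParsers(["<JJ><NNS>+"])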

For more extended examples, see the Examples.

Input Annotation types: DOCUMENT, POS

Output Annotation type: CHUNK

Parameters:
regexParsers

An array of grammar-based chunk parsers

See also

PerceptronModel, for Part-Of-Speech tagging

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols("document") \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> POSTag = PerceptronModel.pretrained() \
...     .setInputCols("document", "token") \
...     .setOutputCol("pos")
>>> chunker = Chunker() \
...     .setInputCols("sentence", "pos") \
...     .setOutputCol("chunk") \
...     .setRegexParsers(["<NNP>+", "<NNS>+"])
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       sentence,
...       tokenizer,
...       POSTag,
...       chunker
...     ])
>>> data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(chunk) as result").show(truncate=False)
+-------------------------------------------------------------+
|result                                                       |
+-------------------------------------------------------------+
|[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
|[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []]  |
|[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []]      |
|[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []]    |
+-------------------------------------------------------------+
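
Each row above is a full annotation struct. If only the matched text is needed, the result field can be selected out of the structs directly; a minimal follow-up sketch, assuming the standard Spark NLP annotation schema shown in the output above:

>>> result.selectExpr("explode(chunk.result) as chunk_text").show(truncate=False)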
setRegexParsers(value)[source]#

Sets an array of grammar-based chunk parsers.

POS classes should be enclosed in angle brackets; they are then treated as groups.

Parameters:

value : List[str]
    Array of grammar-based chunk parsers

Examples

>>> chunker = Chunker() \
...     .setInputCols("sentence", "pos") \
...     .setOutputCol("chunk") \
...     .setRegexParsers(["<NNP>+", "<NNS>+"])