sparknlp.base.doc2_chunk

Contains classes for Doc2Chunk.

Module Contents

Classes

Doc2Chunk

Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.

class Doc2Chunk

Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.

Chunk text must be contained within the input DOCUMENT. The chunk column may be either StringType or ArrayType[StringType] (toggled with setIsArray). Useful for annotators that require a CHUNK type input.
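Conceptually, resolving a chunk string against the document amounts to a substring search that yields begin/end character offsets. The following is an illustrative sketch only, not Spark NLP's actual implementation; the helper name `locate_chunk` and the inclusive begin/end convention are assumptions made for the example:

```python
def locate_chunk(document: str, chunk: str, lower_case: bool = False):
    """Return (begin, end) character offsets of `chunk` in `document`,
    or None if the chunk text is not contained in the document.
    Offsets are inclusive, mirroring the begin/end fields of an Annotation."""
    haystack = document.lower() if lower_case else document
    needle = chunk.lower() if lower_case else chunk
    begin = haystack.find(needle)
    if begin == -1:
        # Chunk text is not part of the document
        return None
    return begin, begin + len(chunk) - 1

doc = "Spark NLP is an open-source text processing library."
print(locate_chunk(doc, "text processing library"))  # (28, 50)
```

This also illustrates why the chunk text must literally occur inside the document: when the search fails, no offsets can be produced.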

Input Annotation types: DOCUMENT

Output Annotation type: CHUNK

Parameters:
chunkCol

Column that contains the string. Must be part of DOCUMENT.

startCol

Column that holds a reference to where the chunk begins.

startColByTokenIndex

Whether the start column is prepended by whitespace tokens.

isArray

Whether the chunkCol is an array of strings, by default False.

failOnMissing

Whether to fail the job if a chunk is not found within the document. Returns empty otherwise.

lowerCase

Whether to lowercase the text when matching chunks.
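The interplay of failOnMissing and lowerCase can be sketched in plain Python. `resolve_chunks` below is a hypothetical helper written for illustration, not part of the Spark NLP API; it only mimics the documented behavior of failing versus returning empty on a missing chunk:

```python
def resolve_chunks(document, chunk_texts, fail_on_missing=False, lower_case=False):
    """Resolve an array of chunk strings against one document,
    returning a list of (begin, end) inclusive character spans."""
    haystack = document.lower() if lower_case else document
    spans = []
    for text in chunk_texts:
        needle = text.lower() if lower_case else text
        begin = haystack.find(needle)
        if begin == -1:
            if fail_on_missing:
                # failOnMissing=True: abort instead of silently dropping the chunk
                raise ValueError(f"chunk not found in document: {text!r}")
            continue  # failOnMissing=False: yield nothing for this chunk
        spans.append((begin, begin + len(text) - 1))
    return spans
```

With `lower_case=True` the match is case-insensitive, while the reported offsets still refer to the original document text.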

See also

Chunk2Doc

for converting CHUNK annotations to DOCUMENT

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> chunkAssembler = Doc2Chunk() \
...     .setInputCols("document") \
...     .setChunkCol("target") \
...     .setOutputCol("chunk") \
...     .setIsArray(True)
>>> data = spark.createDataFrame([[
...     "Spark NLP is an open-source text processing library for advanced natural language processing.",
...     ["Spark NLP", "text processing library", "natural language processing"]
... ]]).toDF("text", "target")
>>> pipeline = Pipeline().setStages([documentAssembler, chunkAssembler]).fit(data)
>>> result = pipeline.transform(data)
>>> result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
setChunkCol(value)

Sets column that contains the string. Must be part of DOCUMENT.

Parameters:
value : str

Name of the Chunk Column

setIsArray(value)

Sets whether the chunkCol is an array of strings.

Parameters:
value : bool

Whether the chunkCol is an array of strings

setStartCol(value)

Sets the column that holds a reference to where the chunk begins.

Parameters:
value : str

Name of the reference column

setStartColByTokenIndex(value)

Sets whether start column is prepended by whitespace tokens.

Parameters:
value : bool

Whether the start column is prepended by whitespace tokens
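When the start column is given as a whitespace-token index rather than a character offset, it has to be mapped onto a character position. A hedged sketch of such a conversion, with an illustrative helper name that is not part of the library:

```python
def token_index_to_char_offset(document: str, token_index: int) -> int:
    """Map a whitespace-token index to the character offset where that
    token begins. Illustrative only; not Spark NLP's internal code."""
    offset = 0
    for i, token in enumerate(document.split()):
        # Find this token starting after the previous one, so repeated
        # substrings earlier in the text are not matched by mistake.
        offset = document.index(token, offset)
        if i == token_index:
            return offset
        offset += len(token)
    raise IndexError(f"token index {token_index} out of range")

print(token_index_to_char_offset("Spark NLP is great", 2))  # 10
```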

setFailOnMissing(value)

Sets whether to fail the job if a chunk is not found within the document. Returns empty otherwise.

Parameters:
value : bool

Whether to fail job on missing chunks

setLowerCase(value)

Sets whether to lowercase the text when matching chunks.

Parameters:
value : bool

Whether to lowercase the text when matching chunks