sparknlp.base.doc2_chunk

Contains classes for Doc2Chunk.

Module Contents

Classes

Doc2Chunk

Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.

class Doc2Chunk

Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.

Chunk text must be contained within the input DOCUMENT. The chunk column may be either StringType or ArrayType[StringType] (toggled with setIsArray). Useful for annotators that require a CHUNK type input.
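Conceptually, resolving a chunk string against the document amounts to a substring search that yields begin/end character offsets. The following is an illustrative sketch only, not Spark NLP's actual implementation; the helper name `locate_chunk` and the inclusive begin/end convention are assumptions made for the example:

```python
def locate_chunk(document: str, chunk: str, lower_case: bool = False):
    """Return (begin, end) character offsets of `chunk` in `document`,
    or None if the chunk text is not contained in the document.
    Offsets are inclusive, mirroring the begin/end fields of an Annotation."""
    haystack = document.lower() if lower_case else document
    needle = chunk.lower() if lower_case else chunk
    begin = haystack.find(needle)
    if begin == -1:
        # Chunk text is not part of the document
        return None
    return begin, begin + len(chunk) - 1

doc = "Spark NLP is an open-source text processing library."
print(locate_chunk(doc, "text processing library"))  # (28, 50)
```

This also illustrates why the chunk text must literally occur inside the document: when the search fails, no offsets can be produced.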

Input Annotation types: DOCUMENT

Output Annotation type: CHUNK

Parameters:
chunkCol

Column that contains the string. Must be part of DOCUMENT.

startCol

Column that holds a reference to where the chunk begins.

startColByTokenIndex

Whether the start column is prepended by whitespace tokens.

isArray

Whether the chunkCol is an array of strings, by default False.

failOnMissing

Whether to fail the job if a chunk is not found within the document. Returns empty otherwise.

lowerCase

Whether to lowercase the text when matching chunks.
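The interplay of failOnMissing and lowerCase can be sketched in plain Python. `resolve_chunks` below is a hypothetical helper written for illustration, not part of the Spark NLP API; it only mimics the documented behavior of failing versus returning empty on a missing chunk:

```python
def resolve_chunks(document, chunk_texts, fail_on_missing=False, lower_case=False):
    """Resolve an array of chunk strings against one document,
    returning a list of (begin, end) inclusive character spans."""
    haystack = document.lower() if lower_case else document
    spans = []
    for text in chunk_texts:
        needle = text.lower() if lower_case else text
        begin = haystack.find(needle)
        if begin == -1:
            if fail_on_missing:
                # failOnMissing=True: abort instead of silently dropping the chunk
                raise ValueError(f"chunk not found in document: {text!r}")
            continue  # failOnMissing=False: yield nothing for this chunk
        spans.append((begin, begin + len(text) - 1))
    return spans
```

With `lower_case=True` the match is case-insensitive, while the reported offsets still refer to the original document text.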

See also

Chunk2Doc

for converting CHUNK annotations to DOCUMENT

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> chunkAssembler = Doc2Chunk() \
...     .setInputCols("document") \
...     .setChunkCol("target") \
...     .setOutputCol("chunk") \
...     .setIsArray(True)
>>> data = spark.createDataFrame([[
...     "Spark NLP is an open-source text processing library for advanced natural language processing.",
...     ["Spark NLP", "text processing library", "natural language processing"]
... ]]).toDF("text", "target")
>>> pipeline = Pipeline().setStages([documentAssembler, chunkAssembler]).fit(data)
>>> result = pipeline.transform(data)
>>> result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
setChunkCol(value)

Sets column that contains the string. Must be part of DOCUMENT.

Parameters:
value : str

Name of the Chunk Column

setIsArray(value)

Sets whether the chunkCol is an array of strings.

Parameters:
value : bool

Whether the chunkCol is an array of strings

setStartCol(value)

Sets the column that holds a reference to where the chunk begins.

Parameters:
value : str

Name of the reference column

setStartColByTokenIndex(value)

Sets whether start column is prepended by whitespace tokens.

Parameters:
value : bool

Whether the start column is prepended by whitespace tokens
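When the start column is given as a whitespace-token index rather than a character offset, it has to be mapped onto a character position. A hedged sketch of such a conversion, with an illustrative helper name that is not part of the library:

```python
def token_index_to_char_offset(document: str, token_index: int) -> int:
    """Map a whitespace-token index to the character offset where that
    token begins. Illustrative only; not Spark NLP's internal code."""
    offset = 0
    for i, token in enumerate(document.split()):
        # Find this token starting after the previous one, so repeated
        # substrings earlier in the text are not matched by mistake.
        offset = document.index(token, offset)
        if i == token_index:
            return offset
        offset += len(token)
    raise IndexError(f"token index {token_index} out of range")

print(token_index_to_char_offset("Spark NLP is great", 2))  # 10
```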

setFailOnMissing(value)

Sets whether to fail the job if a chunk is not found within the document. Returns empty otherwise.

Parameters:
value : bool

Whether to fail job on missing chunks

setLowerCase(value)

Sets whether to lowercase the text when matching chunks.

Parameters:
value : bool

Whether to lowercase the text when matching chunks