sparknlp.base.doc2_chunk

Contains classes for Doc2Chunk.

Module Contents

Classes

Doc2Chunk
    Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.
- class Doc2Chunk[source]

  Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. Chunk text must be contained within the input DOCUMENT. May be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.

  Input Annotation types: DOCUMENT
  Output Annotation type: CHUNK
- Parameters:
- chunkCol
Column that contains the string. Must be part of DOCUMENT
- startCol
Column that has a reference of where the chunk begins
- startColByTokenIndex
Whether the start column is measured in whitespace tokens rather than characters
- isArray
Whether the chunkCol is an array of strings, by default False
- failOnMissing
Whether to fail the job if a chunk is not found within the document; returns an empty result otherwise
- lowerCase
Whether to lowercase the text for case-insensitive matching
See also
Chunk2Doc
for converting CHUNK annotations to DOCUMENT
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> chunkAssembler = Doc2Chunk() \
...     .setInputCols("document") \
...     .setChunkCol("target") \
...     .setOutputCol("chunk") \
...     .setIsArray(True)
>>> data = spark.createDataFrame([[
...     "Spark NLP is an open-source text processing library for advanced natural language processing.",
...     ["Spark NLP", "text processing library", "natural language processing"]
... ]]).toDF("text", "target")
>>> pipeline = Pipeline().setStages([documentAssembler, chunkAssembler]).fit(data)
>>> result = pipeline.transform(data)
>>> result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
- setChunkCol(value)[source]#
Sets column that contains the string. Must be part of DOCUMENT.
- Parameters:
- value : str
Name of the chunk column
- setIsArray(value)[source]#
Sets whether the chunkCol is an array of strings.
- Parameters:
- value : bool
Whether the chunkCol is an array of strings
- setStartCol(value)[source]#
Sets column that has a reference of where chunk begins.
- Parameters:
- value : str
Name of the reference column
- setStartColByTokenIndex(value)[source]#
Sets whether start column is prepended by whitespace tokens.
- Parameters:
- value : bool
Whether the start column is prepended by whitespace tokens