sparknlp.annotator.chunk2_doc

Contains classes for Chunk2Doc.

Module Contents

Classes

Chunk2Doc

Converts a CHUNK type column back into DOCUMENT.

class Chunk2Doc

Converts a CHUNK type column back into DOCUMENT.

Useful when trying to re-tokenize or do further analysis on a CHUNK result.

Input Annotation types: CHUNK

Output Annotation type: DOCUMENT

Parameters:
None

See also

Doc2Chunk : for converting DOCUMENT annotations to CHUNK
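
For the inverse direction, Doc2Chunk turns the contents of a string column into CHUNK annotations anchored on a DOCUMENT column. A minimal sketch, assuming a DataFrame with `text` and `target` columns (both column names are illustrative, not fixed by the API):

>>> from sparknlp.base import DocumentAssembler, Doc2Chunk
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> chunkAssembler = Doc2Chunk().setInputCols("document").setChunkCol("target").setOutputCol("chunk")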

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.pretrained import PretrainedPipeline

Location entities are extracted and converted back into DOCUMENT type for further processing.

>>> data = spark.createDataFrame([[1, "New York and New Jersey aren't that far apart actually."]]).toDF("id", "text")

Define a pretrained pipeline that, among other things, extracts Named Entities, and apply Chunk2Doc to its output.

>>> pipeline = PretrainedPipeline("explain_document_dl")
>>> chunkToDoc = Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
>>> explainResult = pipeline.transform(data)

Show results.

>>> result = chunkToDoc.transform(explainResult)
>>> result.selectExpr("explode(chunkConverted)").show(truncate=False)
+------------------------------------------------------------------------------+
|col                                                                           |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
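
Because Chunk2Doc emits DOCUMENT annotations, the converted chunks can be passed to any annotator that consumes documents, for example to re-tokenize them as mentioned above. A minimal sketch continuing from the `result` DataFrame; the `chunkTokenizer` name and the `chunkToken` column are illustrative, not part of the pipeline above:

>>> from sparknlp.annotator import Tokenizer
>>> chunkTokenizer = Tokenizer().setInputCols(["chunkConverted"]).setOutputCol("chunkToken")
>>> tokenized = chunkTokenizer.fit(result).transform(result)
>>> tokenized.selectExpr("chunkToken.result").show(truncate=False)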