sparknlp.partition.partition_transformer#
Contains the PartitionTransformer class for reading various types of documents into chunks.
Module Contents#
Classes#
PartitionTransformer | The PartitionTransformer annotator allows you to use the Partition feature more smoothly within existing Spark NLP workflows.
- class PartitionTransformer(classname='com.johnsnowlabs.partition.PartitionTransformer', java_model=None)[source]#
The PartitionTransformer annotator allows you to use the Partition feature more smoothly within existing Spark NLP workflows, enabling seamless reuse of your pipelines.
It supports reading from files, URLs, in-memory strings, or byte arrays, and works within a Spark NLP pipeline.
Supported formats include:

- Plain text
- HTML
- Word (.doc/.docx)
- Excel (.xls/.xlsx)
- PowerPoint (.ppt/.pptx)
- Email files (.eml, .msg)
- PDFs
- Parameters:
- inputCols : list of str
  Names of the input columns (typically from DocumentAssembler).
- outputCol : str
  Name of the column to store the output.
- contentType : str
  The type of content, e.g. “text”, “url”, or “file”.
- headers : dict, optional
  Headers to be used when the content type is a URL.
Examples
>>> dataset = spark.createDataFrame([
...     ("https://www.blizzard.com",),
... ], ["text"])
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> partition = PartitionTransformer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("partition") \
...     .setContentType("url") \
...     .setHeaders({"Accept-Language": "es-ES"})
>>> pipeline = Pipeline(stages=[documentAssembler, partition])
>>> pipelineModel = pipeline.fit(dataset)
>>> resultDf = pipelineModel.transform(dataset)
>>> resultDf.show()
+--------------------+--------------------+--------------------+
|                text|            document|           partition|
+--------------------+--------------------+--------------------+
|https://www.blizz...|[{Title, Juegos d...|[{document, 0, 16...|
|https://www.googl...|[{Title, Gmail Im...|[{document, 0, 28...|
+--------------------+--------------------+--------------------+