sparknlp.partition.partition_transformer

Contains the PartitionTransformer class for reading various types of documents into chunks.

Module Contents

Classes

PartitionTransformer

The PartitionTransformer annotator allows you to use the Partition feature more smoothly within existing Spark NLP workflows.

class PartitionTransformer(classname='com.johnsnowlabs.partition.PartitionTransformer', java_model=None)

The PartitionTransformer annotator makes the Partition feature available as a stage in existing Spark NLP workflows, so you can reuse your pipelines.

It can read from files, URLs, in-memory strings, or byte arrays.

Supported formats include:

- Plain text
- HTML
- Word (.doc/.docx)
- Excel (.xls/.xlsx)
- PowerPoint (.ppt/.pptx)
- Email files (.eml, .msg)
- PDFs
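To make the format list above concrete, here is a minimal sketch that maps a file extension to one of the listed formats. The helper `guess_format` and the table `SUPPORTED_FORMATS` are hypothetical names for illustration and are not part of Spark NLP. The `.txt`, `.html`, `.htm`, and `.pdf` extensions are assumptions, since the list names no extension for those formats.

```python
from pathlib import PurePath

# Extension -> format label, following the supported-formats list.
# Assumption: ".txt", ".html", ".htm" and ".pdf" are the usual extensions
# for the formats the list gives without an explicit extension.
SUPPORTED_FORMATS = {
    ".txt": "Plain text",
    ".html": "HTML",
    ".htm": "HTML",
    ".doc": "Word",
    ".docx": "Word",
    ".xls": "Excel",
    ".xlsx": "Excel",
    ".ppt": "PowerPoint",
    ".pptx": "PowerPoint",
    ".eml": "Email",
    ".msg": "Email",
    ".pdf": "PDF",
}


def guess_format(path):
    """Return the format label for a path, or None if the extension is not listed."""
    suffix = PurePath(path).suffix.lower()  # case-insensitive match on the last suffix
    return SUPPORTED_FORMATS.get(suffix)
```

A check like this could be run before building a pipeline to skip files the annotator is not documented to handle; how PartitionTransformer itself detects formats is not described here.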

Parameters:

inputCols : list of str
    Names of input columns (typically from DocumentAssembler).

outputCol : str
    Name of the column to store the output.

contentType : str
    The type of content, e.g. “text”, “url”, or “file”.

headers : dict, optional
    Headers to be used if the content type is a URL.

Examples

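>>> import sparknlp
>>> from sparknlp.base import DocumentAssembler
>>> from sparknlp.partition.partition_transformer import PartitionTransformer
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> dataset = spark.createDataFrame([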
...     ("https://www.blizzard.com",),
... ], ["text"])
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> partition = PartitionTransformer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("partition") \
...     .setContentType("url") \
...     .setHeaders({"Accept-Language": "es-ES"})
>>> pipeline = Pipeline(stages=[documentAssembler, partition])
>>> pipelineModel = pipeline.fit(dataset)
>>> resultDf = pipelineModel.transform(dataset)
>>> resultDf.show()
+--------------------+--------------------+--------------------+
|                text|            document|           partition|
+--------------------+--------------------+--------------------+
|https://www.blizz...|[{Title, Juegos d...|[{document, 0, 16...|
+--------------------+--------------------+--------------------+
name = 'PartitionTransformer'
inputAnnotatorTypes
outputAnnotatorType = 'document'
contentPath
contentType
storeContent
titleFontSize
inferTableStructure
includePageBreaks
setContentPath(value)
getContentPath()
setContentType(value)
getContentType()
setStoreContent(value)
getStoreContent()
setTitleFontSize(value)
getTitleFontSize()
setInferTableStructure(value)
getInferTableStructure()
setIncludePageBreaks(value)
getIncludePageBreaks()
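The accessors listed above pair each parameter with a `set<Param>` and `get<Param>` method. As a minimal sketch of that naming pattern, the hypothetical helper below (`apply_options` is not part of Spark NLP) applies a dict of parameter names to any object exposing matching setters. The option values in the usage comment are assumptions about the expected types, apart from `"url"`, which appears in the example above.

```python
def apply_options(stage, options):
    """Call stage.set<Name>(value) for every entry in options and return the stage.

    Hypothetical helper: it relies only on the set<Param> naming pattern
    and works with any object that exposes such setters.
    """
    for name, value in options.items():
        # "storeContent" -> "setStoreContent"
        setter_name = "set" + name[0].upper() + name[1:]
        setter = getattr(stage, setter_name, None)
        if setter is None:
            raise AttributeError(
                f"{type(stage).__name__} has no setter {setter_name}"
            )
        setter(value)
    return stage


# Usage sketch (requires Spark NLP; the boolean value is an assumed type):
# partition = apply_options(
#     PartitionTransformer(),
#     {"contentType": "url", "storeContent": True},
# )
```

Driving the setters from a dict keeps the configuration in one place, at the cost of losing the chained-call style used in the example above.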