sparknlp.partition.partition_transformer#
Contains the PartitionTransformer class for reading various types of documents into chunks.
Module Contents#
Classes#
PartitionTransformer | The PartitionTransformer annotator allows you to use the Partition feature more smoothly within existing Spark NLP workflows.
- class PartitionTransformer(classname='com.johnsnowlabs.partition.PartitionTransformer', java_model=None)[source]#
The PartitionTransformer annotator allows you to use the Partition feature more smoothly within existing Spark NLP workflows, enabling seamless reuse of your pipelines.
It supports reading from files, URLs, in-memory strings, or byte arrays, and works within a Spark NLP pipeline.
Supported formats include:

- Plain text
- HTML
- Word (.doc/.docx)
- Excel (.xls/.xlsx)
- PowerPoint (.ppt/.pptx)
- Email files (.eml, .msg)
- PDFs
- Parameters:
- inputCols : list of str
  Names of the input columns (typically from DocumentAssembler).
- outputCol : str
  Name of the column to store the output.
- contentType : str
  The type of content, e.g. “text”, “url”, or “file”.
- headers : dict, optional
  Headers to be used when the content type is a URL.
Examples
>>> dataset = spark.createDataFrame([
...     ("https://www.blizzard.com",),
... ], ["text"])
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> partition = PartitionTransformer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("partition") \
...     .setContentType("url") \
...     .setHeaders({"Accept-Language": "es-ES"})
>>> pipeline = Pipeline(stages=[documentAssembler, partition])
>>> pipelineModel = pipeline.fit(dataset)
>>> resultDf = pipelineModel.transform(dataset)
>>> resultDf.show()
+--------------------+--------------------+--------------------+
|                text|            document|           partition|
+--------------------+--------------------+--------------------+
|https://www.blizz...|[{Title, Juegos d...|[{document, 0, 16...|
|https://www.googl...|[{Title, Gmail Im...|[{document, 0, 28...|
+--------------------+--------------------+--------------------+