`sparknlp.reader.reader2doc`

Module Contents

Classes

Reader2Doc

The Reader2Doc annotator lets you read files directly within existing Spark NLP workflows.

class Reader2Doc

The Reader2Doc annotator lets you read files directly within existing Spark NLP workflows, so you can reuse your pipelines without a separate loading step.

Reader2Doc extracts structured content from various document types using Spark NLP readers. It supports many file types and returns the parsed output as a structured Spark DataFrame.

Supported formats include:

Plain text
HTML
Word (.doc/.docx)
Excel (.xls/.xlsx)
PowerPoint (.ppt/.pptx)
Email files (.eml, .msg)
PDFs
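
Because `setContentType` expects a MIME type, a small lookup from file extension to MIME type can help when choosing a content type for the formats listed above. The sketch below is illustrative only: `application/pdf` is the one value confirmed by the example on this page, while the other strings are the standard media types for those extensions, and whether Reader2Doc accepts each of them is an assumption to verify against your Spark NLP version. `MIME_BY_EXTENSION` and `content_type_for` are hypothetical names, not part of the library.

```python
from pathlib import Path

# Standard MIME types for the formats listed above.
# Assumption: Reader2Doc accepts these exact strings; only
# "application/pdf" is confirmed by the example on this page.
MIME_BY_EXTENSION = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".eml": "message/rfc822",
    ".msg": "application/vnd.ms-outlook",
    ".pdf": "application/pdf",
}


def content_type_for(path: str) -> str:
    """Return the MIME type for a file path, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return MIME_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


print(content_type_for("reports/annual.PDF"))  # application/pdf
```

The returned string could then be passed to `setContentType`, for example `Reader2Doc().setContentType(content_type_for(path))`.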

Examples

>>> import sparknlp
>>> from sparknlp.reader.reader2doc import Reader2Doc
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> # Reader2Doc reads from disk, so the input DataFrame can be empty
>>> empty_data_set = spark.createDataFrame([], "string").toDF("text")
>>> # Initialize Reader2Doc for PDF files
>>> # (pdf_directory is a placeholder for a directory containing PDF files)
>>> reader2doc = Reader2Doc() \
...     .setContentType("application/pdf") \
...     .setContentPath(f"{pdf_directory}/")
>>> # Build the pipeline with the Reader2Doc stage
>>> pipeline = Pipeline(stages=[reader2doc])
>>> # Fit and apply the pipeline to the empty DataFrame
>>> pipeline_model = pipeline.fit(empty_data_set)
>>> result_df = pipeline_model.transform(empty_data_set)
>>> # Show the resulting DataFrame without truncating the column
>>> result_df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------+
|document                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|[{'document', 0, 14, 'This is a Title', {'pageNumber': 1, 'elementType': 'Title', 'fileName': 'pdf-title.pdf'}, []}]                        |
|[{'document', 15, 38, 'This is a narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]      |
|[{'document', 39, 68, 'This is another narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------+
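
Each entry in the `document` column follows Spark NLP's annotation layout: annotator type, begin offset, end offset, result text, metadata, and embeddings. The three rows in the output suggest that offsets run continuously across the elements of a file and that the end offset is inclusive. The pure-Python sketch below checks this against values hand-copied from that output; `offsets_are_consistent` is a hypothetical helper, not part of the library.

```python
# Annotations hand-copied from the example output (metadata trimmed
# to the field used here).
annotations = [
    {"begin": 0, "end": 14, "result": "This is a Title",
     "metadata": {"elementType": "Title"}},
    {"begin": 15, "end": 38, "result": "This is a narrative text",
     "metadata": {"elementType": "NarrativeText"}},
    {"begin": 39, "end": 68, "result": "This is another narrative text",
     "metadata": {"elementType": "NarrativeText"}},
]


def offsets_are_consistent(annotation: dict) -> bool:
    """Check that the inclusive end offset matches the length of the text."""
    span = annotation["end"] - annotation["begin"] + 1
    return span == len(annotation["result"])


# Every span length equals the length of its text.
assert all(offsets_are_consistent(a) for a in annotations)

# Collect the text of every element tagged as a title.
titles = [a["result"] for a in annotations
          if a["metadata"]["elementType"] == "Title"]
print(titles)  # ['This is a Title']
```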

name = 'Reader2Doc'

outputAnnotatorType = 'document'

contentPath

outputCol

contentType

explodeDocs

flattenOutput

titleThreshold

setParams()

setContentPath(value)

Sets the content path.

Parameters:

value : str
Path to the files to read

setContentType(value)

Sets the content type to load, following the MIME specification.

Parameters:

value : str
Content type to load, following the MIME specification (for example, "application/pdf")

setExplodeDocs(value)

Sets whether to explode the documents into separate rows.

Parameters:

value : bool
Whether to explode the documents into separate rows
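
To illustrate what exploding into separate rows means, the pure-Python sketch below turns one row holding a list of documents into one row per document. It mirrors the idea only: how Reader2Doc implements this internally, and the default value of `explodeDocs`, are not stated here. `explode_rows` and the sample row are hypothetical.

```python
def explode_rows(rows: list[dict]) -> list[dict]:
    """Turn each row holding a list of documents into one row per document."""
    exploded = []
    for row in rows:
        for doc in row["document"]:
            # Copy the other columns and keep a single-element list,
            # so the column still holds a list of documents.
            exploded.append({**row, "document": [doc]})
    return exploded


rows = [
    {
        "fileName": "pdf-title.pdf",
        "document": ["This is a Title", "This is a narrative text"],
    }
]

for row in explode_rows(rows):
    print(row)
```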

setOutputCol(value)

Sets the output column name.

Parameters:

value : str
Name of the output column

setFlattenOutput(value)

Sets whether to flatten the output to plain text with minimal metadata.

Parameters:

value : bool
If True, the output is flattened to plain text with minimal metadata

setTitleThreshold(value)

Sets the minimum font size threshold for title detection in PDF documents.

Parameters:

value : float
Minimum font size threshold for title detection in PDF documents
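
As a rough illustration of a font-size threshold, the sketch below labels a text element as a title when its font size reaches the threshold. This reflects the general idea only: how the PDF reader measures font size, whether the comparison is inclusive, and the default threshold are not documented here. `classify_element` and the sample font sizes are hypothetical.

```python
def classify_element(font_size: float, title_threshold: float) -> str:
    """Label an element by comparing its font size with the threshold."""
    # Assumption: sizes at or above the threshold count as titles.
    return "Title" if font_size >= title_threshold else "NarrativeText"


# Hypothetical (text, font size) pairs.
elements = [("This is a Title", 18.0), ("This is a narrative text", 11.0)]

for text, size in elements:
    print(text, "->", classify_element(size, title_threshold=16.0))
```

Under this reading, raising the threshold makes title detection stricter, since fewer elements are large enough to qualify.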
