sparknlp.reader.reader2doc

Module Contents

Classes

Reader2Doc

The Reader2Doc annotator lets you read files within existing Spark NLP workflows.

class Reader2Doc

The Reader2Doc annotator lets you read files directly within existing Spark NLP workflows, so you can reuse your existing pipelines.

Reader2Doc can be used for extracting structured content from various document types using Spark NLP readers. It supports reading from many file types and returns parsed output as a structured Spark DataFrame.

Supported formats include:

  • Plain text

  • HTML

  • Word (.doc/.docx)

  • Excel (.xls/.xlsx)

  • PowerPoint (.ppt/.pptx)

  • Email files (.eml, .msg)

  • PDFs
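
Because setContentType expects a MIME type rather than a file extension, a small lookup table can help when choosing the value. The sketch below maps the extensions listed above to their standard IANA media types. The helper name guess_content_type and the table MIME_BY_EXTENSION are hypothetical, and only "application/pdf" is confirmed by the example in this page; that Reader2Doc accepts each of the other strings is an assumption to verify against your Spark NLP version.

```python
from pathlib import Path

# Standard IANA media types for the formats listed above.
# Assumption: Reader2Doc accepts these exact strings; only
# "application/pdf" appears in the documented example.
MIME_BY_EXTENSION = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".eml": "message/rfc822",
    ".msg": "application/vnd.ms-outlook",
    ".pdf": "application/pdf",
}


def guess_content_type(path):
    """Return the MIME type for a file path, or raise ValueError if unknown."""
    suffix = Path(path).suffix.lower()
    try:
        return MIME_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


print(guess_content_type("reports/summary.PDF"))
```

The returned string could then be passed to setContentType, for example `Reader2Doc().setContentType(guess_content_type(path))`.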

Examples

>>> from sparknlp.reader.reader2doc import Reader2Doc
>>> from sparknlp.base import DocumentAssembler
>>> from pyspark.ml import Pipeline
>>> # Assumes an active Spark session `spark` and a `pdf_directory` path string
>>> empty_data_set = spark.createDataFrame([], "string").toDF("text")
>>> # Initialize Reader2Doc for PDF files
>>> reader2doc = Reader2Doc() \
...     .setContentType("application/pdf") \
...     .setContentPath(f"{pdf_directory}/")
>>> # Build the pipeline with the Reader2Doc stage
>>> pipeline = Pipeline(stages=[reader2doc])
>>> # Fit and run on an empty DataFrame; the files are read from contentPath
>>> pipeline_model = pipeline.fit(empty_data_set)
>>> result_df = pipeline_model.transform(empty_data_set)
>>> # Show the resulting document column without truncation
>>> result_df.select("document").show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------+
|document                                                                                                                            |
+------------------------------------------------------------------------------------------------------------------------------------+
|[{'document', 0, 14, 'This is a Title', {'pageNumber': 1, 'elementType': 'Title', 'fileName': 'pdf-title.pdf'}, []}]               |
|[{'document', 15, 38, 'This is a narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]|
|[{'document', 39, 68, 'This is another narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]|
+------------------------------------------------------------------------------------------------------------------------------------+
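
Each annotation in the output above carries an elementType in its metadata, which makes it easy to separate titles from body text after collecting the rows. The following is a minimal sketch on plain Python dicts that mirror the three annotations shown; the field names (annotatorType, begin, end, result, metadata) follow the usual Spark NLP annotation layout and are an assumption here, as is the hypothetical helper group_by_element_type.

```python
from collections import defaultdict

# Sample annotations copied from the output above, written as plain dicts.
# Assumption: field names follow the usual Spark NLP annotation layout.
annotations = [
    {"annotatorType": "document", "begin": 0, "end": 14,
     "result": "This is a Title",
     "metadata": {"pageNumber": 1, "elementType": "Title",
                  "fileName": "pdf-title.pdf"}},
    {"annotatorType": "document", "begin": 15, "end": 38,
     "result": "This is a narrative text",
     "metadata": {"pageNumber": 1, "elementType": "NarrativeText",
                  "fileName": "pdf-title.pdf"}},
    {"annotatorType": "document", "begin": 39, "end": 68,
     "result": "This is another narrative text",
     "metadata": {"pageNumber": 1, "elementType": "NarrativeText",
                  "fileName": "pdf-title.pdf"}},
]


def group_by_element_type(annotations):
    """Group annotation texts by the elementType stored in their metadata."""
    grouped = defaultdict(list)
    for annotation in annotations:
        element_type = annotation["metadata"]["elementType"]
        grouped[element_type].append(annotation["result"])
    return dict(grouped)


grouped = group_by_element_type(annotations)
print(grouped["Title"])
print(grouped["NarrativeText"])
```

With a real DataFrame, the same dicts could be obtained by collecting the rows and converting each annotation with `asDict()`, though the exact row shape depends on settings such as explodeDocs.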
name = 'Reader2Doc'
outputAnnotatorType = 'document'
contentPath
outputCol
contentType
explodeDocs
flattenOutput
titleThreshold
setParams()
setContentPath(value)

Sets the content path.

Parameters:
value : str

Path to the files to read

setContentType(value)

Sets the content type to load, following the MIME specification.

Parameters:
value : str

Content type to load, following the MIME specification

setExplodeDocs(value)

Sets whether to explode the documents into separate rows.

Parameters:
value : bool

Whether to explode the documents into separate rows

setOutputCol(value)

Sets the output column name.

Parameters:
value : str

Name of the output column

setFlattenOutput(value)

Sets whether to flatten the output to plain text with minimal metadata.

Parameters:
value : bool

If true, output is flattened to plain text with minimal metadata

setTitleThreshold(value)

Sets the minimum font size threshold for title detection in PDF documents.

Parameters:
value : float

Minimum font size threshold for title detection in PDF documents
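
Each parameter listed above has a setter named set followed by the capitalized parameter name, and the class example chains those setters. The sketch below uses that naming convention to apply several options from keyword arguments. Both apply_params and RecordingReader are hypothetical: RecordingReader is only a stand-in so the example runs without Spark, and the threshold value 18.0 is purely illustrative.

```python
def apply_params(annotator, **params):
    """Call the matching set<Param>(value) method for each keyword argument."""
    for name, value in params.items():
        setter_name = "set" + name[0].upper() + name[1:]
        setter = getattr(annotator, setter_name, None)
        if not callable(setter):
            raise AttributeError(f"no setter named {setter_name}()")
        setter(value)
    return annotator


class RecordingReader:
    """Hypothetical stand-in for Reader2Doc so the sketch runs without Spark."""

    def __init__(self):
        self.params = {}

    def setContentType(self, value):
        self.params["contentType"] = value
        return self

    def setExplodeDocs(self, value):
        self.params["explodeDocs"] = value
        return self

    def setTitleThreshold(self, value):
        self.params["titleThreshold"] = value
        return self


reader = apply_params(
    RecordingReader(),
    contentType="application/pdf",
    explodeDocs=True,
    titleThreshold=18.0,  # illustrative value, not a recommended default
)
print(reader.params)
```

With Spark NLP installed, the same helper could presumably be called with `Reader2Doc()` in place of the stand-in, since it relies only on the setter names documented above.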