sparknlp.reader.reader2doc

Module Contents

Classes

Reader2Doc

The Reader2Doc annotator lets you read files directly within existing Spark NLP workflows.

class Reader2Doc

The Reader2Doc annotator lets you read files directly within existing Spark NLP workflows, so your pipelines can be reused seamlessly.

Reader2Doc extracts structured content from various document types using Spark NLP readers. It supports many file types and returns the parsed output as a structured Spark DataFrame.

Supported formats include:

  • Plain text

  • HTML

  • Word (.doc/.docx)

  • Excel (.xls/.xlsx)

  • PowerPoint (.ppt/.pptx)

  • Email files (.eml, .msg)

  • PDFs
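As a rough illustration of how the formats above might map to the string passed to `setContentType`, the sketch below looks up a standard IANA media type from a file extension. The helper name `guess_content_type` and the `CONTENT_TYPES` table are hypothetical, and whether Reader2Doc accepts each of these strings is an assumption: only `"application/pdf"` appears in this page's example.

```python
from pathlib import Path

# Standard IANA media types for the formats listed above.
# ASSUMPTION: Reader2Doc may not accept every one of these strings;
# only "application/pdf" is confirmed by the example on this page.
CONTENT_TYPES = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".eml": "message/rfc822",
    ".msg": "application/vnd.ms-outlook",
    ".pdf": "application/pdf",
}


def guess_content_type(path: str) -> str:
    """Return the media type for a file path, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return CONTENT_TYPES[suffix]
    except KeyError:
        # Fail loudly instead of guessing for unknown extensions.
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None
```

The returned string could then be passed on, for example `Reader2Doc().setContentType(guess_content_type("report.pdf"))`.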

Examples

>>> import sparknlp
>>> from sparknlp.reader.reader2doc import Reader2Doc
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> # Placeholder: directory that contains the PDF files to read
>>> pdf_directory = "path/to/pdfs"
>>> # Empty input DataFrame; the content comes from the path set on Reader2Doc
>>> empty_data_set = spark.createDataFrame([], "string").toDF("text")
>>> # Initialize Reader2Doc for PDF files
>>> reader2doc = Reader2Doc() \
...     .setContentType("application/pdf") \
...     .setContentPath(f"{pdf_directory}/")
>>> # Build the pipeline with the Reader2Doc stage
>>> pipeline = Pipeline(stages=[reader2doc])
>>> # Fit the pipeline to the empty DataFrame
>>> pipeline_model = pipeline.fit(empty_data_set)
>>> result_df = pipeline_model.transform(empty_data_set)
>>> # Show the resulting DataFrame without truncating the column
>>> result_df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------+
|document                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|[{'document', 0, 14, 'This is a Title', {'pageNumber': 1, 'elementType': 'Title', 'fileName': 'pdf-title.pdf'}, []}]                        |
|[{'document', 15, 38, 'This is a narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]      |
|[{'document', 39, 68, 'This is another narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------+
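Each row of the output holds a list of annotations whose metadata carries an `elementType`, so one natural follow-up is to pick out texts by element type. The sketch below does this on plain Python stand-ins that mirror the rows shown above. The `Annotation` namedtuple and the helper `texts_by_element_type` are hypothetical names used for illustration only, and the metadata values are written as strings here as an assumption.

```python
from collections import namedtuple

# Stand-in for a Spark NLP annotation, reduced to the two fields used here.
Annotation = namedtuple("Annotation", ["result", "metadata"])


def texts_by_element_type(rows, element_type):
    """Collect annotation texts whose metadata 'elementType' matches."""
    return [
        ann.result
        for annotations in rows          # one list of annotations per row
        for ann in annotations
        if ann.metadata.get("elementType") == element_type
    ]


# Sample data copied from the example output above.
rows = [
    [Annotation("This is a Title",
                {"pageNumber": "1", "elementType": "Title", "fileName": "pdf-title.pdf"})],
    [Annotation("This is a narrative text",
                {"pageNumber": "1", "elementType": "NarrativeText", "fileName": "pdf-title.pdf"})],
    [Annotation("This is another narrative text",
                {"pageNumber": "1", "elementType": "NarrativeText", "fileName": "pdf-title.pdf"})],
]

print(texts_by_element_type(rows, "NarrativeText"))
# ['This is a narrative text', 'This is another narrative text']
```

With a real DataFrame, the same helper should work on rows gathered with something like `[r.document for r in result_df.select("document").collect()]`, assuming the collected annotation structs expose `result` and `metadata` as in the stand-in.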
name = 'Reader2Doc'
outputAnnotatorType = 'document'
excludeNonText
setExcludeNonText(value)

Sets whether to exclude non-text content from the output.

Parameters:
value : bool

Whether to exclude non-text content from the output. Default is False.
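A minimal sketch of how this setter might be combined with the other setters used on this page. The helper `configure_pdf_reader` is a hypothetical name; it takes an already constructed reader object so the snippet stays independent of a Spark installation, and it calls each setter as a separate statement rather than assuming that `setExcludeNonText` returns the reader for chaining.

```python
def configure_pdf_reader(reader, content_path, exclude_non_text=True):
    """Apply the setters documented on this page to a Reader2Doc-like object."""
    reader.setContentType("application/pdf")
    reader.setContentPath(content_path)
    # Drop non-text content from the output when exclude_non_text is True.
    reader.setExcludeNonText(exclude_non_text)
    return reader
```

In a Spark NLP session this could be called as `configure_pdf_reader(Reader2Doc(), "path/to/pdfs/")`, where the path is a placeholder, and the result used as a pipeline stage.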

setParams()