sparknlp.reader.reader2doc
Module Contents
Classes
- class Reader2Doc
The Reader2Doc annotator lets you read files directly within existing Spark NLP workflows, so you can reuse your pipelines without a separate loading step.
Reader2Doc extracts structured content from various document types using Spark NLP readers. It reads many file types and returns the parsed output as a structured Spark DataFrame.
Supported formats include:
Plain text
HTML
Word (.doc/.docx)
Excel (.xls/.xlsx)
PowerPoint (.ppt/.pptx)
Email files (.eml, .msg)
PDFs
Examples
>>> from sparknlp.reader.reader2doc import Reader2Doc
>>> from sparknlp.base import DocumentAssembler
>>> from pyspark.ml import Pipeline
>>> # pdf_directory and empty_data_set are assumed to be defined beforehand
>>> # Initialize Reader2Doc for PDF files
>>> reader2doc = Reader2Doc() \
...     .setContentType("application/pdf") \
...     .setContentPath(f"{pdf_directory}/")
>>> # Build the pipeline with the Reader2Doc stage
>>> pipeline = Pipeline(stages=[reader2doc])
>>> # Fit the pipeline to an empty DataFrame
>>> pipeline_model = pipeline.fit(empty_data_set)
>>> result_df = pipeline_model.transform(empty_data_set)
>>> # Show the resulting DataFrame
>>> result_df.show()
+------------------------------------------------------------------------------------------------------------------------------------+
|document |
+------------------------------------------------------------------------------------------------------------------------------------+
|[{'document', 0, 14, 'This is a Title', {'pageNumber': 1, 'elementType': 'Title', 'fileName': 'pdf-title.pdf'}, []}] |
|[{'document', 15, 38, 'This is a narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]|
|[{'document', 39, 68, 'This is another narrative text', {'pageNumber': 1, 'elementType': 'NarrativeText', 'fileName': 'pdf-title.pdf'}, []}]|
+------------------------------------------------------------------------------------------------------------------------------------+
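Each row of the output above holds annotations with a text span, character offsets, and metadata such as elementType. The sketch below uses plain Python dictionaries as stand-ins for those annotations (it does not touch Spark) to show how elements could be selected by type once collected. The field names annotatorType, begin, end, result, and metadata follow Spark NLP's annotation schema; the helper texts_by_element_type is hypothetical and not part of the library.

```python
# Plain-Python stand-ins for the annotations in the example output.
# Values are copied from the rows shown above; nothing here calls Spark.
annotations = [
    {"annotatorType": "document", "begin": 0, "end": 14,
     "result": "This is a Title",
     "metadata": {"pageNumber": 1, "elementType": "Title",
                  "fileName": "pdf-title.pdf"}},
    {"annotatorType": "document", "begin": 15, "end": 38,
     "result": "This is a narrative text",
     "metadata": {"pageNumber": 1, "elementType": "NarrativeText",
                  "fileName": "pdf-title.pdf"}},
    {"annotatorType": "document", "begin": 39, "end": 68,
     "result": "This is another narrative text",
     "metadata": {"pageNumber": 1, "elementType": "NarrativeText",
                  "fileName": "pdf-title.pdf"}},
]


def texts_by_element_type(items, element_type):
    """Return the text of every annotation whose elementType matches (hypothetical helper)."""
    return [a["result"] for a in items
            if a["metadata"].get("elementType") == element_type]


titles = texts_by_element_type(annotations, "Title")
narrative = texts_by_element_type(annotations, "NarrativeText")

# In these rows the offsets are inclusive: end - begin + 1 equals the text length.
offsets_consistent = all(
    a["end"] - a["begin"] + 1 == len(a["result"]) for a in annotations
)

print(titles)
print(narrative)
```

On a real DataFrame the same selection would be done with Spark column expressions rather than Python lists; the sketch only illustrates the shape of the data.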
- setContentPath(value)
Sets the content path.
- Parameters:
- value : str
Path to the files to read
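Before pointing setContentPath at a directory, it can help to confirm that the directory actually contains files of the expected type. The following is a minimal sketch for local directories only; the helper files_with_suffix is hypothetical and not part of Spark NLP, and non-local file systems are not covered.

```python
import tempfile
from pathlib import Path


def files_with_suffix(directory, suffix):
    """Return sorted names of files in `directory` whose extension equals `suffix`, ignoring case."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{directory!r} is not a directory")
    return sorted(
        p.name for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == suffix.lower()
    )


# Demonstration on a throwaway directory holding two PDFs and one text file.
with tempfile.TemporaryDirectory() as pdf_directory:
    for name in ("a.pdf", "b.PDF", "notes.txt"):
        (Path(pdf_directory) / name).write_bytes(b"")
    found = files_with_suffix(pdf_directory, ".pdf")

print(found)
```

If the returned list is empty, the path or the content type is probably not the one intended.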
- setContentType(value)
Sets the content type to load, following the MIME specification.
- Parameters:
- value : str
Content type to load, following the MIME specification (for example, "application/pdf")
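As an illustration of choosing a content type, the sketch below maps file extensions for the supported formats to their standard MIME types. The MIME strings themselves are the registered ones, but only "application/pdf" appears in the example on this page, so treat the assumption that Reader2Doc accepts each of the other strings as something to verify. The names MIME_BY_SUFFIX and content_type_for are hypothetical.

```python
from pathlib import Path

# Standard MIME types for the formats listed under "Supported formats".
# Assumption: Reader2Doc accepts these exact strings; this page only
# demonstrates "application/pdf".
MIME_BY_SUFFIX = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".htm": "text/html",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".eml": "message/rfc822",
    ".msg": "application/vnd.ms-outlook",
    ".pdf": "application/pdf",
}


def content_type_for(filename):
    """Return the MIME type for `filename`, or None if the extension is not in the table."""
    return MIME_BY_SUFFIX.get(Path(filename).suffix.lower())


print(content_type_for("report.pdf"))
print(content_type_for("Slides.PPTX"))
print(content_type_for("archive.zip"))
```

The returned string could then be passed to setContentType; a None result signals a format outside the list above.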
- setExplodeDocs(value)
Sets whether to explode the documents into separate rows.
- Parameters:
- value : boolean
Whether to explode the documents into separate rows
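To make the row-shape difference concrete, here is a plain-Python sketch of what exploding means. It assumes, without confirmation from this page, that the non-exploded form keeps all elements of a file together in a single row; the function as_rows is hypothetical and only mimics the shape of the result, not the Spark implementation.

```python
# Parsed elements per file, using the texts from the example on this page.
parsed = {
    "pdf-title.pdf": [
        "This is a Title",
        "This is a narrative text",
        "This is another narrative text",
    ],
}


def as_rows(parsed_files, explode):
    """Return one row per element if `explode` is true, else one row per file (assumed behavior)."""
    if explode:
        return [[text] for texts in parsed_files.values() for text in texts]
    return [list(texts) for texts in parsed_files.values()]


exploded = as_rows(parsed, explode=True)   # three rows, one element each
grouped = as_rows(parsed, explode=False)   # one row holding three elements

print(len(exploded), len(grouped))
```

Exploded rows are convenient for per-element filtering, while grouped rows keep each file's content together.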