sparknlp.partition.partition
Contains the Partition annotator for reading and processing various document types.
Module Contents

Classes

Partition: A unified interface for extracting structured content from various document types.
class Partition(**kwargs)
A unified interface for extracting structured content from various document types using Spark NLP readers.
This class supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.
Supported formats include:

- Plain text
- HTML
- Word (.doc/.docx)
- Excel (.xls/.xlsx)
- PowerPoint (.ppt/.pptx)
- Email files (.eml, .msg)
- PDFs
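When no content_type is given, the file type is detected automatically. As a quick sketch (the Word-file path below is hypothetical):

>>> # Hypothetical path; the format is inferred from the file itself
>>> doc_df = Partition().partition("./word-files/quarterly-report.docx")
>>> doc_df.show()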
Parameters:

params : dict, optional
    Configuration parameters, including the following (a combined-usage sketch follows this list):

    content_type : str
        Override automatic file type detection.
    store_content : bool
        Include the raw file content in the output DataFrame.
    timeout : int
        Timeout for fetching HTML content.
    title_font_size : int
        Font size used to identify titles.
    include_page_breaks : bool
        Tag content with page-break metadata.
    group_broken_paragraphs : bool
        Merge broken lines into full paragraphs.
    title_length_size : int
        Maximum character length for a line to qualify as a title.
    paragraph_split : str
        Regex used to detect paragraph boundaries.
    short_line_word_threshold : int
        Maximum number of words for a line to be considered short.
    threshold : float
        Ratio of empty lines that triggers a switch in paragraph-grouping strategy.
    max_line_count : int
        Maximum number of lines evaluated during paragraph analysis.
    include_slide_notes : bool
        Include speaker notes in the output.
    infer_table_structure : bool
        Generate an HTML representation of table structure.
    append_cells : bool
        Merge Excel rows into a single block.
    cell_separator : str
        String used to join cell values within a row.
    add_attachment_content : bool
        Include the text content of plain-text email attachments.
    headers : dict
        HTTP request headers used when fetching URLs.
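As a minimal sketch, several of these options can be combined as keyword arguments. The directory path is hypothetical and the option values are illustrative choices, not library defaults:

>>> # Hypothetical directory of plain-text files
>>> partition = Partition(
...     content_type="text/plain",
...     group_broken_paragraphs=True,
...     paragraph_split=r"\n\s*\n",
...     include_page_breaks=True,
... )
>>> partition_df = partition.partition("/content/txtfiles/reader/txt")
>>> partition_df.show()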
Examples
Reading Text Files
>>> txt_directory = "/content/txtfiles/reader/txt"
>>> partition_df = Partition(content_type="text/plain").partition(txt_directory)
>>> partition_df.show()
+--------------------+--------------------+
|                path|                 txt|
+--------------------+--------------------+
|file:/content/txt...|[{Title, BIG DATA...|
+--------------------+--------------------+
Reading Email Files
>>> partition_df = Partition().partition("./email-files/test-several-attachments.eml")
>>> partition_df.show()
+--------------------+--------------------+
|                path|               email|
+--------------------+--------------------+
|file:/content/ema...|[{Title, Test Sev...|
+--------------------+--------------------+
Reading Webpages
>>> partition_df = Partition().partition(
...     "https://www.wikipedia.com",
...     headers={"Accept-Language": "es-ES"}
... )
>>> partition_df.show()
+--------------------+--------------------+
|                 url|                html|
+--------------------+--------------------+
|https://www.wikip...|[{Title, Wikipedi...|
+--------------------+--------------------+
For more examples, refer to: examples/python/data-preprocessing/SparkNLP_Partition_Reader_Demo.ipynb
partition(path, headers=None)
Reads and parses content from a URL, file, or directory path.
Parameters:

path : str
    Path to a file or directory. Local paths, URLs, and distributed file systems (DFS) are supported.
headers : dict, optional
    HTTP request headers for URL requests.
Returns:

pyspark.sql.DataFrame
    DataFrame with the parsed content.
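The parsed column holds an array of element structs (elementType, content, metadata; see the schema printed under partition_urls below). As a sketch, these elements can be flattened with standard PySpark functions; the column name txt assumes plain-text input, as in the examples above:

>>> from pyspark.sql import functions as F
>>> # Explode the array of parsed elements into one row per element
>>> elements_df = partition_df.select(F.explode("txt").alias("element"))
>>> elements_df.select("element.elementType", "element.content").show(truncate=False)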
partition_urls(path, headers=None)
Reads and parses content from multiple URLs.
Parameters:

path : list[str]
    List of URLs.
headers : dict, optional
    HTTP request headers for the URL requests.

Returns:

pyspark.sql.DataFrame
    DataFrame with the parsed URL content.
Examples
>>> urls_df = Partition().partition_urls([
...     "https://www.wikipedia.org", "https://example.com/"
... ])
>>> urls_df.show()
+--------------------+--------------------+
|                 url|                html|
+--------------------+--------------------+
|https://www.wikip...|[{Title, Wikipedi...|
|https://example.com/|[{Title, Example ...|
+--------------------+--------------------+
>>> urls_df.printSchema()
root
 |-- url: string (nullable = true)
 |-- html: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
partition_text(text)
Parses content from a raw text string.
Parameters:

text : str
    Raw text input.

Returns:

pyspark.sql.DataFrame
    DataFrame with the parsed text.
Examples
>>> raw_text = (
...     "The big brown fox\n"
...     "was walking down the lane.\n"
...     "\n"
...     "At the end of the lane,\n"
...     "the fox met a bear."
... )
>>> text_df = Partition(group_broken_paragraphs=True).partition_text(text=raw_text)
>>> text_df.show()
+--------------------------------------+
|txt                                   |
+--------------------------------------+
|[{NarrativeText, The big brown fox was|
+--------------------------------------+
>>> text_df.printSchema()
root
 |-- txt: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)