sparknlp.partition.partition#

Contains the Partition annotator for reading and processing various document types.

Module Contents#

Classes#

Partition

A unified interface for extracting structured content from various document types

class Partition(**kwargs)[source]#

A unified interface for extracting structured content from various document types using Spark NLP readers.

This class supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.

Supported formats include:

  • Plain text
  • HTML
  • Word (.doc/.docx)
  • Excel (.xls/.xlsx)
  • PowerPoint (.ppt/.pptx)
  • Email files (.eml, .msg)
  • PDFs

Parameters:
params : dict, optional

Configuration parameters, including:

  • content_type : str

    Override automatic file type detection.

  • store_content : bool

    Include raw file content in the output DataFrame.

  • timeout : int

    Timeout for fetching HTML content.

  • title_font_size : int

    Font size used to identify titles.

  • include_page_breaks : bool

    Tag content with page-break metadata.

  • group_broken_paragraphs : bool

    Merge broken lines into full paragraphs.

  • title_length_size : int

    Maximum character length for text to qualify as a title.

  • paragraph_split : str

    Regex used to detect paragraph boundaries.

  • short_line_word_threshold : int

    Maximum number of words for a line to be considered short.

  • threshold : float

    Ratio of empty lines that triggers a switch in the paragraph-grouping strategy.

  • max_line_count : int

    Maximum number of lines evaluated in paragraph analysis.

  • include_slide_notes : bool

    Include PowerPoint speaker notes in the output.

  • infer_table_structure : bool

    Generate HTML structure for extracted tables.

  • append_cells : bool

    Merge Excel rows into a single text block.

  • cell_separator : str

    String used to join cell values within a row.

  • add_attachment_content : bool

    Include the text of plain-text email attachments.

  • headers : dict

    Request headers used when fetching URLs.
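
Several of these options can be combined as keyword arguments. A minimal sketch, assuming Spark NLP is already started and the URL is reachable (store_content and timeout are parameters from the list above):

>>> html_df = Partition(store_content=True, timeout=30).partition("https://www.wikipedia.org")
>>> html_df.show()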

Examples

Reading Text Files

>>> txt_directory = "/content/txtfiles/reader/txt"
>>> partition_df = Partition(content_type="text/plain").partition(txt_directory)
>>> partition_df.show()
+--------------------+--------------------+
|                path|                 txt|
+--------------------+--------------------+
|file:/content/txt...|[{Title, BIG DATA...|
+--------------------+--------------------+

Reading Email Files

>>> partition_df = Partition().partition("./email-files/test-several-attachments.eml")
>>> partition_df.show()
+--------------------+--------------------+
|                path|               email|
+--------------------+--------------------+
|file:/content/ema...|[{Title, Test Sev...|
+--------------------+--------------------+

Reading Webpages

>>> partition_df = Partition().partition("https://www.wikipedia.com", headers = {"Accept-Language": "es-ES"})
>>> partition_df.show()
+--------------------+--------------------+
|                 url|                html|
+--------------------+--------------------+
|https://www.wikip...|[{Title, Wikipedi...|
+--------------------+--------------------+

For more examples, refer to: examples/python/data-preprocessing/SparkNLP_Partition_Reader_Demo.ipynb

spark[source]#
partition(path, headers=None)[source]#

Reads and parses content from a URL, file, or directory path.

Parameters:
path : str

Path to a file or directory. URLs and DFS (distributed file system) paths are supported.

headers : dict, optional

Headers for URL requests.

Returns:
pyspark.sql.DataFrame

DataFrame with parsed content.
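
Examples

A minimal sketch, assuming a local directory of Word documents (the directory path here is hypothetical; the file type is detected automatically):

>>> word_df = Partition().partition("./word-files")
>>> word_df.show()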

partition_urls(path, headers=None)[source]#

Reads and parses content from multiple URLs.

Parameters:
path : list[str]

List of URLs.

headers : dict, optional

Request headers for URLs.

Returns:
pyspark.sql.DataFrame

DataFrame with parsed URL content.

Examples

>>> urls_df = Partition().partition_urls([
...     "https://www.wikipedia.org", "https://example.com/"
... ])
>>> urls_df.show()
+--------------------+--------------------+
|                 url|                html|
+--------------------+--------------------+
|https://www.wikip...|[{Title, Wikipedi...|
|https://example.com/|[{Title, Example ...|
+--------------------+--------------------+
>>> urls_df.printSchema()
root
 |-- url: string (nullable = true)
 |-- html: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)

partition_text(text)[source]#

Parses content from a raw text string.

Parameters:
text : str

Raw text input.

Returns:
pyspark.sql.DataFrame

DataFrame with parsed text.

Examples

>>> raw_text = (
...     "The big brown fox\n"
...     "was walking down the lane.\n"
...     "\n"
...     "At the end of the lane,\n"
...     "the fox met a bear."
... )
>>> text_df = Partition(group_broken_paragraphs=True).partition_text(text=raw_text)
>>> text_df.show()
+--------------------------------------+
|txt                                   |
+--------------------------------------+
|[{NarrativeText, The big brown fox was|
+--------------------------------------+
>>> text_df.printSchema()
root
 |-- txt: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)