sparknlp.partition.partition#

Contains the Partition annotator for reading and processing various document types.

Module Contents#

Classes#

Partition

A unified interface for extracting structured content from various document types

class Partition(**kwargs)[source]#

A unified interface for extracting structured content from various document types using Spark NLP readers.

This class supports reading from files, URLs, in-memory strings, or byte arrays, and returns parsed output as a structured Spark DataFrame.

Supported formats include:

  • Plain text
  • HTML
  • Word (.doc/.docx)
  • Excel (.xls/.xlsx)
  • PowerPoint (.ppt/.pptx)
  • Email files (.eml, .msg)
  • PDFs

Parameters:
params : dict, optional

Configuration parameters, including:

  • content_type : str

    Override automatic file type detection.

  • store_content : bool

    Include raw file content in the output DataFrame.

  • timeout : int

    Timeout for fetching HTML content.

  • title_font_size : int

    Font size used to identify titles.

  • include_page_breaks : bool

    Tag content with page-break metadata.

  • group_broken_paragraphs : bool

    Merge broken lines into full paragraphs.

  • title_length_size : int

    Maximum character length for text to qualify as a title.

  • paragraph_split : str

    Regex used to detect paragraph boundaries.

  • short_line_word_threshold : int

    Maximum number of words for a line to be considered short.

  • threshold : float

    Ratio of empty lines that triggers a switch in the paragraph-grouping strategy.

  • max_line_count : int

    Maximum number of lines evaluated in paragraph analysis.

  • include_slide_notes : bool

    Include PowerPoint speaker notes in the output.

  • infer_table_structure : bool

    Generate HTML structure for extracted tables.

  • append_cells : bool

    Merge Excel rows into a single text block.

  • cell_separator : str

    String used to join cell values within a row.

  • add_attachment_content : bool

    Include the text of plain-text email attachments.

  • headers : dict

    Request headers used when fetching URLs.
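
Several of these options can be combined as keyword arguments. A minimal sketch, assuming Spark NLP is already started and the URL is reachable (store_content and timeout are parameters from the list above):

>>> html_df = Partition(store_content=True, timeout=30).partition("https://www.wikipedia.org")
>>> html_df.show()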

Examples

Reading Text Files

>>> txt_directory = "/content/txtfiles/reader/txt"
>>> partition_df = Partition(content_type="text/plain").partition(txt_directory)
>>> partition_df.show()
+--------------------+--------------------+
|                path|                 txt|
+--------------------+--------------------+
|file:/content/txt...|[{Title, BIG DATA...|
+--------------------+--------------------+

Reading Email Files

>>> partition_df = Partition().partition("./email-files/test-several-attachments.eml")
>>> partition_df.show()
+--------------------+--------------------+
|                path|               email|
+--------------------+--------------------+
|file:/content/ema...|[{Title, Test Sev...|
+--------------------+--------------------+

Reading Webpages

>>> partition_df = Partition().partition("https://www.wikipedia.com", headers = {"Accept-Language": "es-ES"})
>>> partition_df.show()
+--------------------+--------------------+
|                 url|                html|
+--------------------+--------------------+
|https://www.wikip...|[{Title, Wikipedi...|
+--------------------+--------------------+

For more examples, refer to: examples/python/data-preprocessing/SparkNLP_Partition_Reader_Demo.ipynb

spark[source]#
partition(path, headers=None)[source]#

Reads and parses content from a URL, file, or directory path.

Parameters:
path : str

Path to a file or directory. URLs and DFS (distributed file system) paths are supported.

headers : dict, optional

Headers for URL requests.

Returns:
pyspark.sql.DataFrame

DataFrame with parsed content.
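
Examples

A minimal sketch, assuming a local directory of Word documents (the directory path here is hypothetical; the file type is detected automatically):

>>> word_df = Partition().partition("./word-files")
>>> word_df.show()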

partition_urls(path, headers=None)[source]#

Reads and parses content from multiple URLs.

Parameters:
path : list[str]

List of URLs.

headers : dict, optional

Request headers for URLs.

Returns:
pyspark.sql.DataFrame

DataFrame with parsed URL content.

Examples

>>> urls_df = Partition().partition_urls([
...     "https://www.wikipedia.org", "https://example.com/"
... ])
>>> urls_df.show()
+--------------------+--------------------+
|                 url|                html|
+--------------------+--------------------+
|https://www.wikip...|[{Title, Wikipedi...|
|https://example.com/|[{Title, Example ...|
+--------------------+--------------------+
>>> urls_df.printSchema()
root
 |-- url: string (nullable = true)
 |-- html: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)

partition_text(text)[source]#

Parses content from a raw text string.

Parameters:
text : str

Raw text input.

Returns:
pyspark.sql.DataFrame

DataFrame with parsed text.

Examples

>>> raw_text = (
...     "The big brown fox\n"
...     "was walking down the lane.\n"
...     "\n"
...     "At the end of the lane,\n"
...     "the fox met a bear."
... )
>>> text_df = Partition(group_broken_paragraphs=True).partition_text(text=raw_text)
>>> text_df.show()
+--------------------------------------+
|txt                                   |
+--------------------------------------+
|[{NarrativeText, The big brown fox was|
+--------------------------------------+
>>> text_df.printSchema()
root
 |-- txt: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)