sparknlp.partition.partition_properties#

Contains classes for partition properties used in reading various document types.

Module Contents#

Classes#

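HasReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasEmailReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasExcelReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasHTMLReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasPowerPointProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasTextReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasChunkerProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasPdfProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.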

class HasReaderProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

inputCol[source]#
outputCol[source]#
contentPath[source]#
contentType[source]#
storeContent[source]#
titleFontSize[source]#
inferTableStructure[source]#
includePageBreaks[source]#
ignoreExceptions[source]#
explodeDocs[source]#
flattenOutput[source]#
titleThreshold[source]#
outputAsDocument[source]#
setInputCol(value)[source]#

Sets input column name.

Parameters:
value : str

Name of the input column.

setOutputCol(value)[source]#

Sets output column name.

Parameters:
value : str

Name of the output column.

setContentPath(value: str)[source]#

Sets content path.

Parameters:
value : str

Path to the content source.

setContentType(value: str)[source]#

Sets content type following MIME specification.

Parameters:
value : str

Content type string (MIME format).
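As a rough illustration of what a MIME-format content type string looks like, the sketch below derives one from a file name using Python's standard `mimetypes` module. The helper name `guess_content_type` and the fallback value are assumptions made for this example; they are not part of sparknlp, and the set of content types the readers actually accept is not stated here.

```python
import mimetypes

# A private MimeTypes instance starts from Python's built-in extension table.
_MIME_TABLE = mimetypes.MimeTypes()


def guess_content_type(path, default="application/octet-stream"):
    """Return a MIME type string for a file path, or a default if unknown.

    Hypothetical helper for illustration only; not a sparknlp function.
    """
    content_type, _encoding = _MIME_TABLE.guess_type(path)
    return content_type or default


print(guess_content_type("report.html"))
print(guess_content_type("notes.txt"))
```

The resulting string (for example `text/html`) is the kind of value that setContentType expects.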

setStoreContent(value: bool)[source]#

Sets whether to store raw file content.

Parameters:
value : bool

True to include raw file content, False otherwise.

setTitleFontSize(value: int)[source]#

Sets minimum font size for detecting titles.

Parameters:
value : int

Minimum font size threshold for title detection.

setInferTableStructure(value: bool)[source]#

Sets whether to infer table structure.

Parameters:
value : bool

True to generate HTML table representation, False otherwise.

setIncludePageBreaks(value: bool)[source]#

Sets whether to include page break metadata.

Parameters:
value : bool

True to detect and tag page breaks, False otherwise.

setIgnoreExceptions(value: bool)[source]#

Sets whether to ignore exceptions during processing.

Parameters:
value : bool

True to ignore exceptions, False otherwise.

setExplodeDocs(value: bool)[source]#

Sets whether to explode the documents into separate rows.

Parameters:
value : bool

True to split documents into multiple rows, False to keep them in one row.
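The difference between one row per file and one row per extracted element can be pictured with plain Python tuples. This is only a conceptual sketch: the real output is a Spark DataFrame, and the function name `explode_rows` and the sample data are invented for the illustration.

```python
def explode_rows(rows):
    """Turn each (path, [elements]) row into one (path, element) row per element.

    Conceptual stand-in for the explodeDocs behaviour; not sparknlp code.
    """
    return [(path, element) for path, elements in rows for element in elements]


# One row per file, each holding a list of extracted elements.
rows = [
    ("a.txt", ["Title", "First paragraph."]),
    ("b.txt", ["Only paragraph."]),
]

# One row per extracted element.
exploded = explode_rows(rows)
for row in exploded:
    print(row)
```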

setFlattenOutput(value)[source]#

Sets whether to flatten the output to plain text with minimal metadata.

Parameters:
value : bool

If True, output is flattened to plain text with minimal metadata.

setTitleThreshold(value)[source]#

Sets the minimum font size threshold for title detection in PDF documents.

Parameters:
value : float

Minimum font size threshold for title detection in PDF docs

setOutputAsDocument(value)[source]#

Sets whether to return all sentences joined into a single document.

Parameters:
value : bool

Whether to return all sentences joined into a single document
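The pattern this class describes, a param map stored on the instance plus one setter per parameter, can be sketched in a few lines of plain Python. `ParamHolderSketch` is a hypothetical stand-in written for this page; it does not use Spark or sparknlp, and returning `self` from each setter (so calls can be chained) is an assumption about the style rather than something stated above for every method.

```python
class ParamHolderSketch:
    """Hypothetical stand-in for a parameter-holding mixin; not the sparknlp class."""

    def __init__(self):
        # Parameter values attached to this instance.
        self._param_map = {}

    def _set(self, **kwargs):
        # Store the values and return self so setters can be chained.
        self._param_map.update(kwargs)
        return self

    def setContentType(self, value):
        return self._set(contentType=value)

    def setExplodeDocs(self, value):
        return self._set(explodeDocs=bool(value))

    def getOrDefault(self, name, default=None):
        return self._param_map.get(name, default)


reader = ParamHolderSketch().setContentType("text/html").setExplodeDocs(True)
print(reader.getOrDefault("contentType"))
```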

class HasEmailReaderProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

addAttachmentContent[source]#
setAddAttachmentContent(value)[source]#

Sets whether to extract and include the textual content of plain-text attachments in the output.

Parameters:
value : bool

Whether to include text from plain-text attachments.

getAddAttachmentContent()[source]#

Gets whether to extract and include the textual content of plain-text attachments in the output.

Returns:
bool

Whether to include text from plain-text attachments.

class HasExcelReaderProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

cellSeparator[source]#
appendCells[source]#
setCellSeparator(value)[source]#

Sets the string used to join cell values in a row when assembling textual output.

Parameters:
value : str

Delimiter used to concatenate cell values.

getCellSeparator()[source]#

Gets the string used to join cell values in a row when assembling textual output.

Returns:
str

Delimiter used to concatenate cell values.

setAppendCells(value)[source]#

Sets whether to append all rows into a single content block.

Parameters:
value : bool

True to merge rows into one block, False for individual elements.

getAppendCells()[source]#

Gets whether to append all rows into a single content block.

Returns:
bool

True to merge rows into one block, False for individual elements.
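To make the interaction between cellSeparator and appendCells concrete, here is a small plain-Python sketch. The function `rows_to_text`, the tab default, and the use of a newline to join rows in the merged block are all assumptions for illustration; the library's defaults and exact output layout are not documented in this section.

```python
def rows_to_text(rows, cell_separator="\t", append_cells=False):
    """Join the cells of each row; optionally merge all rows into one block.

    Illustrative only. The newline used to join rows is an assumption.
    """
    lines = [cell_separator.join(str(cell) for cell in row) for row in rows]
    if append_cells:
        # One content block holding every row.
        return ["\n".join(lines)]
    # One element per row.
    return lines


sheet = [["name", "qty"], ["apple", 3]]
print(rows_to_text(sheet, cell_separator=", "))
print(rows_to_text(sheet, cell_separator=", ", append_cells=True))
```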

class HasHTMLReaderProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

timeout[source]#
outputFormat[source]#
setTimeout(value)[source]#

Sets the timeout (in seconds) for reading remote HTML resources.

Parameters:
value : int

Timeout in seconds for remote content retrieval.

getTimeout()[source]#

Gets the timeout value for reading remote HTML resources.

Returns:
int

Timeout in seconds.

setHeaders(headers: Dict[str, str])[source]#
setOutputFormat(value: str)[source]#

Sets output format for the table content.

Parameters:
value : str

Output format for the table content.

class HasPowerPointProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

includeSlideNotes[source]#
setIncludeSlideNotes(value)[source]#

Sets whether to extract speaker notes from slides.

Parameters:
value : bool

If True, notes are included as narrative text elements.

getIncludeSlideNotes()[source]#

Gets whether to extract speaker notes from slides.

Returns:
bool

True if notes are included as narrative text elements.

class HasTextReaderProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

titleLengthSize[source]#
groupBrokenParagraphs[source]#
paragraphSplit[source]#
shortLineWordThreshold[source]#
maxLineCount[source]#
threshold[source]#
extractTagAttributes[source]#
setTitleLengthSize(value)[source]#

Set the maximum character length used to identify title blocks.

Parameters:
value : int

Maximum number of characters a text block can have to be considered a title.

Returns:
self

The instance with updated titleLengthSize parameter.

getTitleLengthSize()[source]#

Get the configured maximum title length.

Returns:
int

The maximum character length used to detect title blocks.

setGroupBrokenParagraphs(value)[source]#

Enable or disable grouping of broken paragraphs.

Parameters:
value : bool

True to merge fragmented lines into paragraphs, False to leave lines as-is.

Returns:
self

The instance with updated groupBrokenParagraphs parameter.

getGroupBrokenParagraphs()[source]#

Get whether broken paragraph grouping is enabled.

Returns:
bool

True if grouping of broken paragraphs is enabled, False otherwise.

setParagraphSplit(value)[source]#

Set the regex pattern used to split paragraphs when grouping broken paragraphs.

Parameters:
value : str

Regular expression string used to detect paragraph boundaries.

Returns:
self

The instance with updated paragraphSplit parameter.

getParagraphSplit()[source]#

Get the paragraph-splitting regex pattern.

Returns:
str

The regex pattern used to detect paragraph boundaries.
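A minimal sketch of how a paragraph-splitting regex can be used to regroup hard-wrapped lines is shown below. The function name and the default pattern (a blank line, possibly containing whitespace) are assumptions for this example; the library's actual default pattern and grouping logic may differ.

```python
import re


def group_broken_paragraphs(text, paragraph_split=r"\n\s*\n"):
    """Split text at paragraph boundaries and rejoin the wrapped lines in each.

    Sketch only; the default pattern here is assumed, not taken from sparknlp.
    """
    paragraphs = []
    for block in re.split(paragraph_split, text.strip()):
        # Drop empty lines and merge the remaining fragments with single spaces.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return paragraphs


sample = "The quick brown\nfox jumps.\n\nSecond\nparagraph."
print(group_broken_paragraphs(sample))
```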

setShortLineWordThreshold(value)[source]#

Set the maximum word count for a line to be considered short.

Parameters:
value : int

Number of words under which a line is considered ‘short’.

Returns:
self

The instance with updated shortLineWordThreshold parameter.

getShortLineWordThreshold()[source]#

Get the short line word threshold.

Returns:
int

Word count threshold for short lines used in paragraph grouping.

setMaxLineCount(value)[source]#

Set the maximum number of lines to inspect when estimating paragraph layout.

Parameters:
value : int

Maximum number of lines to evaluate for layout heuristics.

Returns:
self

The instance with updated maxLineCount parameter.

getMaxLineCount()[source]#

Get the maximum number of lines used for layout heuristics.

Returns:
int

The configured maximum number of lines to consider.

setThreshold(value)[source]#

Set the empty-line ratio threshold used to decide how paragraphs are grouped.

Parameters:
value : float

Ratio (0.0-1.0) of empty lines used to switch grouping strategies.

Returns:
self

The instance with updated threshold parameter.

getThreshold()[source]#

Get the configured empty-line threshold ratio.

Returns:
float

The ratio used to decide paragraph grouping strategy.
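The empty-line ratio can be computed as sketched below, looking only at the first `max_line_count` lines. The function names and default values are assumptions, and so is the direction of the comparison: this section does not say which grouping strategy applies on which side of the threshold.

```python
def empty_line_ratio(text, max_line_count=2000):
    """Fraction of blank lines among the first max_line_count lines."""
    lines = text.splitlines()[:max_line_count]
    if not lines:
        return 0.0
    empty = sum(1 for line in lines if not line.strip())
    return empty / len(lines)


def is_below_threshold(text, threshold=0.1, max_line_count=2000):
    """True when the empty-line ratio falls below the threshold.

    What the library does in each case is not specified here; this only
    shows the comparison itself.
    """
    return empty_line_ratio(text, max_line_count) < threshold


print(empty_line_ratio("a\n\nb\n\n"))
print(is_below_threshold("a\nb\nc"))
```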

setExtractTagAttributes(attributes: list[str])[source]#

Specify which tag attributes should have their values extracted as text when parsing tag-based formats (e.g., HTML or XML).

Parameters:
attributes : list[str]

List of attribute names to extract.

Returns:
self

The instance with updated extractTagAttributes parameter.

getExtractTagAttributes()[source]#

Get the list of tag attribute names configured to be extracted.

Returns:
list[str]

The attribute names whose values will be extracted as text.

class HasChunkerProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

chunkingStrategy[source]#
maxCharacters[source]#
newAfterNChars[source]#
overlap[source]#
combineTextUnderNChars[source]#
overlapAll[source]#
setChunkingStrategy(value)[source]#
setMaxCharacters(value)[source]#
setNewAfterNChars(value)[source]#
setOverlap(value)[source]#
setCombineTextUnderNChars(value)[source]#
setOverlapAll(value)[source]#
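The chunker setters above carry no descriptions, so the following is only a guess at the general idea behind a character budget with overlap. It borrows the names `maxCharacters` and `overlap` loosely; the function `chunk_by_characters`, its defaults, and its behaviour are assumptions, and the library's chunking strategies may work quite differently (for example by respecting element or title boundaries).

```python
def chunk_by_characters(text, max_characters=100, overlap=0):
    """Split text into windows of at most max_characters characters.

    Consecutive windows share `overlap` characters. Hypothetical sketch,
    not the sparknlp chunking algorithm.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in the range [0, max_characters)")
    chunks = []
    start = 0
    while start < len(text):
        end = start + max_characters
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by `overlap` so the next window repeats the tail of this one.
        start = end - overlap
    return chunks


print(chunk_by_characters("abcdefghij", max_characters=4, overlap=1))
```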
class HasPdfProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

pageNumCol[source]#
originCol[source]#
partitionNum[source]#
storeSplittedPdf[source]#
splitPage[source]#
onlyPageNum[source]#
textStripper[source]#
sort[source]#
extractCoordinates[source]#
normalizeLigatures[source]#
readAsImage[source]#
setPageNumCol(value: str)[source]#

Sets page number output column name.

Parameters:
value : str

Name of the column for page numbers.

setOriginCol(value: str)[source]#

Sets the input column that holds the original file path.

Parameters:
value : str

Column name that stores the file path.

setPartitionNum(value: int)[source]#

Sets number of partitions.

Parameters:
value : int

Number of partitions to use.

setStoreSplittedPdf(value: bool)[source]#

Sets whether to store byte content of split PDF pages.

Parameters:
value : bool

True to store PDF page bytes, False otherwise.

setSplitPage(value: bool)[source]#

Sets whether to split PDF into pages.

Parameters:
value : bool

True to split per page, False otherwise.

setOnlyPageNum(value: bool)[source]#

Sets whether to extract only page numbers.

Parameters:
value : bool

True to extract only page numbers, False otherwise.

setTextStripper(value: str)[source]#

Sets text stripper type.

Parameters:
value : str

Text stripper type for layout and formatting.

setSort(value: bool)[source]#

Sets whether to sort content on the page.

Parameters:
value : bool

True to sort content, False otherwise.

setExtractCoordinates(value: bool)[source]#

Sets whether to extract coordinates of text.

Parameters:
value : bool

True to extract coordinates, False otherwise.

setNormalizeLigatures(value: bool)[source]#

Sets whether to normalize ligatures (e.g., ﬂ → f + l).

Parameters:
value : bool

True to normalize ligatures, False otherwise.
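Ligature normalization can be illustrated with Python's standard `unicodedata` module, which expands single-glyph ligatures into their component letters under NFKC normalization. This is a sketch of the general idea, not the library's implementation: NFKC also rewrites other compatibility characters, and the PDF reader may use a narrower mapping. The helper name `normalize_ligatures` is invented for this example.

```python
import unicodedata


def normalize_ligatures(text):
    """Expand ligature glyphs such as U+FB01 and U+FB02 into plain letters.

    Uses NFKC normalization, which is broader than a ligature-only mapping.
    """
    return unicodedata.normalize("NFKC", text)


# "\ufb02" is the fl ligature glyph, "\ufb01" is the fi ligature glyph.
print(normalize_ligatures("\ufb02ow"))   # flow
print(normalize_ligatures("\ufb01nal"))  # final
```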

setReadAsImage(value: bool)[source]#

Sets whether to read PDF pages as images.

Parameters:
value : bool

True to read as images, False otherwise.