sparknlp.partition.partition_properties#

Contains classes for partition properties used in reading various document types.

Module Contents#

Classes#

HasReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasEmailReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasExcelReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasHTMLReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasPowerPointProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasTextReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasChunkerProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasPdfProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

class HasReaderProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.
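The "internal param map" can be pictured as a dictionary keyed by parameter name, where each setter writes one entry and returns the instance so that calls can be chained. The sketch below is a simplified stand-in written only for illustration: `ParamMapSketch` and its `_set` and `get` helpers are hypothetical names, and this is not the pyspark or Spark NLP implementation.

```python
class ParamMapSketch:
    """Hypothetical stand-in for a component that keeps its parameters in a map."""

    def __init__(self):
        # Parameter values attached to this instance, keyed by parameter name.
        self._param_map = {}

    def _set(self, **kwargs):
        # Store the given values and return self so setters can be chained.
        self._param_map.update(kwargs)
        return self

    def get(self, name, default=None):
        # Look up a stored value, falling back to a default when unset.
        return self._param_map.get(name, default)

    # Setters named after two of the setters documented for HasReaderProperties.
    def setContentType(self, value: str):
        return self._set(contentType=value)

    def setStoreContent(self, value: bool):
        return self._set(storeContent=value)


reader = ParamMapSketch().setContentType("text/html").setStoreContent(True)
print(reader.get("contentType"))
```

Each value lives on the instance rather than on the class, so two components configured differently do not affect each other.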

inputCol[source]#
outputCol[source]#
contentPath[source]#
contentType[source]#
storeContent[source]#
titleFontSize[source]#
inferTableStructure[source]#
includePageBreaks[source]#
ignoreExceptions[source]#
explodeDocs[source]#
flattenOutput[source]#
titleThreshold[source]#
outputAsDocument[source]#
setInputCol(value)[source]#

Sets input column name.

Parameters:
value : str

Name of the input column.

setOutputCol(value)[source]#

Sets output column name.

Parameters:
value : str

Name of the output column.

setContentPath(value: str)[source]#

Sets content path.

Parameters:
value : str

Path to the content source.

setContentType(value: str)[source]#

Sets content type following MIME specification.

Parameters:
value : str

Content type string (MIME format).
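A MIME content type is a string such as `text/html` or `application/pdf`. As a rough sketch, one way to derive such a string from a file name before passing it to `setContentType` is Python's standard `mimetypes` module. The helper name `guess_content_type` and the fallback value are assumptions made for this example; which content types a given reader accepts is not stated on this page.

```python
import mimetypes


def guess_content_type(path: str) -> str:
    """Guess a MIME type from a file name (hypothetical helper)."""
    mime, _encoding = mimetypes.guess_type(path)
    # Fall back to a generic binary type when the extension is unknown.
    return mime or "application/octet-stream"


print(guess_content_type("report.pdf"))
print(guess_content_type("index.html"))
```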

setStoreContent(value: bool)[source]#

Sets whether to store raw file content.

Parameters:
value : bool

True to include raw file content, False otherwise.

setTitleFontSize(value: int)[source]#

Sets minimum font size for detecting titles.

Parameters:
value : int

Minimum font size threshold for title detection.
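To make the idea of a minimum font size concrete, here is a minimal sketch of threshold-based title detection. It assumes text arrives as `(text, font_size)` pairs and that a run counts as a title when its size is at least the threshold. The function `tag_titles` and the labels `Title` and `NarrativeText` are illustrative choices, not the reader's actual logic or output.

```python
from typing import List, Tuple


def tag_titles(runs: List[Tuple[str, int]], title_font_size: int) -> List[Tuple[str, str]]:
    """Label each text run as a title or as body text (hypothetical helper)."""
    tagged = []
    for text, font_size in runs:
        # Assumption: sizes at or above the threshold are treated as titles.
        label = "Title" if font_size >= title_font_size else "NarrativeText"
        tagged.append((label, text))
    return tagged


runs = [("Quarterly Report", 18), ("Revenue grew in Q3.", 11)]
print(tag_titles(runs, title_font_size=16))
```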

setInferTableStructure(value: bool)[source]#

Sets whether to infer table structure.

Parameters:
value : bool

True to generate HTML table representation, False otherwise.

setIncludePageBreaks(value: bool)[source]#

Sets whether to include page break metadata.

Parameters:
value : bool

True to detect and tag page breaks, False otherwise.

setIgnoreExceptions(value: bool)[source]#

Sets whether to ignore exceptions during processing.

Parameters:
value : bool

True to ignore exceptions, False otherwise.

setExplodeDocs(value: bool)[source]#

Sets whether to explode the documents into separate rows.

Parameters:
value : bool

True to split documents into multiple rows, False to keep them in one row.

setFlattenOutput(value)[source]#

Sets whether to flatten the output to plain text with minimal metadata.

Parameters:
value : bool

If True, output is flattened to plain text with minimal metadata.

setTitleThreshold(value)[source]#

Sets the minimum font size threshold for title detection in PDF documents.

Parameters:
value : float

Minimum font size threshold for title detection in PDF documents.

setOutputAsDocument(value)[source]#

Sets whether to return all sentences joined into a single document.

Parameters:
value : bool

True to return all sentences joined into a single document, False otherwise.

class HasEmailReaderProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

addAttachmentContent[source]#
setAddAttachmentContent(value)[source]#

Sets whether to extract and include the textual content of plain-text attachments in the output.

Parameters:
value : bool

Whether to include text from plain-text attachments.

getAddAttachmentContent()[source]#

Gets whether to extract and include the textual content of plain-text attachments in the output.

Returns:
bool

Whether to include text from plain-text attachments.

class HasExcelReaderProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

cellSeparator[source]#
appendCells[source]#
setCellSeparator(value)[source]#

Sets the string used to join cell values in a row when assembling textual output.

Parameters:
value : str

Delimiter used to concatenate cell values.

getCellSeparator()[source]#

Gets the string used to join cell values in a row when assembling textual output.

Returns:
str

Delimiter used to concatenate cell values.

setAppendCells(value)[source]#

Sets whether to append all rows into a single content block.

Parameters:
value : bool

True to merge rows into one block, False for individual elements.

getAppendCells()[source]#

Gets whether to append all rows into a single content block.

Returns:
bool

True to merge rows into one block, False for individual elements.
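The interaction between the two Excel settings can be sketched in plain Python. This illustration assumes that each row's cells are joined with the separator and that, when rows are merged, the row strings are joined with newlines. The function `rows_to_text`, the tab default, and the newline join are assumptions made for the example, not the reader's documented behavior.

```python
from typing import Iterable, List


def rows_to_text(rows: Iterable[Iterable[object]],
                 cell_separator: str = "\t",
                 append_cells: bool = False) -> List[str]:
    """Turn spreadsheet rows into text elements (hypothetical helper)."""
    # Join the cell values of each row with the configured separator.
    lines = [cell_separator.join(str(cell) for cell in row) for row in rows]
    if append_cells:
        # Merge every row into a single content block.
        return ["\n".join(lines)]
    # Otherwise keep one element per row.
    return lines


sheet = [["name", "qty"], ["apple", 3]]
print(rows_to_text(sheet, cell_separator=";"))
print(rows_to_text(sheet, cell_separator=";", append_cells=True))
```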

class HasHTMLReaderProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

timeout[source]#
outputFormat[source]#
setTimeout(value)[source]#

Sets the timeout (in seconds) for reading remote HTML resources.

Parameters:
value : int

Timeout in seconds for remote content retrieval.

getTimeout()[source]#

Gets the timeout value for reading remote HTML resources.

Returns:
int

Timeout in seconds.

setHeaders(headers: Dict[str, str])[source]#
setOutputFormat(value: str)[source]#

Sets output format for the table content.

Parameters:
value : str

Output format for the table content.

class HasPowerPointProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

includeSlideNotes[source]#
setIncludeSlideNotes(value)[source]#

Sets whether to extract speaker notes from slides.

Parameters:
value : bool

If True, notes are included as narrative text elements.

getIncludeSlideNotes()[source]#

Gets whether to extract speaker notes from slides.

Returns:
bool

True if notes are included as narrative text elements.

class HasTextReaderProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

titleLengthSize[source]#
groupBrokenParagraphs[source]#
paragraphSplit[source]#
shortLineWordThreshold[source]#
maxLineCount[source]#
threshold[source]#
setTitleLengthSize(value)[source]#
getTitleLengthSize()[source]#
setGroupBrokenParagraphs(value)[source]#
getGroupBrokenParagraphs()[source]#
setParagraphSplit(value)[source]#
getParagraphSplit()[source]#
setShortLineWordThreshold(value)[source]#
getShortLineWordThreshold()[source]#
setMaxLineCount(value)[source]#
getMaxLineCount()[source]#
setThreshold(value)[source]#
getThreshold()[source]#
class HasChunkerProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

chunkingStrategy[source]#
maxCharacters[source]#
newAfterNChars[source]#
overlap[source]#
combineTextUnderNChars[source]#
overlapAll[source]#
setChunkingStrategy(value)[source]#
setMaxCharacters(value)[source]#
setNewAfterNChars(value)[source]#
setOverlap(value)[source]#
setCombineTextUnderNChars(value)[source]#
setOverlapAll(value)[source]#
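The chunker setters above carry no descriptions on this page, so the following is only a guess at what two of them suggest by name: a character budget per chunk (`maxCharacters`) and a number of characters shared between neighboring chunks (`overlap`). The function `chunk_text` is a hypothetical sliding-window sketch and should not be read as Spark NLP's chunking algorithm. The remaining parameters are not modeled.

```python
from typing import List


def chunk_text(text: str, max_characters: int, overlap: int = 0) -> List[str]:
    """Split text into fixed-size character windows (hypothetical helper)."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in [0, max_characters)")

    step = max_characters - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        # Stop once the current window already reaches the end of the text.
        if start + max_characters >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", max_characters=4, overlap=1))
```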
class HasPdfProperties[source]#

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

pageNumCol[source]#
originCol[source]#
partitionNum[source]#
storeSplittedPdf[source]#
splitPage[source]#
onlyPageNum[source]#
textStripper[source]#
sort[source]#
extractCoordinates[source]#
normalizeLigatures[source]#
readAsImage[source]#
setPageNumCol(value: str)[source]#

Sets page number output column name.

Parameters:
value : str

Name of the column for page numbers.

setOriginCol(value: str)[source]#

Sets input column with original file path.

Parameters:
value : str

Column name that stores the file path.

setPartitionNum(value: int)[source]#

Sets number of partitions.

Parameters:
value : int

Number of partitions to use.

setStoreSplittedPdf(value: bool)[source]#

Sets whether to store byte content of split PDF pages.

Parameters:
value : bool

True to store PDF page bytes, False otherwise.

setSplitPage(value: bool)[source]#

Sets whether to split PDF into pages.

Parameters:
value : bool

True to split per page, False otherwise.

setOnlyPageNum(value: bool)[source]#

Sets whether to extract only page numbers.

Parameters:
value : bool

True to extract only page numbers, False otherwise.

setTextStripper(value: str)[source]#

Sets text stripper type.

Parameters:
value : str

Text stripper type for layout and formatting.

setSort(value: bool)[source]#

Sets whether to sort content on the page.

Parameters:
value : bool

True to sort content, False otherwise.

setExtractCoordinates(value: bool)[source]#

Sets whether to extract coordinates of text.

Parameters:
value : bool

True to extract coordinates, False otherwise.

setNormalizeLigatures(value: bool)[source]#

Sets whether to normalize ligatures (e.g., the single glyph ﬂ → f + l).

Parameters:
value : bool

True to normalize ligatures, False otherwise.
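Ligature normalization replaces a single typographic glyph with the letters it stands for, which matters because text extracted from PDFs often contains such glyphs. The sketch below shows the idea with an explicit mapping of the common Latin f-ligatures. The table and the function `normalize_ligatures` are assumptions for illustration, and the set of ligatures the reader actually handles may differ.

```python
# Mapping from single ligature code points to their component letters.
LIGATURES = str.maketrans({
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
})


def normalize_ligatures(text: str) -> str:
    """Expand ligature glyphs into plain letters (hypothetical helper)."""
    return text.translate(LIGATURES)


# "\ufb02oor" begins with the single fl glyph; the result is plain ASCII.
print(normalize_ligatures("\ufb02oor \ufb01le"))
```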

setReadAsImage(value: bool)[source]#

Sets whether to read PDF pages as images.

Parameters:
value : bool

True to read as images, False otherwise.