sparknlp.partition.partition_properties

Contains classes for partition properties used in reading various document types.

Module Contents

Classes

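HasEmailReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasExcelReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasHTMLReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasPowerPointProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasTextReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

HasChunkerProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.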

class HasEmailReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

addAttachmentContent

setAddAttachmentContent(value)

Sets whether to extract and include the textual content of plain-text attachments in the output.

Parameters:
value : bool

Whether to include text from plain-text attachments.

getAddAttachmentContent()

Gets whether to extract and include the textual content of plain-text attachments in the output.

Returns:
bool

Whether text from plain-text attachments is included.
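The "internal param map" and the setter/getter pair described above can be pictured with a small plain-Python sketch. This is not the Spark NLP or PySpark implementation: the class names `ParamMapSketch` and `EmailReaderPropertiesSketch`, the chaining behaviour, and the default of False are all assumptions made only for illustration.

```python
class ParamMapSketch:
    """Hypothetical stand-in for a component with an internal param map."""

    def __init__(self):
        # Parameter values are stored per instance, keyed by parameter name.
        self._param_map = {}

    def _set(self, **kwargs):
        self._param_map.update(kwargs)
        return self  # assumption: setters return self so calls can be chained

    def _get(self, name, default=None):
        return self._param_map.get(name, default)


class EmailReaderPropertiesSketch(ParamMapSketch):
    """Hypothetical mirror of the setter/getter pair documented above."""

    def setAddAttachmentContent(self, value):
        return self._set(addAttachmentContent=bool(value))

    def getAddAttachmentContent(self):
        # The default of False is an assumption of this sketch.
        return self._get("addAttachmentContent", False)


reader = EmailReaderPropertiesSketch().setAddAttachmentContent(True)
print(reader.getAddAttachmentContent())
```

The point of the pattern is that each instance carries its own parameter values, so two readers configured differently do not affect each other.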

class HasExcelReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

cellSeparator

appendCells

setCellSeparator(value)

Sets the string used to join cell values in a row when assembling textual output.

Parameters:
value : str

Delimiter used to concatenate cell values.

getCellSeparator()

Gets the string used to join cell values in a row when assembling textual output.

Returns:
str

Delimiter used to concatenate cell values.

setAppendCells(value)

Sets whether to append all rows into a single content block.

Parameters:
value : bool

True to merge rows into one block, False to emit each row as an individual element.

getAppendCells()

Gets whether to append all rows into a single content block.

Returns:
bool

True if rows are merged into one block, False if each row is an individual element.
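To make the interaction between cellSeparator and appendCells concrete, here is a minimal plain-Python sketch of the behaviour the descriptions suggest. The function `rows_to_elements`, its default values, and the newline used to join merged rows are assumptions for illustration, not the library's reader.

```python
def rows_to_elements(rows, cell_separator="\t", append_cells=False):
    """Join cells per row; optionally merge all rows into one block.

    Hypothetical helper: defaults and the row joiner are assumptions.
    """
    # Each row becomes one string, with cells joined by the separator.
    lines = [cell_separator.join(str(cell) for cell in row) for row in rows]
    if append_cells:
        # One content block containing every row.
        return ["\n".join(lines)]
    # One element per row.
    return lines


rows = [["name", "qty"], ["apple", 3]]
print(rows_to_elements(rows, cell_separator=", "))
print(rows_to_elements(rows, cell_separator=", ", append_cells=True))
```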

class HasHTMLReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

timeout

setTimeout(value)

Sets the timeout (in seconds) for reading remote HTML resources.

Parameters:
value : int

Timeout in seconds for remote content retrieval.

getTimeout()

Gets the timeout (in seconds) for reading remote HTML resources.

Returns:
int

Timeout in seconds.

setHeaders(headers: Dict[str, str])

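setHeaders is listed without a description. Assuming it supplies HTTP request headers for remote HTML retrieval (an inference from the name and the Dict[str, str] signature, not something stated here), the roles of a timeout and a header dictionary can be sketched with the standard library. The helpers `build_request` and `fetch_html`, the default of 30 seconds, and the example URL are hypothetical.

```python
import urllib.request


def build_request(url, headers=None):
    """Attach custom headers to a request object (no network access)."""
    return urllib.request.Request(url, headers=headers or {})


def fetch_html(url, timeout=30, headers=None):
    """Fetch a page, giving up after `timeout` seconds.

    Hypothetical helper: the default of 30 is an assumption of this sketch.
    """
    request = build_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Only builds the request; nothing is fetched here.
request = build_request("https://example.com", {"User-Agent": "my-crawler/1.0"})
print(request.header_items())
```

Note that `urllib.request.Request` normalises header names with `str.capitalize()`, so `User-Agent` is stored as `User-agent`.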
class HasPowerPointProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

includeSlideNotes

setIncludeSlideNotes(value)

Sets whether to extract speaker notes from slides.

Parameters:
value : bool

If True, notes are included as narrative text elements.

getIncludeSlideNotes()

Gets whether to extract speaker notes from slides.

Returns:
bool

True if notes are included as narrative text elements.
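The effect of includeSlideNotes can be pictured with a small sketch. The slide dictionaries, the function `slide_elements`, and the element type labels "Title" and "NarrativeText" are assumptions for illustration; the real reader's output schema is not described on this page.

```python
def slide_elements(slides, include_slide_notes=False):
    """Emit (element_type, text) pairs for each slide.

    Hypothetical helper: the input shape and type labels are assumptions.
    """
    elements = []
    for slide in slides:
        elements.append(("Title", slide["title"]))
        # Speaker notes are emitted only when the flag is on and notes exist.
        if include_slide_notes and slide.get("notes"):
            elements.append(("NarrativeText", slide["notes"]))
    return elements


slides = [
    {"title": "Roadmap", "notes": "Mention the Q3 slip."},
    {"title": "Q&A"},
]
print(slide_elements(slides))
print(slide_elements(slides, include_slide_notes=True))
```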

class HasTextReaderProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

titleLengthSize
groupBrokenParagraphs
paragraphSplit
shortLineWordThreshold
maxLineCount
threshold

setTitleLengthSize(value)
getTitleLengthSize()
setGroupBrokenParagraphs(value)
getGroupBrokenParagraphs()
setParagraphSplit(value)
getParagraphSplit()
setShortLineWordThreshold(value)
getShortLineWordThreshold()
setMaxLineCount(value)
getMaxLineCount()
setThreshold(value)
getThreshold()

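These text reader parameters are listed without descriptions. Going only by their names, groupBrokenParagraphs and paragraphSplit suggest re-joining lines that were hard-wrapped in the middle of a paragraph. The sketch below illustrates that idea under stated assumptions: the function `group_broken_paragraphs`, the blank-line regex used as the default split pattern, and the space used to join lines are all hypothetical, and the remaining parameters (titleLengthSize, shortLineWordThreshold, maxLineCount, threshold) are not modelled.

```python
import re


def group_broken_paragraphs(text, paragraph_split=r"\n\s*\n"):
    """Split on a paragraph-boundary regex, then re-join wrapped lines.

    Hypothetical helper: the default pattern (a blank line) is an assumption.
    """
    paragraphs = re.split(paragraph_split, text.strip())
    grouped = []
    for paragraph in paragraphs:
        # Drop empty lines and join the remaining ones with single spaces.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        if lines:
            grouped.append(" ".join(lines))
    return grouped


text = "The quick brown\nfox jumps.\n\nSecond\nparagraph."
print(group_broken_paragraphs(text))
```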
class HasChunkerProperties

Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.

New in version 1.3.0.

chunkingStrategy
maxCharacters
newAfterNChars
overlap
combineTextUnderNChars
overlapAll

setChunkingStrategy(value)
setMaxCharacters(value)
setNewAfterNChars(value)
setOverlap(value)
setCombineTextUnderNChars(value)
setOverlapAll(value)
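The chunker parameters carry no descriptions here. As a rough illustration of what a character limit and an overlap typically mean in chunking, the sketch below implements a plain sliding window over characters. It is a generic technique, not the library's chunking strategy: the function `chunk_by_characters` and its defaults are hypothetical, and newAfterNChars, combineTextUnderNChars, overlapAll, and chunkingStrategy are not modelled.

```python
def chunk_by_characters(text, max_characters=100, overlap=0):
    """Cut text into windows of at most max_characters, sharing `overlap`
    characters between consecutive windows.

    Hypothetical helper: a generic sliding window, not Spark NLP's chunker.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must satisfy 0 <= overlap < max_characters")

    step = max_characters - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        # Stop once a window has reached the end of the text.
        if start + max_characters >= len(text):
            break
    return chunks


print(chunk_by_characters("abcdefghij", max_characters=4, overlap=1))
```

With an overlap of 1, each chunk begins with the last character of the previous one, which is the usual way to preserve some context across chunk boundaries.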