sparknlp.partition.partition_properties
Contains classes for partition properties used in reading various document types.
Module Contents
Classes
HasReaderProperties | Components that take parameters.
HasEmailReaderProperties | Components that take parameters.
HasExcelReaderProperties | Components that take parameters.
HasHTMLReaderProperties | Components that take parameters.
HasPowerPointProperties | Components that take parameters.
HasTextReaderProperties | Components that take parameters.
HasChunkerProperties | Components that take parameters.
HasPdfProperties | Components that take parameters.
- class HasReaderProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- setContentPath(value: str)
Sets content path.
- Parameters:
- value : str
Path to the content source.
- setContentType(value: str)
Sets content type following MIME specification.
- Parameters:
- value : str
Content type string (MIME format).
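A content type here is a MIME string such as `text/html` or `application/pdf`. As a minimal sketch that does not touch Spark NLP at all, Python's standard `mimetypes` module can guess such a string from a file name. The helper name `guess_content_type` and the file names are made up for illustration; how the library itself resolves content types is not described on this page.

```python
import mimetypes


def guess_content_type(path: str) -> str:
    """Guess a MIME content type from a file name (hypothetical helper)."""
    mime, _encoding = mimetypes.guess_type(path)
    # Fall back to a generic binary type when the extension is unknown.
    return mime or "application/octet-stream"


html_type = guess_content_type("page.html")
pdf_type = guess_content_type("report.pdf")
unknown_type = guess_content_type("data.unknownext123")
```

The resulting string is the kind of value one would expect to pass to `setContentType`.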
- setStoreContent(value: bool)
Sets whether to store raw file content.
- Parameters:
- value : bool
True to include raw file content, False otherwise.
- setTitleFontSize(value: int)
Sets minimum font size for detecting titles.
- Parameters:
- value : int
Minimum font size threshold for title detection.
- setInferTableStructure(value: bool)
Sets whether to infer table structure.
- Parameters:
- value : bool
True to generate HTML table representation, False otherwise.
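To make "HTML table representation" concrete, the sketch below renders rows of cell text as a bare HTML table in plain Python. It is only an illustration of the idea: the function `rows_to_html` is hypothetical, and the exact markup the reader produces when this option is enabled is not documented on this page.

```python
from html import escape


def rows_to_html(rows: list[list[str]]) -> str:
    """Render rows of cell text as a bare HTML table (hypothetical helper)."""
    body = "".join(
        # Escape each cell so characters like '&' or '<' stay valid HTML.
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


table_html = rows_to_html([["Name", "Qty"], ["Bolt & nut", "4"]])
```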
- setIncludePageBreaks(value: bool)
Sets whether to include page break metadata.
- Parameters:
- value : bool
True to detect and tag page breaks, False otherwise.
- setIgnoreExceptions(value: bool)
Sets whether to ignore exceptions during processing.
- Parameters:
- value : bool
True to ignore exceptions, False otherwise.
- setExplodeDocs(value: bool)
Sets whether to explode the documents into separate rows.
- Parameters:
- value : bool
True to split documents into multiple rows, False to keep them in one row.
- setFlattenOutput(value)
Sets whether to flatten the output to plain text with minimal metadata.
- Parameters:
- value : bool
If True, the output is flattened to plain text with minimal metadata.
- class HasEmailReaderProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- class HasExcelReaderProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- setCellSeparator(value)
Sets the string used to join cell values in a row when assembling textual output.
- Parameters:
- value : str
Delimiter used to concatenate cell values.
- getCellSeparator()
Gets the string used to join cell values in a row when assembling textual output.
- Returns:
- str
Delimiter used to concatenate cell values.
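The effect of a cell separator can be pictured with a small plain-Python sketch that joins one row's cell values into a line of text. The helper `join_row` is hypothetical, and rendering empty cells as empty strings is an assumption made for the example, not documented library behavior.

```python
def join_row(cells: list[object], separator: str) -> str:
    """Join one row's cell values into a single line (hypothetical helper)."""
    # Assumption: empty cells (None) become empty strings.
    return separator.join("" if cell is None else str(cell) for cell in cells)


row = ["Widget", 3, None, 9.5]
tab_joined = join_row(row, "\t")
pipe_joined = join_row(row, " | ")
```

A tab keeps columns machine-splittable, while a visible delimiter such as ` | ` is easier to read in plain-text output.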
- class HasHTMLReaderProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- setTimeout(value)
Sets the timeout (in seconds) for reading remote HTML resources.
- Parameters:
- value : int
Timeout in seconds for remote content retrieval.
- class HasPowerPointProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- class HasTextReaderProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- setTitleLengthSize(value)
Set the maximum character length used to identify title blocks.
- Parameters:
- value : int
Maximum number of characters a text block can have to be considered a title.
- Returns:
- self
The instance with updated titleLengthSize parameter.
- getTitleLengthSize()
Get the configured maximum title length.
- Returns:
- int
The maximum character length used to detect title blocks.
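A length-based title check can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's logic: the function `is_title_candidate` is hypothetical, the bound is taken as inclusive, and any other criteria the reader may apply to titles are ignored.

```python
def is_title_candidate(block: str, title_length_size: int) -> bool:
    """Return True if a non-empty block is short enough to count as a title.

    Hypothetical helper; the inclusive bound is an assumption.
    """
    text = block.strip()
    return 0 < len(text) <= title_length_size


blocks = [
    "Quarterly Report",
    "This paragraph is much longer than a heading would normally be.",
]
flags = [is_title_candidate(block, 20) for block in blocks]
```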
- setGroupBrokenParagraphs(value)
Enable or disable grouping of broken paragraphs.
- Parameters:
- value : bool
True to merge fragmented lines into paragraphs, False to leave lines as-is.
- Returns:
- self
The instance with updated groupBrokenParagraphs parameter.
- getGroupBrokenParagraphs()
Get whether broken paragraph grouping is enabled.
- Returns:
- bool
True if grouping of broken paragraphs is enabled, False otherwise.
- setParagraphSplit(value)
Set the regex pattern used to split paragraphs when grouping broken paragraphs.
- Parameters:
- value : str
Regular expression string used to detect paragraph boundaries.
- Returns:
- self
The instance with updated paragraphSplit parameter.
- getParagraphSplit()
Get the paragraph-splitting regex pattern.
- Returns:
- str
The regex pattern used to detect paragraph boundaries.
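As a rough illustration of regex-based paragraph splitting, the sketch below uses Python's `re.split` with a blank-line pattern and then rejoins the broken lines inside each paragraph. The pattern `\n\s*\n` and the helper `split_paragraphs` are assumptions chosen for the example; the library's default pattern is not stated on this page.

```python
import re


def split_paragraphs(text: str, paragraph_split: str) -> list[str]:
    """Split text on a boundary regex and rejoin wrapped lines (hypothetical)."""
    parts = re.split(paragraph_split, text)
    # Collapse internal line breaks and runs of whitespace in each paragraph.
    return [" ".join(part.split()) for part in parts if part.strip()]


text = "First line of one\nparagraph.\n\nSecond paragraph.\n \nThird."
paragraphs = split_paragraphs(text, r"\n\s*\n")
```

Here a single newline inside a paragraph is treated as a soft wrap, while one or more blank lines mark a boundary.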
- setShortLineWordThreshold(value)
Set the maximum word count for a line to be considered short.
- Parameters:
- value : int
Number of words under which a line is considered ‘short’.
- Returns:
- self
The instance with updated shortLineWordThreshold parameter.
- getShortLineWordThreshold()
Get the short line word threshold.
- Returns:
- int
Word count threshold for short lines used in paragraph grouping.
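The short-line test described above amounts to a word count compared against a threshold. The sketch below shows that comparison in plain Python; `is_short_line` is a hypothetical helper, and splitting on whitespace is an assumption about how words are counted.

```python
def is_short_line(line: str, short_line_word_threshold: int) -> bool:
    """Return True if the line has fewer words than the threshold (hypothetical)."""
    # Assumption: words are whitespace-separated tokens.
    return len(line.split()) < short_line_word_threshold


lines = ["Dear team,", "Please find the quarterly figures attached below."]
short_flags = [is_short_line(line, 5) for line in lines]
```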
- setMaxLineCount(value)
Set the maximum number of lines to inspect when estimating paragraph layout.
- Parameters:
- value : int
Maximum number of lines to evaluate for layout heuristics.
- Returns:
- self
The instance with updated maxLineCount parameter.
- getMaxLineCount()
Get the maximum number of lines used for layout heuristics.
- Returns:
- int
The configured maximum number of lines to consider.
- setThreshold(value)
Set the empty-line ratio threshold for the paragraph grouping decision.
- Parameters:
- value : float
Ratio (0.0-1.0) of empty lines used to switch grouping strategies.
- Returns:
- self
The instance with updated threshold parameter.
- getThreshold()
Get the configured empty-line threshold ratio.
- Returns:
- float
The ratio used to decide paragraph grouping strategy.
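The line-count limit and the empty-line ratio can be read together: inspect at most a fixed number of lines, measure the share that are empty, and compare it with the threshold. The sketch below shows that arithmetic in plain Python. The function names, the strategy labels, and the direction of the comparison are all assumptions made for illustration; this page does not say which strategy is chosen on which side of the threshold.

```python
def empty_line_ratio(text: str, max_line_count: int) -> float:
    """Share of empty lines among the first max_line_count lines."""
    lines = text.splitlines()[:max_line_count]
    if not lines:
        return 0.0
    empty = sum(1 for line in lines if not line.strip())
    return empty / len(lines)


def choose_strategy(text: str, max_line_count: int, threshold: float) -> str:
    """Pick a grouping strategy label (hypothetical names and direction)."""
    ratio = empty_line_ratio(text, max_line_count)
    # Assumption: many blank lines suggest paragraphs are already separated.
    return "split_on_blank_lines" if ratio >= threshold else "group_by_line_length"


sparse = "a\n\nb\n\nc\n\nd"
dense = "a\nb\nc\nd"
sparse_ratio = empty_line_ratio(sparse, 4)
sparse_strategy = choose_strategy(sparse, 4, 0.1)
dense_strategy = choose_strategy(dense, 4, 0.1)
```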
- setExtractTagAttributes(attributes: list[str])
Specify which tag attributes should have their values extracted as text when parsing tag-based formats (e.g., HTML or XML).
- Parameters:
- attributes : list[str]
List of attribute names to extract.
- Returns:
- self
The instance with updated extractTagAttributes parameter.
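To show what extracting attribute values as text can look like, the sketch below uses Python's standard `html.parser` to collect the values of selected attributes from a small HTML snippet. The class `AttributeTextExtractor` and the sample markup are hypothetical and independent of Spark NLP; the library's own parsing is not described here.

```python
from html.parser import HTMLParser


class AttributeTextExtractor(HTMLParser):
    """Collect the values of selected tag attributes (hypothetical helper)."""

    def __init__(self, attributes: list[str]):
        super().__init__()
        # HTMLParser reports attribute names in lowercase.
        self.wanted = {name.lower() for name in attributes}
        self.values: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.wanted and value:
                self.values.append(value)


extractor = AttributeTextExtractor(["alt", "title"])
extractor.feed('<p title="Intro">Hi <img src="a.png" alt="A red kite"></p>')
extractor.close()
extracted = extractor.values
```

Attributes such as `alt` and `title` often carry human-readable text that would otherwise be lost when only element content is kept.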
- class HasChunkerProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- class HasPdfProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- setPageNumCol(value: str)
Sets page number output column name.
- Parameters:
- value : str
Name of the column for page numbers.
- setOriginCol(value: str)
Sets input column with original file path.
- Parameters:
- value : str
Column name that stores the file path.
- setPartitionNum(value: int)
Sets number of partitions.
- Parameters:
- value : int
Number of partitions to use.
- setStoreSplittedPdf(value: bool)
Sets whether to store byte content of split PDF pages.
- Parameters:
- value : bool
True to store PDF page bytes, False otherwise.
- setSplitPage(value: bool)
Sets whether to split PDF into pages.
- Parameters:
- value : bool
True to split per page, False otherwise.
- setOnlyPageNum(value: bool)
Sets whether to extract only page numbers.
- Parameters:
- value : bool
True to extract only page numbers, False otherwise.
- setTextStripper(value: str)
Sets text stripper type.
- Parameters:
- value : str
Text stripper type for layout and formatting.
- setSort(value: bool)
Sets whether to sort content on the page.
- Parameters:
- value : bool
True to sort content, False otherwise.
- setExtractCoordinates(value: bool)
Sets whether to extract coordinates of text.
- Parameters:
- value : bool
True to extract coordinates, False otherwise.