sparknlp.partition.partition_properties
Contains classes for partition properties used in reading various document types.
Module Contents
Classes
- HasReaderProperties
- HasEmailReaderProperties
- HasExcelReaderProperties
- HasHTMLReaderProperties
- HasPowerPointProperties
- HasTextReaderProperties
- HasChunkerProperties
- HasPdfProperties
- class HasReaderProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- setContentPath(value: str)
Sets content path.
- Parameters:
- value : str
Path to the content source.
- setContentType(value: str)
Sets content type following MIME specification.
- Parameters:
- value : str
Content type string (MIME format).
- setStoreContent(value: bool)
Sets whether to store raw file content.
- Parameters:
- value : bool
True to include raw file content, False otherwise.
- setTitleFontSize(value: int)
Sets minimum font size for detecting titles.
- Parameters:
- value : int
Minimum font size threshold for title detection.
- setInferTableStructure(value: bool)
Sets whether to infer table structure.
- Parameters:
- value : bool
True to generate HTML table representation, False otherwise.
- setIncludePageBreaks(value: bool)
Sets whether to include page break metadata.
- Parameters:
- value : bool
True to detect and tag page breaks, False otherwise.
- setIgnoreExceptions(value: bool)
Sets whether to ignore exceptions during processing.
- Parameters:
- value : bool
True to ignore exceptions, False otherwise.
- setExplodeDocs(value: bool)
Sets whether to explode the documents into separate rows.
- Parameters:
- value : bool
True to split documents into multiple rows, False to keep them in one row.
- setFlattenOutput(value)
Sets whether to flatten the output to plain text with minimal metadata.
- Parameters:
- value : bool
True to flatten the output to plain text with minimal metadata, False otherwise.
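Each setter above stores a single value in the instance's param map. The sketch below imitates that pattern in plain Python so that it runs without a Spark session. `ParamMapSketch` and `ReaderPropertiesSketch` are hypothetical stand-ins, not Spark NLP classes, and returning `self` so that calls can be chained is an assumption based on the usual PySpark setter convention.

```python
class ParamMapSketch:
    """Hypothetical stand-in for a component that keeps an internal param map."""

    def __init__(self):
        self._param_map = {}

    def _set(self, **kwargs):
        # Store each named value, then return self so calls can be chained
        # (chaining is an assumption, not confirmed by this page).
        self._param_map.update(kwargs)
        return self


class ReaderPropertiesSketch(ParamMapSketch):
    """Hypothetical class mirroring three of the setters documented above."""

    def setContentType(self, value: str):
        return self._set(contentType=value)

    def setStoreContent(self, value: bool):
        return self._set(storeContent=value)

    def setExplodeDocs(self, value: bool):
        return self._set(explodeDocs=value)


reader = (
    ReaderPropertiesSketch()
    .setContentType("text/html")   # MIME-format content type
    .setStoreContent(False)        # do not keep raw file content
    .setExplodeDocs(True)          # one row per extracted element
)
print(reader._param_map)
```

In a real pipeline the same calls would be made on a Spark NLP component that mixes in `HasReaderProperties`; the param keys used here are illustrative only.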
- class HasEmailReaderProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- class HasExcelReaderProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- setCellSeparator(value)
Sets the string used to join cell values in a row when assembling textual output.
- Parameters:
- value : str
Delimiter used to concatenate cell values.
- getCellSeparator()
Gets the string used to join cell values in a row when assembling textual output.
- Returns:
- str
Delimiter used to concatenate cell values.
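To make the role of the cell separator concrete, here is a minimal plain-Python sketch of joining the cells of each row with a delimiter. The helper `join_row`, the tab default, and the handling of empty cells are assumptions made for illustration; they do not describe how the Excel reader is implemented.

```python
def join_row(cells, cell_separator="\t"):
    """Join one row of cell values into a single line of text."""
    # Convert every cell to a string; empty cells (None) become "".
    return cell_separator.join("" if c is None else str(c) for c in cells)


rows = [["Name", "Qty"], ["Widget", 3], ["Gadget", None]]

# Assemble the textual output one line per row, using " | " as the separator.
text = "\n".join(join_row(row, cell_separator=" | ") for row in rows)
print(text)
```

A separator that cannot occur inside cell values keeps the joined text easy to split again downstream.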
- class HasHTMLReaderProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- setTimeout(value)
Sets the timeout (in seconds) for reading remote HTML resources.
- Parameters:
- value : int
Timeout in seconds for remote content retrieval.
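As a rough illustration of what a read timeout in seconds means for remote content, the sketch below uses Python's standard `urllib.request.urlopen`, which accepts a `timeout` argument in seconds. The helper `fetch_html` is hypothetical and is not how Spark NLP retrieves pages; the `data:` URL only keeps the demo offline.

```python
import urllib.request


def fetch_html(url: str, timeout_seconds: int = 10) -> str:
    """Fetch a resource as text, giving up after timeout_seconds."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8", errors="replace")


# A data: URL avoids any network access; a real call would pass an http(s) URL.
html = fetch_html("data:text/html,%3Cp%3Ehello%3C%2Fp%3E", timeout_seconds=5)
print(html)
```

With an http(s) URL, a server that does not respond within the limit causes `urlopen` to raise an exception instead of blocking indefinitely.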
- class HasPowerPointProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- class HasTextReaderProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- class HasChunkerProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- class HasPdfProperties
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
- setPageNumCol(value: str)
Sets page number output column name.
- Parameters:
- value : str
Name of the column for page numbers.
- setOriginCol(value: str)
Sets input column with original file path.
- Parameters:
- value : str
Column name that stores the file path.
- setPartitionNum(value: int)
Sets number of partitions.
- Parameters:
- value : int
Number of partitions to use.
- setStoreSplittedPdf(value: bool)
Sets whether to store byte content of split PDF pages.
- Parameters:
- value : bool
True to store PDF page bytes, False otherwise.
- setSplitPage(value: bool)
Sets whether to split PDF into pages.
- Parameters:
- value : bool
True to split per page, False otherwise.
- setOnlyPageNum(value: bool)
Sets whether to extract only page numbers.
- Parameters:
- value : bool
True to extract only page numbers, False otherwise.
- setTextStripper(value: str)
Sets text stripper type.
- Parameters:
- value : str
Text stripper type for layout and formatting.
- setSort(value: bool)
Sets whether to sort content on the page.
- Parameters:
- value : bool
True to sort content, False otherwise.
- setExtractCoordinates(value: bool)
Sets whether to extract coordinates of text.
- Parameters:
- value : bool
True to extract coordinates, False otherwise.
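The interaction between page splitting, the page number column, and the origin column can be pictured with a small plain-Python sketch. Everything here is an assumption made for illustration: the helper `pdf_rows`, the column names `pagenum` and `path`, the 0-based page numbering, and the single-row layout used when splitting is off. None of it is taken from the Spark NLP implementation.

```python
def pdf_rows(origin, page_texts, split_page=True,
             page_num_col="pagenum", origin_col="path"):
    """Turn the pages of one PDF into a list of row dictionaries."""
    if split_page:
        # One row per page, tagged with its (assumed 0-based) page number.
        return [
            {origin_col: origin, page_num_col: i, "text": text}
            for i, text in enumerate(page_texts)
        ]
    # Otherwise keep the whole document in a single row (page number 0 assumed).
    return [{origin_col: origin, page_num_col: 0, "text": "\n".join(page_texts)}]


pages = ["First page text", "Second page text"]
per_page = pdf_rows("report.pdf", pages, split_page=True)
whole_doc = pdf_rows("report.pdf", pages, split_page=False)
print(len(per_page), len(whole_doc))
```

Splitting per page yields more, smaller rows, which tends to distribute better across partitions than one large row per file.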