sparknlp.annotator.dataframe_optimizer#

Module Contents#

Classes#

DataFrameOptimizer

Optimizes a Spark DataFrame by repartitioning, optionally caching, and persisting it to disk.

Functions#

toStringDict(value)

toStringDict(value)[source]#
class DataFrameOptimizer[source]#

Optimizes a Spark DataFrame by repartitioning, optionally caching, and persisting it to disk.

This transformer is intended to improve performance in Spark NLP pipelines or when preparing data for export. Partitioning can be tuned directly via numPartitions, or derived from executorCores and numWorkers. The DataFrame can also be persisted in a specified format (csv, json, or parquet) with additional writer options.
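As a quick illustration, the two partitioning routes below are equivalent (a sketch, assuming Spark NLP is installed and a SparkSession is active):

>>> from sparknlp.annotator.dataframe_optimizer import DataFrameOptimizer
>>> # Direct: request 20 partitions explicitly.
>>> direct = DataFrameOptimizer().setNumPartitions(20)
>>> # Derived: partitions = executorCores * numWorkers (4 * 5 = 20).
>>> derived = DataFrameOptimizer().setExecutorCores(4).setNumWorkers(5)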

Parameters:

executorCores : int, optional

Number of cores per Spark executor (used to compute the number of partitions when numPartitions is not set).

numWorkers : int, optional

Number of executor nodes (used to compute the number of partitions when numPartitions is not set).

numPartitions : int, optional

Target number of partitions for the DataFrame (overrides the cores × workers calculation).

doCache : bool, default False

Whether to cache the DataFrame after repartitioning.

persistPath : str, optional

Path to save the DataFrame output (if persistence is enabled).

persistFormat : str, optional

Format to persist the DataFrame in: one of 'csv', 'json', or 'parquet'.

outputOptions : dict, optional

Dictionary of options for the DataFrameWriter, e.g. {"compression": "snappy"} for parquet (see the sketch below).
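For example, a persistence sketch for CSV output; the option key "header" is a standard Spark DataFrameWriter option rather than anything specific to this transformer:

>>> csv_optimizer = DataFrameOptimizer() \
...     .setNumPartitions(8) \
...     .setPersistPath("/tmp/optimized_csv") \
...     .setPersistFormat("csv") \
...     .setOutputOptions({"header": "true"})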

Notes

  • You must specify either numPartitions, or both executorCores and numWorkers.

  • Schema is preserved; no columns are modified or removed (both notes are illustrated in the sketch below).
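A minimal sketch of both notes, assuming an existing DataFrame input_df:

>>> # Either set numPartitions directly ...
>>> opt_a = DataFrameOptimizer().setNumPartitions(8)
>>> # ... or set both executorCores and numWorkers (4 * 2 = 8 partitions).
>>> opt_b = DataFrameOptimizer().setExecutorCores(4).setNumWorkers(2)
>>> # The transform leaves the schema unchanged.
>>> assert opt_a.transform(input_df).schema == input_df.schema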

Examples

>>> optimizer = DataFrameOptimizer() \
...     .setExecutorCores(4) \
...     .setNumWorkers(5) \
...     .setDoCache(True) \
...     .setPersistPath("/tmp/out") \
...     .setPersistFormat("parquet") \
...     .setOutputOptions({"compression": "snappy"})
>>> optimized_df = optimizer.transform(input_df)
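The optimizer can also run as a stage ahead of other pipeline components. The sketch below is illustrative, assuming a SparkSession started via sparknlp.start() and an input DataFrame with a "text" column:

>>> import sparknlp
>>> from sparknlp.base import DocumentAssembler
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> input_df = spark.createDataFrame([("Spark NLP is fast.",)], ["text"])
>>> optimizer = DataFrameOptimizer().setNumPartitions(4)
>>> assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> pipeline = Pipeline(stages=[optimizer, assembler])
>>> result = pipeline.fit(input_df).transform(input_df)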
executorCores[source]#
numWorkers[source]#
numPartitions[source]#
doCache[source]#
persistPath[source]#
persistFormat[source]#
outputOptions[source]#
setExecutorCores(value: int)[source]#

Set the number of executor cores.

setNumWorkers(value: int)[source]#

Set the number of Spark workers.

setNumPartitions(value: int)[source]#

Set the total number of partitions (overrides cores * workers).

setDoCache(value: bool)[source]#

Set whether to cache the DataFrame.

setPersistPath(value: str)[source]#

Set the path where the DataFrame should be persisted.

setPersistFormat(value: str)[source]#

Set the format in which to persist the DataFrame: one of 'csv', 'json', or 'parquet'.

setOutputOptions(value: dict)[source]#

Set additional options for the DataFrameWriter (e.g., CSV headers).

setParams(**kwargs: Any)[source]#
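
Sets multiple parameters at once via keyword arguments, following the usual pyspark ML convention; a brief sketch, assuming the keyword names mirror the parameter names listed above:

>>> optimizer = DataFrameOptimizer()
>>> optimizer.setParams(numPartitions=10, doCache=True)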