sparknlp.annotator.dataframe_optimizer#

Module Contents#

Classes#

DataFrameOptimizer

Optimizes a Spark DataFrame by repartitioning, optionally caching, and persisting it to disk.

Functions#

toStringDict(value)

toStringDict(value)[source]#
class DataFrameOptimizer[source]#

Optimizes a Spark DataFrame by repartitioning, optionally caching, and persisting it to disk.

This transformer is intended to improve performance in Spark NLP pipelines or when preparing data for export. Partitioning can be tuned directly via numPartitions, or derived from executorCores and numWorkers. The DataFrame can also be persisted in a specified format (csv, json, or parquet) with additional writer options.
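As a quick illustration, the two partitioning routes below are equivalent (a sketch, assuming Spark NLP is installed and a SparkSession is active):

>>> from sparknlp.annotator.dataframe_optimizer import DataFrameOptimizer
>>> # Direct: request 20 partitions explicitly.
>>> direct = DataFrameOptimizer().setNumPartitions(20)
>>> # Derived: partitions = executorCores * numWorkers (4 * 5 = 20).
>>> derived = DataFrameOptimizer().setExecutorCores(4).setNumWorkers(5)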

Parameters:

executorCores : int, optional

Number of cores per Spark executor (used to compute the number of partitions when numPartitions is not set).

numWorkers : int, optional

Number of executor nodes (used to compute the number of partitions when numPartitions is not set).

numPartitions : int, optional

Target number of partitions for the DataFrame (overrides the cores × workers calculation).

doCache : bool, default False

Whether to cache the DataFrame after repartitioning.

persistPath : str, optional

Path to save the DataFrame output (if persistence is enabled).

persistFormat : str, optional

Format to persist the DataFrame in: one of 'csv', 'json', or 'parquet'.

outputOptions : dict, optional

Dictionary of options for the DataFrameWriter, e.g. {"compression": "snappy"} for parquet (see the sketch below).
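For example, a persistence sketch for CSV output; the option key "header" is a standard Spark DataFrameWriter option rather than anything specific to this transformer:

>>> csv_optimizer = DataFrameOptimizer() \
...     .setNumPartitions(8) \
...     .setPersistPath("/tmp/optimized_csv") \
...     .setPersistFormat("csv") \
...     .setOutputOptions({"header": "true"})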

Notes

  • You must specify either numPartitions, or both executorCores and numWorkers.

  • Schema is preserved; no columns are modified or removed (both notes are illustrated in the sketch below).
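A minimal sketch of both notes, assuming an existing DataFrame input_df:

>>> # Either set numPartitions directly ...
>>> opt_a = DataFrameOptimizer().setNumPartitions(8)
>>> # ... or set both executorCores and numWorkers (4 * 2 = 8 partitions).
>>> opt_b = DataFrameOptimizer().setExecutorCores(4).setNumWorkers(2)
>>> # The transform leaves the schema unchanged.
>>> assert opt_a.transform(input_df).schema == input_df.schema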

Examples

>>> optimizer = DataFrameOptimizer() \
...     .setExecutorCores(4) \
...     .setNumWorkers(5) \
...     .setDoCache(True) \
...     .setPersistPath("/tmp/out") \
...     .setPersistFormat("parquet") \
...     .setOutputOptions({"compression": "snappy"})
>>> optimized_df = optimizer.transform(input_df)
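The optimizer can also run as a stage ahead of other pipeline components. The sketch below is illustrative, assuming a SparkSession started via sparknlp.start() and an input DataFrame with a "text" column:

>>> import sparknlp
>>> from sparknlp.base import DocumentAssembler
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> input_df = spark.createDataFrame([("Spark NLP is fast.",)], ["text"])
>>> optimizer = DataFrameOptimizer().setNumPartitions(4)
>>> assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> pipeline = Pipeline(stages=[optimizer, assembler])
>>> result = pipeline.fit(input_df).transform(input_df)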
executorCores[source]#
numWorkers[source]#
numPartitions[source]#
doCache[source]#
persistPath[source]#
persistFormat[source]#
outputOptions[source]#
setExecutorCores(value: int)[source]#

Set the number of executor cores.

setNumWorkers(value: int)[source]#

Set the number of Spark workers.

setNumPartitions(value: int)[source]#

Set the total number of partitions (overrides cores * workers).

setDoCache(value: bool)[source]#

Set whether to cache the DataFrame.

setPersistPath(value: str)[source]#

Set the path where the DataFrame should be persisted.

setPersistFormat(value: str)[source]#

Set the format in which to persist the DataFrame: one of 'csv', 'json', or 'parquet'.

setOutputOptions(value: dict)[source]#

Set additional options for the DataFrameWriter (e.g., CSV headers).

setParams(**kwargs: Any)[source]#
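
Sets multiple parameters at once via keyword arguments, following the usual pyspark ML convention; a brief sketch, assuming the keyword names mirror the parameter names listed above:

>>> optimizer = DataFrameOptimizer()
>>> optimizer.setParams(numPartitions=10, doCache=True)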