sparknlp.annotator.dataframe_optimizer
Module Contents
Classes

DataFrameOptimizer
    Optimizes a Spark DataFrame by repartitioning, optionally caching, and persisting it to disk.
class DataFrameOptimizer
Optimizes a Spark DataFrame by repartitioning, optionally caching, and persisting it to disk.
This transformer is intended to improve performance for Spark NLP pipelines or when preparing data for export. It allows partition tuning via numPartitions directly, or indirectly using executorCores and numWorkers. The DataFrame can also be persisted in a specified format (csv, json, or parquet) with additional writer options.
- Parameters:
  - executorCores : int, optional
    Number of cores per Spark executor (used to compute the number of partitions if numPartitions is not set).
  - numWorkers : int, optional
    Number of executor nodes (used to compute the number of partitions if numPartitions is not set).
  - numPartitions : int, optional
    Target number of partitions for the DataFrame (overrides the calculation via cores × workers).
  - doCache : bool, default False
    Whether to cache the DataFrame after repartitioning.
  - persistPath : str, optional
    Path to which the DataFrame output is saved (if persistence is enabled).
  - persistFormat : str, optional
    Format to persist the DataFrame in: one of 'csv', 'json', or 'parquet'.
  - outputOptions : dict, optional
    Dictionary of options for the DataFrameWriter (e.g., {"compression": "snappy"} for parquet).
Notes

- You must specify either numPartitions, or both executorCores and numWorkers (the example below uses the latter; a sketch of the direct numPartitions form follows it).
- The schema is preserved; no columns are modified or removed.
Examples
>>> optimizer = DataFrameOptimizer() \
...     .setExecutorCores(4) \
...     .setNumWorkers(5) \
...     .setDoCache(True) \
...     .setPersistPath("/tmp/out") \
...     .setPersistFormat("parquet") \
...     .setOutputOptions({"compression": "snappy"})
>>> optimized_df = optimizer.transform(input_df)
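The partition count can also be set directly instead of being derived from cores × workers. A minimal sketch of that path, assuming a setNumPartitions setter following the same naming pattern as the setters above (the count of 20 is an arbitrary illustration):

>>> optimizer = DataFrameOptimizer().setNumPartitions(20)
>>> optimized_df = optimizer.transform(input_df)
>>> optimized_df.rdd.getNumPartitions()
20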
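Since the class is described as a transformer, it should compose like any other pyspark.ml stage. The following sketch assumes that, with document_assembler standing in for a hypothetical upstream Spark NLP stage and df for an input DataFrame:

>>> from pyspark.ml import Pipeline
>>> optimizer = DataFrameOptimizer().setExecutorCores(4).setNumWorkers(5)
>>> pipeline = Pipeline(stages=[document_assembler, optimizer])
>>> result_df = pipeline.fit(df).transform(df)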