Spark NLP Properties
You can change the following Spark NLP configurations via Spark Configuration:
| Property Name | Default | Meaning |
|---|---|---|
| spark.jsl.settings.pretrained.cache_folder | ~/cache_pretrained | The location to download and extract pretrained models and pipelines. By default, this is a cache_pretrained directory in the user's home directory. |
| spark.jsl.settings.storage.cluster_tmp_dir | hadoop.tmp.dir | The location to use on a cluster for temporary files, such as unpacking indexes for WordEmbeddings. By default, this is the location of hadoop.tmp.dir set via the Hadoop configuration for Apache Spark. NOTE: S3 is not supported; it must be local, HDFS, or DBFS. |
| spark.jsl.settings.annotator.log_folder | ~/annotator_logs | The location to save logs from annotators during training, such as NerDLApproach, ClassifierDLApproach, SentimentDLApproach, MultiClassifierDLApproach, etc. By default, this is an annotator_logs directory in the user's home directory. |
| spark.jsl.settings.aws.credentials.access_key_id | None | Your AWS access key ID, used with your S3 bucket to store training log files or access TensorFlow graphs used in NerDLApproach. |
| spark.jsl.settings.aws.credentials.secret_access_key | None | Your AWS secret access key, used with your S3 bucket to store training log files or access TensorFlow graphs used in NerDLApproach. |
| spark.jsl.settings.aws.credentials.session_token | None | Your AWS MFA session token, used with your S3 bucket to store training log files or access TensorFlow graphs used in NerDLApproach. |
| spark.jsl.settings.aws.s3_bucket | None | Your AWS S3 bucket for storing training log files or accessing TensorFlow graphs used in NerDLApproach. |
| spark.jsl.settings.aws.region | None | The AWS region of the S3 bucket used to store training log files or access TensorFlow graphs used in NerDLApproach. |
| spark.jsl.settings.onnx.gpuDeviceId | 0 | Constructs CUDA execution provider options for the specified non-negative device id. |
| spark.jsl.settings.onnx.intraOpNumThreads | 6 | Sets the size of the CPU thread pool used for executing a single graph, when executing on CPU. |
| spark.jsl.settings.onnx.optimizationLevel | ALL_OPT | Sets the optimization level of this options object, overriding the old setting. |
| spark.jsl.settings.onnx.executionMode | SEQUENTIAL | Sets the execution mode of this options object, overriding the old setting. |
How to set Spark NLP Configuration
SparkSession:
You can use .config() during SparkSession creation to set Spark NLP configurations.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000m") \
    .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") \
    .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:6.4.0") \
    .getOrCreate()
```
spark-shell:
```sh
spark-shell \
  --driver-memory 16g \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=2000M \
  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.4.0
```
pyspark:
```sh
pyspark \
  --driver-memory 16g \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=2000M \
  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.4.0
```
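The spark-shell and pyspark launches above share the same flags. If you script such launches, a small helper can assemble the --conf arguments from a dict; this is purely illustrative, and the function name is an assumption rather than part of Spark or Spark NLP:

```python
def build_spark_args(packages: str, confs: dict, driver_memory: str = "16g") -> list:
    """Assemble the CLI arguments shared by spark-shell and pyspark."""
    args = ["--driver-memory", driver_memory]
    for key, value in sorted(confs.items()):
        args += ["--conf", f"{key}={value}"]
    args += ["--packages", packages]
    return args

args = build_spark_args(
    "com.johnsnowlabs.nlp:spark-nlp_2.12:6.4.0",
    {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.jsl.settings.pretrained.cache_folder": "sample_data/pretrained",
    },
)
print("pyspark " + " ".join(args))
```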
Databricks:
On a new cluster or existing one you need to add the following to the Advanced Options -> Spark tab:
```
spark.kryoserializer.buffer.max 2000M
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.jsl.settings.pretrained.cache_folder dbfs:/PATH_TO_CACHE
spark.jsl.settings.storage.cluster_tmp_dir dbfs:/PATH_TO_STORAGE
spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS
```
NOTE: If this is an existing cluster, you need to restart it after adding new configs or changing existing properties.
Additional Configuration for Databricks
When using the Email Reader feature, sparknlp.read().email("./email-files"), on Databricks, you must include the following Spark configurations to avoid dependency conflicts:
```
spark.driver.userClassPathFirst true
spark.executor.userClassPathFirst true
```
These configurations are required because the Databricks runtime environment includes a bundled version of the com.sun.mail:jakarta.mail library, which conflicts with jakarta.activation.
By setting these properties, the application ensures that the user-provided libraries take precedence over those bundled in the Databricks environment, resolving the dependency conflict.
Databricks Unity Catalog Volumes and pretrained models
Databricks documents that some JVM-based operations do not support reading from or writing to Unity Catalog Volumes through standard /Volumes/... paths; see the official Databricks documentation, Work with files on Databricks.
Spark NLP pretrained downloads rely on JVM-side file operations for download, move, and unzip. Because of this Databricks limitation, Unity Catalog Volumes are not supported as Spark NLP download/cache targets for spark.jsl.settings.pretrained.cache_folder, spark.jsl.settings.storage.cluster_tmp_dir, or spark.jsl.settings.annotator.log_folder.
For Databricks environments that store pretrained models on a Unity Catalog Volume, the supported workaround is to place the model artifacts on the Volume outside the Spark NLP .pretrained() flow and then load them directly with .load(model_path).
Load a model already stored on a Unity Catalog Volume
```python
from sparknlp.annotator import NerDLModel

model_path = "/Volumes/<catalog>/<schema>/<volume>/cache_pretrained/ner_dl_en_2.4.3_2.4_1584624950746"

ner_model = NerDLModel.load(model_path) \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
```
S3 Integration
Logging:
To configure an S3 path for logging while training models, set up the AWS credentials as well as the S3 path:

```python
spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs")
spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket")
spark.conf.set("spark.jsl.settings.aws.region", "my-region")
```
You can then find the logs at the S3 path defined in the spark.jsl.settings.annotator.log_folder property. Make sure to use the s3:// prefix; otherwise the default configuration will be used.
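The prefix rule above can be mirrored with a trivial sanity check before setting the property; this is an illustrative guard, not Spark NLP code:

```python
def is_s3_path(path: str) -> bool:
    """True only for paths with the s3:// scheme; anything else is treated
    as a local/default log folder, as described above."""
    return path.startswith("s3://")

print(is_s3_path("s3://my/s3/path/logs"))        # True
print(is_s3_path("sample_data/annotator_logs"))  # False
```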
TensorFlow Graphs:
To reference an S3 location for downloading graphs, set up the AWS credentials:

```python
spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
spark.conf.set("spark.jsl.settings.aws.region", "my-region")
```
MFA Configuration:
If your AWS account is configured with MFA, you will first need to obtain temporary credentials and add the session token to the configuration, as shown in the example below. For logging:

```python
spark.conf.set("spark.jsl.settings.aws.credentials.session_token", "MY_TOKEN")
```

An example of a bash script that gets temporary AWS credentials can be found here. The script requires three arguments:

```sh
./aws_tmp_credentials.sh iam_user duration serial_number
```