Spark NLP - Advanced Settings

Spark NLP Properties

You can change the following Spark NLP configurations via Spark Configuration:

spark.jsl.settings.pretrained.cache_folder
  Default: ~/cache_pretrained
  The location where pretrained models and pipelines are downloaded and extracted. By default, this is the cache_pretrained directory under the user's home directory.

spark.jsl.settings.storage.cluster_tmp_dir
  Default: hadoop.tmp.dir
  The location used on a cluster for temporary files, such as unpacking indexes for WordEmbeddings. By default, this is the location of hadoop.tmp.dir set via the Hadoop configuration for Apache Spark. NOTE: S3 is not supported; the location must be local, HDFS, or DBFS.

spark.jsl.settings.annotator.log_folder
  Default: ~/annotator_logs
  The location where logs from annotators are saved during training (e.g., NerDLApproach, ClassifierDLApproach, SentimentDLApproach, MultiClassifierDLApproach). By default, this is the annotator_logs directory under the user's home directory.

spark.jsl.settings.aws.credentials.access_key_id
  Default: None
  Your AWS access key ID, used with your S3 bucket to store training log files or to access TensorFlow graphs used in NerDLApproach.

spark.jsl.settings.aws.credentials.secret_access_key
  Default: None
  Your AWS secret access key, used with your S3 bucket to store training log files or to access TensorFlow graphs used in NerDLApproach.

spark.jsl.settings.aws.credentials.session_token
  Default: None
  Your AWS MFA session token, used with your S3 bucket to store training log files or to access TensorFlow graphs used in NerDLApproach.

spark.jsl.settings.aws.s3_bucket
  Default: None
  Your AWS S3 bucket, used to store training log files or TensorFlow graphs used in NerDLApproach.

spark.jsl.settings.aws.region
  Default: None
  Your AWS region, used with your S3 bucket to store training log files or to access TensorFlow graphs used in NerDLApproach.

spark.jsl.settings.onnx.gpuDeviceId
  Default: 0
  Constructs CUDA execution provider options for the specified non-negative device ID.

spark.jsl.settings.onnx.intraOpNumThreads
  Default: 6
  Sets the size of the CPU thread pool used for executing a single graph when executing on CPU.

spark.jsl.settings.onnx.optimizationLevel
  Default: ALL_OPT
  Sets the optimization level of this options object, overriding any previous setting.

spark.jsl.settings.onnx.executionMode
  Default: SEQUENTIAL
  Sets the execution mode of this options object, overriding any previous setting.

How to set Spark NLP Configuration

SparkSession:

You can use .config() during SparkSession creation to set Spark NLP configurations.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000m") \
    .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") \
    .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2") \
    .getOrCreate()

spark-shell:

spark-shell \
  --driver-memory 16g \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=2000M \
  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2

pyspark:

pyspark \
  --driver-memory 16g \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=2000M \
  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2

Databricks:

On a new or existing cluster, add the following under Advanced Options -> Spark tab:

spark.kryoserializer.buffer.max 2000M
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.jsl.settings.pretrained.cache_folder dbfs:/PATH_TO_CACHE
spark.jsl.settings.storage.cluster_tmp_dir dbfs:/PATH_TO_STORAGE
spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS

NOTE: On an existing cluster, you need to restart it after adding new configurations or changing existing properties.

Additional Configuration for Databricks

When running the Email Reader feature, sparknlp.read().email("./email-files"), on Databricks, you must include the following Spark configurations to avoid dependency conflicts:

spark.driver.userClassPathFirst true
spark.executor.userClassPathFirst true

These configurations are required because the Databricks runtime includes a bundled version of the com.sun.mail:jakarta.mail library, which conflicts with jakarta.activation. Setting these properties ensures that user-provided libraries take precedence over those bundled in the Databricks environment, resolving the dependency conflict.
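For instance, a minimal sketch of reading email files once those properties are set on the cluster (the path is the illustrative one from above, and inspecting the result with printSchema/show is just for demonstration):

import sparknlp

# Read raw email files into a Spark DataFrame with Spark NLP's reader;
# on Databricks, a SparkSession is already available in the notebook.
email_df = sparknlp.read().email("./email-files")

# Inspect the parsed result.
email_df.printSchema()
email_df.show(truncate=False)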

S3 Integration

Logging:

To configure an S3 path for logging while training models, set up AWS credentials as well as the S3 path:

spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs")
spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket")
spark.conf.set("spark.jsl.settings.aws.region", "my-region")

Now you can check the logs in the S3 path defined in the spark.jsl.settings.annotator.log_folder property. Make sure to use the s3:// prefix; otherwise, the default configuration is used.
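With those settings in place, annotators that produce training logs write them to the configured S3 folder. A minimal sketch (the input columns and epoch count are illustrative; setEnableOutputLogs is the setter that turns log output on):

from sparknlp.annotator import NerDLApproach

# Illustrative trainer; with the S3 settings above, enabling output
# logs sends the training log for each run to the s3:// log folder.
ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setLabelColumn("label") \
    .setMaxEpochs(10) \
    .setEnableOutputLogs(True)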

TensorFlow Graphs:

To reference an S3 location for downloading graphs, set up AWS credentials:

spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
spark.conf.set("spark.jsl.settings.aws.region", "my-region")

MFA Configuration:

If your AWS account is configured with MFA, you first need to obtain temporary credentials and add the session token to the configuration, as shown in the example below. For logging:

spark.conf.set("spark.jsl.settings.aws.credentials.session_token", "MY_TOKEN")

An example of a bash script that obtains temporary AWS credentials can be found here. The script requires three arguments:

./aws_tmp_credentials.sh iam_user duration serial_number
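As an alternative sketch of the same flow in Python (using boto3's STS get_session_token; the duration, serial number, and token code are placeholders mirroring the script's three arguments):

import boto3

# Exchange an MFA code for temporary credentials, then hand the
# resulting keys and session token to the Spark NLP AWS properties.
sts = boto3.client("sts")
creds = sts.get_session_token(
    DurationSeconds=3600,                                    # duration
    SerialNumber="arn:aws:iam::123456789012:mfa/iam_user",   # serial_number
    TokenCode="123456",                                      # current MFA code
)["Credentials"]

spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", creds["AccessKeyId"])
spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", creds["SecretAccessKey"])
spark.conf.set("spark.jsl.settings.aws.credentials.session_token", creds["SessionToken"])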