Spark NLP - Advanced Settings

Spark NLP Properties

You can change the following Spark NLP configurations via Spark Configuration:

spark.jsl.settings.pretrained.cache_folder
  Default: ~/cache_pretrained
  The location where pretrained models and pipelines are downloaded and extracted. By default, this is the cache_pretrained directory under the user's home directory.

spark.jsl.settings.storage.cluster_tmp_dir
  Default: hadoop.tmp.dir
  The location used on a cluster for temporary files, such as unpacking indexes for WordEmbeddings. By default, this is the location of hadoop.tmp.dir set via the Hadoop configuration for Apache Spark. NOTE: S3 is not supported; the location must be local, HDFS, or DBFS.

spark.jsl.settings.annotator.log_folder
  Default: ~/annotator_logs
  The location where logs from annotators are saved during training (e.g., NerDLApproach, ClassifierDLApproach, SentimentDLApproach, MultiClassifierDLApproach). By default, this is the annotator_logs directory under the user's home directory.

spark.jsl.settings.aws.credentials.access_key_id
  Default: None
  Your AWS access key ID, used with your S3 bucket to store training log files or to access TensorFlow graphs used in NerDLApproach.

spark.jsl.settings.aws.credentials.secret_access_key
  Default: None
  Your AWS secret access key, used with your S3 bucket to store training log files or to access TensorFlow graphs used in NerDLApproach.

spark.jsl.settings.aws.credentials.session_token
  Default: None
  Your AWS MFA session token, used with your S3 bucket to store training log files or to access TensorFlow graphs used in NerDLApproach.

spark.jsl.settings.aws.s3_bucket
  Default: None
  Your AWS S3 bucket, used to store training log files or TensorFlow graphs used in NerDLApproach.

spark.jsl.settings.aws.region
  Default: None
  Your AWS region, used with your S3 bucket to store training log files or to access TensorFlow graphs used in NerDLApproach.

spark.jsl.settings.onnx.gpuDeviceId
  Default: 0
  Constructs CUDA execution provider options for the specified non-negative device ID.

spark.jsl.settings.onnx.intraOpNumThreads
  Default: 6
  Sets the size of the CPU thread pool used for executing a single graph when executing on CPU.

spark.jsl.settings.onnx.optimizationLevel
  Default: ALL_OPT
  Sets the optimization level of this options object, overriding any previous setting.

spark.jsl.settings.onnx.executionMode
  Default: SEQUENTIAL
  Sets the execution mode of this options object, overriding any previous setting.

How to set Spark NLP Configuration

SparkSession:

You can use .config() during SparkSession creation to set Spark NLP configurations.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000m") \
    .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") \
    .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2") \
    .getOrCreate()

spark-shell:

spark-shell \
  --driver-memory 16g \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=2000M \
  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2

pyspark:

pyspark \
  --driver-memory 16g \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=2000M \
  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2

Databricks:

On a new or existing cluster, add the following under Advanced Options -> Spark tab:

spark.kryoserializer.buffer.max 2000M
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.jsl.settings.pretrained.cache_folder dbfs:/PATH_TO_CACHE
spark.jsl.settings.storage.cluster_tmp_dir dbfs:/PATH_TO_STORAGE
spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS

NOTE: On an existing cluster, you need to restart it after adding new configurations or changing existing properties.

Additional Configuration for Databricks

When running the Email Reader feature, sparknlp.read().email("./email-files"), on Databricks, you must include the following Spark configurations to avoid dependency conflicts:

spark.driver.userClassPathFirst true
spark.executor.userClassPathFirst true

These configurations are required because the Databricks runtime includes a bundled version of the com.sun.mail:jakarta.mail library, which conflicts with jakarta.activation. Setting these properties ensures that user-provided libraries take precedence over those bundled in the Databricks environment, resolving the dependency conflict.
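For instance, a minimal sketch of reading email files once those properties are set on the cluster (the path is the illustrative one from above, and inspecting the result with printSchema/show is just for demonstration):

import sparknlp

# Read raw email files into a Spark DataFrame with Spark NLP's reader;
# on Databricks, a SparkSession is already available in the notebook.
email_df = sparknlp.read().email("./email-files")

# Inspect the parsed result.
email_df.printSchema()
email_df.show(truncate=False)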

S3 Integration

Logging:

To configure an S3 path for logging while training models, set up AWS credentials as well as the S3 path:

spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs")
spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket")
spark.conf.set("spark.jsl.settings.aws.region", "my-region")

Now you can check the logs in the S3 path defined in the spark.jsl.settings.annotator.log_folder property. Make sure to use the s3:// prefix; otherwise, the default configuration is used.
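With those settings in place, annotators that produce training logs write them to the configured S3 folder. A minimal sketch (the input columns and epoch count are illustrative; setEnableOutputLogs is the setter that turns log output on):

from sparknlp.annotator import NerDLApproach

# Illustrative trainer; with the S3 settings above, enabling output
# logs sends the training log for each run to the s3:// log folder.
ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setLabelColumn("label") \
    .setMaxEpochs(10) \
    .setEnableOutputLogs(True)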

TensorFlow Graphs:

To reference an S3 location for downloading graphs, set up AWS credentials:

spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
spark.conf.set("spark.jsl.settings.aws.region", "my-region")

MFA Configuration:

If your AWS account is configured with MFA, you first need to obtain temporary credentials and add the session token to the configuration, as shown in the example below. For logging:

spark.conf.set("spark.jsl.settings.aws.credentials.session_token", "MY_TOKEN")

An example of a bash script that obtains temporary AWS credentials can be found here. The script requires three arguments:

./aws_tmp_credentials.sh iam_user duration serial_number
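As an alternative sketch of the same flow in Python (using boto3's STS get_session_token; the duration, serial number, and token code are placeholders mirroring the script's three arguments):

import boto3

# Exchange an MFA code for temporary credentials, then hand the
# resulting keys and session token to the Spark NLP AWS properties.
sts = boto3.client("sts")
creds = sts.get_session_token(
    DurationSeconds=3600,                                    # duration
    SerialNumber="arn:aws:iam::123456789012:mfa/iam_user",   # serial_number
    TokenCode="123456",                                      # current MFA code
)["Credentials"]

spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", creds["AccessKeyId"])
spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", creds["SecretAccessKey"])
spark.conf.set("spark.jsl.settings.aws.credentials.session_token", creds["SessionToken"])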