sparknlp

Package Contents

Functions

start([gpu, apple_silicon, aarch64, memory, ...])

Starts a PySpark instance with default parameters for Spark NLP.

version()

Returns the current Spark NLP version.

start(gpu=False, apple_silicon=False, aarch64=False, memory='16G', cache_folder='', log_folder='', cluster_tmp_dir='', params=None, real_time_output=False, output_level=1)

Starts a PySpark instance with default parameters for Spark NLP.

The default parameters would result in the equivalent of:

SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:|release|") \
    .getOrCreate()
Parameters:
gpu : bool, optional

Whether to enable GPU acceleration (must be set up correctly), by default False

apple_silicon : bool, optional

Whether to enable Apple Silicon support for macOS, by default False

aarch64 : bool, optional

Whether to enable Linux Aarch64 support, by default False

memory : str, optional

How much memory to allocate for the Spark driver, by default “16G”

cache_folder : str, optional

The location to download and extract pretrained Models and Pipelines. If not set, it will be in the user's home directory under cache_pretrained.

log_folder : str, optional

The location to save logs from annotators during training. If not set, it will be in the user's home directory under annotator_logs.

cluster_tmp_dir : str, optional

The location to use on a cluster for temporary files, such as unpacked indexes for WordEmbeddings. By default, this is the location of hadoop.tmp.dir set via the Hadoop configuration for Apache Spark. NOTE: S3 is not supported; the location must be local, HDFS, or DBFS.

params : dict, optional

Custom parameters to set for the Spark configuration, by default None.

real_time_output : bool, optional

Whether to read and print JVM output in real time, by default False

output_level : int, optional

Output level for logs, by default 1

Returns:
SparkSession

The initiated Spark session.
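
A minimal usage sketch for start(), using only the keyword arguments documented here. The helpers build_start_kwargs and start_session are hypothetical names, not part of Spark NLP, and the spark.executor.memory entry passed through params is only an illustrative configuration key. Actually starting a session assumes Spark NLP, PySpark, and a Java runtime are installed, so that call is left commented out.

```python
def build_start_kwargs(memory="16G", cache_folder="", params=None):
    """Collect keyword arguments for sparknlp.start() (hypothetical helper)."""
    kwargs = {"memory": memory}
    if cache_folder:
        kwargs["cache_folder"] = cache_folder
    if params is not None:
        if not isinstance(params, dict):
            raise TypeError("params must be a dict of Spark configuration keys")
        kwargs["params"] = dict(params)  # copy so the caller's dict is not shared
    return kwargs


def start_session(**kwargs):
    """Start a Spark NLP session; needs sparknlp, pyspark, and Java installed."""
    import sparknlp  # imported lazily so the helper above runs without Spark

    return sparknlp.start(**kwargs)


kwargs = build_start_kwargs(memory="8G", params={"spark.executor.memory": "4G"})
print(kwargs)
# spark = start_session(**kwargs)  # uncomment where Spark NLP is installed
```

Keeping the argument assembly separate from the call makes it possible to inspect or log the settings before a session is created.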

Notes

Python 3.6 has been deprecated since Spark 3.2. If you are using this Python version, consider staying on an earlier Spark release.
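
The note above can be turned into a small pre-flight check. This is a sketch only: parse_major_minor and python_36_deprecated are hypothetical helpers, not Spark NLP functions, and the Spark version is assumed to be a string with at least a major and a minor component, such as "3.2.0".

```python
import sys


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '3.2.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def python_36_deprecated(spark_version, python_version=None):
    """True when Python 3.6 is paired with Spark 3.2 or later."""
    if python_version is None:
        python_version = sys.version_info[:2]  # interpreter running this code
    is_py36 = tuple(python_version) == (3, 6)
    return is_py36 and parse_major_minor(spark_version) >= (3, 2)


print(python_36_deprecated("3.2.0", (3, 6)))   # True
print(python_36_deprecated("3.1.2", (3, 6)))   # False
print(python_36_deprecated("3.4.0", (3, 10)))  # False
```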

version()

Returns the current Spark NLP version.

Returns:
str

The current Spark NLP version.
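
Because version() returns a string, comparing it against a minimum requirement is safer after converting it to a tuple of integers. The sketch below assumes a dotted numeric version such as "5.1.4"; version_tuple and meets_minimum are hypothetical helpers, and the minimum "4.2.0" is an arbitrary illustrative value. The call into Spark NLP is left commented out so the block runs without the library installed.

```python
import re


def version_tuple(version):
    """Turn a version string such as '5.1.4' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def meets_minimum(current, minimum):
    """True when `current` is at least `minimum`, compared numerically."""
    return version_tuple(current) >= version_tuple(minimum)


print(meets_minimum("5.1.4", "4.2.0"))   # True
print(meets_minimum("4.10.0", "4.2.0"))  # True, although "4.10.0" < "4.2.0" as strings

# import sparknlp
# print(meets_minimum(sparknlp.version(), "4.2.0"))
```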