sparknlp#
Subpackages#
- sparknlp.annotator: sparknlp.annotator.audio, sparknlp.annotator.classifier_dl, sparknlp.annotator.cleaners, sparknlp.annotator.coref, sparknlp.annotator.cv, sparknlp.annotator.dependency, sparknlp.annotator.embeddings, sparknlp.annotator.er, sparknlp.annotator.keyword_extraction, sparknlp.annotator.ld_dl, sparknlp.annotator.matcher, sparknlp.annotator.ner, sparknlp.annotator.openai, sparknlp.annotator.param, sparknlp.annotator.pos, sparknlp.annotator.sentence, sparknlp.annotator.sentiment, sparknlp.annotator.seq2seq, sparknlp.annotator.similarity, sparknlp.annotator.spell_check, sparknlp.annotator.token, sparknlp.annotator.ws, sparknlp.annotator.chunk2_doc, sparknlp.annotator.chunker, sparknlp.annotator.dataframe_optimizer, sparknlp.annotator.date2_chunk, sparknlp.annotator.document_character_text_splitter, sparknlp.annotator.document_normalizer, sparknlp.annotator.document_token_splitter, sparknlp.annotator.document_token_splitter_test, sparknlp.annotator.graph_extraction, sparknlp.annotator.lemmatizer, sparknlp.annotator.n_gram_generator, sparknlp.annotator.normalizer, sparknlp.annotator.stemmer, sparknlp.annotator.stop_words_cleaner, sparknlp.annotator.tf_ner_dl_graph_builder, sparknlp.annotator.token2_chunk
- sparknlp.base: sparknlp.base.audio_assembler, sparknlp.base.doc2_chunk, sparknlp.base.document_assembler, sparknlp.base.embeddings_finisher, sparknlp.base.finisher, sparknlp.base.gguf_ranking_finisher, sparknlp.base.graph_finisher, sparknlp.base.has_recursive_fit, sparknlp.base.has_recursive_transform, sparknlp.base.image_assembler, sparknlp.base.light_pipeline, sparknlp.base.multi_document_assembler, sparknlp.base.prompt_assembler, sparknlp.base.recursive_pipeline, sparknlp.base.table_assembler, sparknlp.base.token_assembler
- sparknlp.common: sparknlp.common.annotator_approach, sparknlp.common.annotator_model, sparknlp.common.annotator_properties, sparknlp.common.annotator_type, sparknlp.common.coverage_result, sparknlp.common.match_strategy, sparknlp.common.properties, sparknlp.common.read_as, sparknlp.common.recursive_annotator_approach, sparknlp.common.storage, sparknlp.common.utils
- sparknlp.internal
- sparknlp.logging
- sparknlp.partition
- sparknlp.pretrained
- sparknlp.reader
- sparknlp.training
Submodules#
Package Contents#
Functions#
- start(): Starts a PySpark instance with default parameters for Spark NLP.
- version(): Returns the current Spark NLP version.
Attributes#
- start(gpu=False, apple_silicon=False, aarch64=False, memory='16G', cache_folder='', log_folder='', cluster_tmp_dir='', params=None, real_time_output=False, output_level=1)[source]#
Starts a PySpark instance with default parameters for Spark NLP.
The default parameters would result in the equivalent of:
    SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:|release|") \
        .getOrCreate()
- Parameters:
- gpu : bool, optional
Whether to enable GPU acceleration (must be set up correctly), by default False
- apple_silicon : bool, optional
Whether to enable Apple Silicon support for macOS, by default False
- aarch64 : bool, optional
Whether to enable Linux Aarch64 support, by default False
- memory : str, optional
How much memory to allocate for the Spark driver, by default "16G"
- cache_folder : str, optional
The location to download and extract pretrained Models and Pipelines. If not set, it will be in the user's home directory under cache_pretrained.
- log_folder : str, optional
The location to save logs from annotators during training. If not set, it will be in the user's home directory under annotator_logs.
- params : dict, optional
Custom parameters to set for the Spark configuration, by default None.
- cluster_tmp_dir : str, optional
The location to use on a cluster for temporary files such as unpacking indexes for WordEmbeddings. By default, this location is the value of hadoop.tmp.dir set via Hadoop configuration for Apache Spark. NOTE: S3 is not supported; it must be local, HDFS, or DBFS.
- real_time_output : bool, optional
Whether to read and print JVM output in real time, by default False
- output_level : int, optional
Output level for logs, by default 1
- Returns:
SparkSession: The initiated Spark session.
Notes
Since Spark version 3.2, Python 3.6 is deprecated. If you are using this Python version, consider sticking to lower versions of Spark.
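To illustrate how the params argument layers custom settings over the defaults shown above, here is a minimal pure-Python sketch. The build_conf helper is hypothetical and not part of Spark NLP; the actual merging happens inside sparknlp.start():

```python
# Hypothetical sketch: how user-supplied `params` could override the
# default Spark configuration that start() passes to SparkSession.builder.
# `build_conf` is not a real Spark NLP function; it only illustrates
# the precedence rules (user-supplied values win over defaults).
DEFAULT_CONF = {
    "spark.driver.memory": "16G",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer.max": "2000M",
    "spark.driver.maxResultSize": "0",
}

def build_conf(memory="16G", params=None):
    """Merge user params over the defaults; user values take precedence."""
    conf = dict(DEFAULT_CONF)
    conf["spark.driver.memory"] = memory
    if params:
        conf.update(params)
    return conf

# Custom driver memory plus an extra Spark setting:
conf = build_conf(memory="8G", params={"spark.executor.cores": "4"})
```

With the real library, the equivalent call would be sparknlp.start(memory="8G", params={"spark.executor.cores": "4"}).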