Getting Started#

Spark NLP Cheat Sheet#

This cheat sheet can be used as a quick reference on how to set up your environment:

# Install Spark NLP from PyPI
pip install spark-nlp==6.1.0

# Install Spark NLP from Anaconda/Conda
conda install -c johnsnowlabs spark-nlp==6.1.0

# Load Spark NLP with Spark Shell
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.0

# Load Spark NLP with PySpark
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.0

# Load Spark NLP with Spark Submit
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.0

# Load Spark NLP as external JAR after compiling and building Spark NLP by `sbt assembly`
spark-shell --jar spark-nlp-assembly-6.1.0

Requirements#

Spark NLP is built on top of Apache Spark 3.x. For using Spark NLP you need:

Java 8
Apache Spark (from 2.3.x to 3.3.x)
Python 3.8.x if you are using PySpark 3.x
- NOTE: Since Spark version 3.2, Python 3.6 is deprecated. If you are using this python version, consider sticking to lower versions of Spark.
- For Python 3.6.x and 3.7.x we recommend PySpark 2.3.x or 2.4.x

It is recommended to have basic knowledge of the framework and a working environment before using Spark NLP. Please refer to Spark documentation to get started with Spark.

Installation#

First, let’s make sure the installed java version is Java 8 (Oracle or OpenJDK):

java -version
# openjdk version "1.8.0_292"

Using Conda#

Let’s create a new conda environment to manage all the dependencies there.

Then we can create a new environment sparknlp and install the spark-nlp package with pip:

conda create -n sparknlp python=3.8 -y
conda activate sparknlp
conda install -c johnsnowlabs spark-nlp==6.1.0 pyspark==3.4.4 jupyter

Now you should be ready to create a jupyter notebook with Spark NLP running:

jupyter notebook

Using Virtualenv#

We can also create a Python Virtualenv:

virtualenv sparknlp --python=python3.8 # depends on how your Python installation is set up
source sparknlp/bin/activate
pip install spark-nlp==6.1.0 pyspark==3.4.4 jupyter

Now you should be ready to create a jupyter notebook with Spark NLP running:

jupyter notebook

Starting a Spark NLP Session from Python#

A Spark session for Spark NLP can be created (or retrieved) by using sparknlp.start():

import sparknlp
spark = sparknlp.start()

If you need to manually start SparkSession because you have other configurations and sparknlp.start() is not including them, you can manually start the SparkSession with:

SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.0") \
    .getOrCreate()