Getting Started#
Spark NLP Cheat Sheet#
This cheat sheet can be used as a quick reference on how to set up your environment:
# Install Spark NLP from PyPI
pip install spark-nlp==5.5.1
# Install Spark NLP from Anaconda/Conda
conda install -c johnsnowlabs spark-nlp==5.5.1
# Load Spark NLP with Spark Shell
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1
# Load Spark NLP with PySpark
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1
# Load Spark NLP with Spark Submit
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1
# Load Spark NLP as an external JAR after compiling and building Spark NLP with `sbt assembly`
spark-shell --jars spark-nlp-assembly-5.5.1.jar
Requirements#
Spark NLP is built on top of Apache Spark 3.x. To use Spark NLP you need:

- Java 8
- Apache Spark (from 2.3.x to 3.3.x)
- Python 3.8.x if you are using PySpark 3.x

NOTE: Since Spark version 3.2, Python 3.6 is deprecated. If you are using this Python version, consider sticking to lower versions of Spark. For Python 3.6.x and 3.7.x we recommend PySpark 2.3.x or 2.4.x.

It is recommended to have basic knowledge of the framework and a working environment before using Spark NLP. Please refer to the Spark documentation to get started with Spark.
Installation#
First, let’s make sure the installed Java version is Java 8 (Oracle or OpenJDK):
java -version
# openjdk version "1.8.0_292"
Using Conda#
Let’s create a new conda environment named sparknlp to manage all the dependencies there, and install the spark-nlp package together with pyspark and jupyter:
conda create -n sparknlp python=3.8 -y
conda activate sparknlp
conda install -c johnsnowlabs spark-nlp==5.5.1 pyspark==3.2.3 jupyter
Now you should be ready to create a Jupyter notebook with Spark NLP running:
jupyter notebook
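To verify the setup, you can run a quick check in the first notebook cell. This is a minimal sketch; sparknlp.start() is covered in more detail in the sections below:
import sparknlp

# Start (or retrieve) a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)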
Using Virtualenv#
We can also create a Python Virtualenv:
virtualenv sparknlp --python=python3.8 # depends on how your Python installation is set up
source sparknlp/bin/activate
pip install spark-nlp==5.5.1 pyspark==3.2.3 jupyter
Now you should be ready to create a Jupyter notebook with Spark NLP running:
jupyter notebook
Starting a Spark NLP Session from Python#
A Spark session for Spark NLP can be created (or retrieved) by using sparknlp.start():
import sparknlp
spark = sparknlp.start()
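sparknlp.start() also accepts optional keyword arguments, for example to pick the GPU build, set driver memory, or pass extra Spark configurations. The snippet below is a minimal sketch; the gpu, memory, and params arguments are assumed from recent Spark NLP releases, so check the signature of sparknlp.start() in your installed version:
import sparknlp

# GPU-enabled package, 16 GB of driver memory, plus an extra Spark config
# (keyword arguments assumed from recent releases -- verify against your version)
spark = sparknlp.start(
    gpu=True,
    memory="16G",
    params={"spark.driver.maxResultSize": "0"},
)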
If you have additional configurations that sparknlp.start() does not cover, you can start the SparkSession manually instead:
from pyspark.sql import SparkSession

# The Maven coordinate in spark.jars.packages must match the installed Spark NLP version
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1") \
    .getOrCreate()
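As a quick smoke test that the Spark NLP package was actually resolved for this session, you can instantiate one of the core annotators. This is a minimal sketch and assumes the session above was assigned to spark:
import sparknlp
from sparknlp.base import DocumentAssembler

# If the Spark NLP jar is missing from the session, constructing an annotator fails
print("Spark NLP version:", sparknlp.version())
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")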