sparknlp#
Subpackages#
sparknlp.annotator
sparknlp.annotator.classifier_dl
sparknlp.annotator.dependency
sparknlp.annotator.embeddings
sparknlp.annotator.er
sparknlp.annotator.keyword_extraction
sparknlp.annotator.ld_dl
sparknlp.annotator.matcher
sparknlp.annotator.ner
sparknlp.annotator.openai
sparknlp.annotator.param
sparknlp.annotator.pos
sparknlp.annotator.sentence
sparknlp.annotator.sentiment
sparknlp.annotator.seq2seq
sparknlp.annotator.spell_check
sparknlp.annotator.token
sparknlp.annotator.ws
sparknlp.annotator.chunk2_doc
sparknlp.annotator.chunker
sparknlp.annotator.date2_chunk
sparknlp.annotator.document_character_text_splitter
sparknlp.annotator.document_normalizer
sparknlp.annotator.document_token_splitter
sparknlp.annotator.graph_extraction
sparknlp.annotator.lemmatizer
sparknlp.annotator.n_gram_generator
sparknlp.annotator.normalizer
sparknlp.annotator.stemmer
sparknlp.annotator.stop_words_cleaner
sparknlp.annotator.token2_chunk
sparknlp.base
sparknlp.base.audio_assembler
sparknlp.base.doc2_chunk
sparknlp.base.document_assembler
sparknlp.base.embeddings_finisher
sparknlp.base.finisher
sparknlp.base.graph_finisher
sparknlp.base.has_recursive_fit
sparknlp.base.has_recursive_transform
sparknlp.base.image_assembler
sparknlp.base.light_pipeline
sparknlp.base.prompt_assembler
sparknlp.base.recursive_pipeline
sparknlp.base.table_assembler
sparknlp.base.token_assembler
sparknlp.common
sparknlp.common.annotator_approach
sparknlp.common.annotator_model
sparknlp.common.annotator_properties
sparknlp.common.coverage_result
sparknlp.common.match_strategy
sparknlp.common.properties
sparknlp.common.read_as
sparknlp.common.recursive_annotator_approach
sparknlp.common.storage
sparknlp.common.utils
sparknlp.internal
sparknlp.logging
sparknlp.pretrained
sparknlp.training
Submodules#
Package Contents#
Functions#
- start(): Starts a PySpark instance with default parameters for Spark NLP.
- version(): Returns the current Spark NLP version.
- start(gpu=False, apple_silicon=False, aarch64=False, memory='16G', cache_folder='', log_folder='', cluster_tmp_dir='', params=None, real_time_output=False, output_level=1)#
Starts a PySpark instance with default parameters for Spark NLP.
The default parameters would result in the equivalent of:
SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:|release|") \
    .getOrCreate()
- Parameters:
- gpu : bool, optional
Whether to enable GPU acceleration (must be set up correctly), by default False
- apple_silicon : bool, optional
Whether to enable Apple Silicon support for macOS, by default False
- aarch64 : bool, optional
Whether to enable Linux AArch64 support, by default False
- memory : str, optional
How much memory to allocate for the Spark driver, by default “16G”
- cache_folder : str, optional
The location to download and extract pretrained Models and Pipelines. If not set, it will be in the user’s home directory under cache_pretrained.
- log_folder : str, optional
The location to save logs from annotators during training. If not set, it will be in the user’s home directory under annotator_logs.
- params : dict, optional
Custom parameters to set for the Spark configuration, by default None.
- cluster_tmp_dir : str, optional
The location to use on a cluster for temporary files, such as unpacking indexes for WordEmbeddings. By default, this is the location of hadoop.tmp.dir, set via the Hadoop configuration for Apache Spark. NOTE: S3 is not supported; the location must be local, HDFS, or DBFS.
- real_time_output : bool, optional
Whether to read and print JVM output in real time, by default False
- output_level : int, optional
Output level for logs, by default 1
- Returns:
SparkSession
The initiated Spark session.
Notes
Since Spark version 3.2, Python 3.6 is deprecated. If you are using this Python version, consider staying on a lower version of Spark.
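As a sketch of typical usage, the params argument accepts a dict of standard Spark configuration keys that override the defaults shown above. The configuration values below are illustrative, not recommendations, and the commented lines assume a working pyspark and spark-nlp installation:

```python
# Custom Spark configuration, passed through to SparkSession.builder.config().
# These are standard Spark properties; the values here are examples only.
custom_params = {
    "spark.driver.memory": "8G",
    "spark.kryoserializer.buffer.max": "1000M",
}

# Usage sketch (requires pyspark and spark-nlp installed):
# import sparknlp
# spark = sparknlp.start(params=custom_params)
# print(sparknlp.version())
```

Keys supplied in params are applied on top of the defaults, so only the properties you want to change need to be listed.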