sparknlp#
Subpackages#
- sparknlp.annotator: sparknlp.annotator.audio, sparknlp.annotator.classifier_dl, sparknlp.annotator.cleaners, sparknlp.annotator.coref, sparknlp.annotator.cv, sparknlp.annotator.dependency, sparknlp.annotator.embeddings, sparknlp.annotator.er, sparknlp.annotator.keyword_extraction, sparknlp.annotator.ld_dl, sparknlp.annotator.matcher, sparknlp.annotator.ner, sparknlp.annotator.openai, sparknlp.annotator.param, sparknlp.annotator.pos, sparknlp.annotator.sentence, sparknlp.annotator.sentiment, sparknlp.annotator.seq2seq, sparknlp.annotator.similarity, sparknlp.annotator.spell_check, sparknlp.annotator.token, sparknlp.annotator.ws, sparknlp.annotator.chunk2_doc, sparknlp.annotator.chunker, sparknlp.annotator.dataframe_optimizer, sparknlp.annotator.date2_chunk, sparknlp.annotator.document_character_text_splitter, sparknlp.annotator.document_normalizer, sparknlp.annotator.document_token_splitter, sparknlp.annotator.document_token_splitter_test, sparknlp.annotator.graph_extraction, sparknlp.annotator.lemmatizer, sparknlp.annotator.n_gram_generator, sparknlp.annotator.normalizer, sparknlp.annotator.stemmer, sparknlp.annotator.stop_words_cleaner, sparknlp.annotator.tf_ner_dl_graph_builder, sparknlp.annotator.token2_chunk
- sparknlp.base: sparknlp.base.audio_assembler, sparknlp.base.doc2_chunk, sparknlp.base.document_assembler, sparknlp.base.embeddings_finisher, sparknlp.base.finisher, sparknlp.base.gguf_ranking_finisher, sparknlp.base.graph_finisher, sparknlp.base.has_recursive_fit, sparknlp.base.has_recursive_transform, sparknlp.base.image_assembler, sparknlp.base.light_pipeline, sparknlp.base.multi_document_assembler, sparknlp.base.prompt_assembler, sparknlp.base.recursive_pipeline, sparknlp.base.table_assembler, sparknlp.base.token_assembler
- sparknlp.common: sparknlp.common.annotator_approach, sparknlp.common.annotator_model, sparknlp.common.annotator_properties, sparknlp.common.annotator_type, sparknlp.common.coverage_result, sparknlp.common.match_strategy, sparknlp.common.properties, sparknlp.common.read_as, sparknlp.common.recursive_annotator_approach, sparknlp.common.storage, sparknlp.common.utils
- sparknlp.internal
- sparknlp.logging
- sparknlp.partition
- sparknlp.pretrained
- sparknlp.reader
- sparknlp.training
Submodules#
Package Contents#
Functions#
- start(): Starts a PySpark instance with default parameters for Spark NLP.
- version(): Returns the current Spark NLP version.
Attributes#
- start(gpu=False, apple_silicon=False, aarch64=False, memory='16G', cache_folder='', log_folder='', cluster_tmp_dir='', params=None, real_time_output=False, output_level=1)[source]#
Starts a PySpark instance with default parameters for Spark NLP.
The default parameters would result in the equivalent of:
    SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:|release|") \
        .getOrCreate()
- Parameters:
- gpu : bool, optional
Whether to enable GPU acceleration (must be set up correctly), by default False
- apple_silicon : bool, optional
Whether to enable Apple Silicon support for macOS, by default False
- aarch64 : bool, optional
Whether to enable Linux Aarch64 support, by default False
- memory : str, optional
How much memory to allocate for the Spark driver, by default "16G"
- cache_folder : str, optional
The location to download and extract pretrained Models and Pipelines. If not set, it will be in the user's home directory under cache_pretrained.
- log_folder : str, optional
The location to save logs from annotators during training. If not set, it will be in the user's home directory under annotator_logs.
- params : dict, optional
Custom parameters to set for the Spark configuration, by default None.
- cluster_tmp_dir : str, optional
The location to use on a cluster for temporary files such as unpacking indexes for WordEmbeddings. By default, this location is the value of hadoop.tmp.dir set via Hadoop configuration for Apache Spark. NOTE: S3 is not supported; it must be local, HDFS, or DBFS.
- real_time_output : bool, optional
Whether to read and print JVM output in real time, by default False
- output_level : int, optional
Output level for logs, by default 1
- Returns:
SparkSession: The initiated Spark session.
Notes
Since Spark version 3.2, Python 3.6 is deprecated. If you are using this Python version, consider sticking to lower versions of Spark.
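To illustrate how the params argument layers custom settings over the defaults shown above, here is a minimal pure-Python sketch. The build_conf helper is hypothetical and not part of Spark NLP; the actual merging happens inside sparknlp.start():

```python
# Hypothetical sketch: how user-supplied `params` could override the
# default Spark configuration that start() passes to SparkSession.builder.
# `build_conf` is not a real Spark NLP function; it only illustrates
# the precedence rules (user-supplied values win over defaults).
DEFAULT_CONF = {
    "spark.driver.memory": "16G",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer.max": "2000M",
    "spark.driver.maxResultSize": "0",
}

def build_conf(memory="16G", params=None):
    """Merge user params over the defaults; user values take precedence."""
    conf = dict(DEFAULT_CONF)
    conf["spark.driver.memory"] = memory
    if params:
        conf.update(params)
    return conf

# Custom driver memory plus an extra Spark setting:
conf = build_conf(memory="8G", params={"spark.executor.cores": "4"})
```

With the real library, the equivalent call would be sparknlp.start(memory="8G", params={"spark.executor.cores": "4"}).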