How to read this section
All annotators in Spark NLP share a common interface:
- Annotation: Annotation(annotatorType, begin, end, result, metadata, embeddings)
- AnnotatorType: Some annotators share a type. This is not merely descriptive; it also determines the structure of the metadata map in the Annotation. This type is what the inputs and outputs of annotators refer to.
- Inputs: Represents how many and which annotator types are expected in setInputCols(). These are the column names of the outputs of other annotators in the DataFrame.
- Output: Represents the type of the output in the column set with setOutputCol().
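For example, the following minimal Python sketch (column names are only illustrative) wires the DOCUMENT output of a DocumentAssembler into a Tokenizer and inspects the resulting Annotation structure:
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

# DocumentAssembler outputs DOCUMENT-type annotations into the "document" column
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Tokenizer expects a DOCUMENT-type input column and outputs TOKEN-type annotations
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

data = spark.createDataFrame([["Spark NLP annotators share a common interface."]]).toDF("text")
result = Pipeline(stages=[documentAssembler, tokenizer]).fit(data).transform(data)

# Each element is an Annotation(annotatorType, begin, end, result, metadata, embeddings)
result.selectExpr("explode(token) as annotation").show(truncate=False)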
There are two types of annotators:
- Approach: AnnotatorApproach classes extend Estimator and are meant to be trained through fit().
- Model: AnnotatorModel classes extend Transformer and are meant to transform DataFrames through transform().
The Model suffix is stated explicitly when the annotator is the result of a training process. Some annotators, such as Tokenizer, are transformers but do not contain the word Model, since they are not trained annotators.
Model annotators have a pretrained() method on their static object to retrieve the public pre-trained version of a model:
pretrained(name, language, extra_location) -> by default, pretrained() downloads a default model. Some annotators offer more than one model; in that case, use the name, language, or extra location to download the one you need.
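For instance, mirroring the BGEEmbeddings example later in this section, the default model and a specific pretrained model can be requested as follows (a minimal Python sketch; the column names are illustrative):
from sparknlp.annotator import BGEEmbeddings

# Default model, resolved automatically when no name is given
defaultEmbeddings = BGEEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")

# Requesting a specific model by name and language, as listed on the Models Hub
namedEmbeddings = BGEEmbeddings.pretrained("bge_base", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")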
Available Annotators
Annotator | Description | Version |
---|---|---|
AutoGGUFModel | Annotator that uses the llama.cpp library to generate text completions with large language models. | Opensource |
BGEEmbeddings | Sentence embeddings using BGE. | Opensource |
BigTextMatcher | Annotator to match exact phrases (by token) provided in a file against a Document. | Opensource |
Chunk2Doc | Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result. | Opensource |
ChunkEmbeddings | This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs. | Opensource |
ChunkTokenizer | Tokenizes and flattens extracted NER chunks. | Opensource |
Chunker | This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from document. | Opensource |
ClassifierDL | ClassifierDL for generic Multi-class Text Classification. | Opensource |
ContextSpellChecker | Implements a deep-learning based Noisy Channel Model Spell Algorithm. | Opensource |
Date2Chunk | Converts DATE type Annotations to CHUNK type. | Opensource |
DateMatcher | Matches standard date formats into a provided format. | Opensource |
DependencyParser | Unlabeled parser that finds a grammatical relation between two words in a sentence. | Opensource |
Doc2Chunk | Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. | Opensource |
Doc2Vec | Word2Vec model that creates vector representations of words in a text corpus. | Opensource |
DocumentAssembler | Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. | Opensource |
DocumentCharacterTextSplitter | Annotator which splits large documents into chunks of roughly given size. | Opensource |
DocumentNormalizer | Annotator which normalizes raw text from tagged text, e.g. scraped web pages or xml documents, from document type columns into Sentence. | Opensource |
DocumentSimilarityRanker | Annotator that uses LSH techniques present in Spark ML lib to execute approximate nearest neighbors search on top of sentence embeddings. | Opensource |
DocumentTokenSplitter | Annotator that splits large documents into smaller documents based on the number of tokens in the text. | Opensource |
EntityRuler | Fits an Annotator to match exact strings or regex patterns provided in a file against a Document and assigns them a named entity. | Opensource |
EmbeddingsFinisher | Extracts embeddings from Annotations into a more easily usable form. | Opensource |
Finisher | Converts annotation results into a format that is easier to use. It is useful for extracting the results from Spark NLP Pipelines. | Opensource |
GraphExtraction | Extracts a dependency graph between entities. | Opensource |
GraphFinisher | Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF. | Opensource |
ImageAssembler | Prepares images read by Spark into a format that is processable by Spark NLP. | Opensource |
LanguageDetectorDL | Language Identification and Detection by using CNN and RNN architectures in TensorFlow. | Opensource |
Lemmatizer | Finds lemmas out of words with the objective of returning a base dictionary word. | Opensource |
MultiClassifierDL | Multi-label Text Classification. | Opensource |
MultiDateMatcher | Matches standard date formats into a provided format. | Opensource |
MultiDocumentAssembler | Prepares data into a format that is processable by Spark NLP. | Opensource |
NGramGenerator | A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). | Opensource |
NerConverter | Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label. | Opensource |
NerCrf | Extracts Named Entities based on a CRF Model. | Opensource |
NerDL | This Named Entity recognition annotator is a generic NER model based on Neural Networks. | Opensource |
NerOverwriter | Overwrites entities of specified strings. | Opensource |
Normalizer | Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary. | Opensource |
NorvigSweeting Spellchecker | Retrieves tokens and makes corrections automatically if not found in an English dictionary. | Opensource |
POSTagger (Part of speech tagger) | Averaged Perceptron model to tag words part-of-speech. | Opensource |
PromptAssembler | Assembles a sequence of messages into a single string using a template. | Opensource |
RecursiveTokenizer | Tokenizes raw text recursively based on a handful of definable rules. | Opensource |
RegexMatcher | Uses rules to match a set of regular expressions and associate them with a provided identifier. | Opensource |
RegexTokenizer | A tokenizer that splits text by a regex pattern. | Opensource |
SentenceDetector | Annotator that detects sentence boundaries using regular expressions. | Opensource |
SentenceDetectorDL | Detects sentence boundaries using a deep learning approach. | Opensource |
SentenceEmbeddings | Converts the results from WordEmbeddings, BertEmbeddings, or ElmoEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols). | Opensource |
SentimentDL | Annotator for multi-class sentiment analysis. | Opensource |
SentimentDetector | Rule based sentiment detector, which calculates a score based on predefined keywords. | Opensource |
Stemmer | Returns hard-stems out of words with the objective of retrieving the meaningful part of the word. | Opensource |
StopWordsCleaner | This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences. | Opensource |
SymmetricDelete Spellchecker | Symmetric Delete spelling correction algorithm. | Opensource |
TextMatcher | Matches exact phrases (by token) provided in a file against a Document. | Opensource |
Token2Chunk | Converts TOKEN type Annotations to CHUNK type. | Opensource |
TokenAssembler | This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. | Opensource |
Tokenizer | Tokenizes raw text into word pieces (tokens). Identifies tokens with tokenization open standards. A few rules will help customize it if the defaults do not fit user needs. | Opensource |
TypedDependencyParser | Labeled parser that finds a grammatical relation between two words in a sentence. | Opensource |
ViveknSentiment | Sentiment analyser inspired by the algorithm by Vivek Narayanan. | Opensource |
WordEmbeddings | Word Embeddings lookup annotator that maps tokens to vectors. | Opensource |
Word2Vec | Word2Vec model that creates vector representations of words in a text corpus. | Opensource |
WordSegmenter | Tokenizes non-English or non-whitespace-separated texts. | Opensource |
YakeKeywordExtraction | Unsupervised, Corpus-Independent, Domain and Language-Independent and Single-Document keyword extraction. | Opensource |
Available Transformers
Additionally, these transformers are available.
Transformer | Description | Version |
---|---|---|
AlbertEmbeddings | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | Opensource |
AlbertForQuestionAnswering | AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
AlbertForTokenClassification | AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
AlbertForSequenceClassification | AlbertForSequenceClassification can load ALBERT Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
BartForZeroShotClassification | BartForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. | Opensource |
BartTransformer | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Transformer | Opensource |
BertForQuestionAnswering | BertForQuestionAnswering can load Bert Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
BertForSequenceClassification | Bert Models with sequence classification/regression head on top. | Opensource |
BertForTokenClassification | BertForTokenClassification can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
BertForZeroShotClassification | BertForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. | Opensource |
BertSentenceEmbeddings | Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. | Opensource |
CamemBertEmbeddings | CamemBert is based on Facebook’s RoBERTa model released in 2019. | Opensource |
CamemBertForQuestionAnswering | CamemBertForQuestionAnswering can load CamemBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD | Opensource |
CamemBertForSequenceClassification | CamemBertForSequenceClassification can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. | Opensource |
CamemBertForTokenClassification | CamemBertForTokenClassification can load CamemBERT Models with a token classification head on top | Opensource |
CLIPForZeroShotClassification | Zero Shot Image Classifier based on CLIP | Opensource |
ConvNextForImageClassification | ConvNextForImageClassification is an image classifier based on ConvNet models | Opensource |
DeBertaEmbeddings | DeBERTa builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in RoBERTa. | Opensource |
DeBertaForQuestionAnswering | DeBertaForQuestionAnswering can load DeBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
DeBertaForSequenceClassification | DeBertaForSequenceClassification can load DeBerta v2 & v3 Models with sequence classification/regression head on top. | Opensource |
DeBertaForTokenClassification | DeBertaForTokenClassification can load DeBERTA Models v2 and v3 with a token classification head on top. | Opensource |
DistilBertEmbeddings | DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. | Opensource |
DistilBertForQuestionAnswering | DistilBertForQuestionAnswering can load DistilBert Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
DistilBertForSequenceClassification | DistilBertForSequenceClassification can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. | Opensource |
DistilBertForTokenClassification | DistilBertForTokenClassification can load DistilBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
DistilBertForZeroShotClassification | DistilBertForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. | Opensource |
E5Embeddings | Sentence embeddings using E5, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task. | Opensource |
ElmoEmbeddings | Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark. | Opensource |
GPT2Transformer | GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. | Opensource |
HubertForCTC | Hubert Model with a language modeling head on top for Connectionist Temporal Classification (CTC). | Opensource |
InstructorEmbeddings | Sentence embeddings using INSTRUCTOR. | Opensource |
LongformerEmbeddings | Longformer is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. | Opensource |
LongformerForQuestionAnswering | LongformerForQuestionAnswering can load Longformer Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
LongformerForSequenceClassification | LongformerForSequenceClassification can load Longformer Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
LongformerForTokenClassification | LongformerForTokenClassification can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
MarianTransformer | Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. | Opensource |
MPNetEmbeddings | Sentence embeddings using MPNet. | Opensource |
MPNetForQuestionAnswering | MPNet Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
MPNetForSequenceClassification | MPNet Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
OpenAICompletion | Transformer that makes a request for OpenAI Completion API for each executor. | Opensource |
RoBertaEmbeddings | RoBERTa: A Robustly Optimized BERT Pretraining Approach | Opensource |
RoBertaForQuestionAnswering | RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
RoBertaForSequenceClassification | RoBertaForSequenceClassification can load RoBERTa Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
RoBertaForTokenClassification | RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
RoBertaForZeroShotClassification | RoBertaForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. | Opensource |
RoBertaSentenceEmbeddings | Sentence-level embeddings using RoBERTa. | Opensource |
SpanBertCoref | A coreference resolution model based on SpanBert. | Opensource |
SwinForImageClassification | SwinImageClassification is an image classifier based on Swin. | Opensource |
T5Transformer | T5 reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. | Opensource |
TapasForQuestionAnswering | TapasForQuestionAnswering is an implementation of TaPas - a BERT-based model specifically designed for answering questions about tabular data. | Opensource |
UAEEmbeddings | Sentence embeddings using Universal AnglE Embedding (UAE). | Opensource |
UniversalSentenceEncoder | The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. | Opensource |
VisionEncoderDecoderForImageCaptioning | VisionEncoderDecoder model that converts images into text captions. | Opensource |
ViTForImageClassification | Vision Transformer (ViT) for image classification. | Opensource |
Wav2Vec2ForCTC | Wav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC). | Opensource |
WhisperForCTC | Whisper Model with a language modeling head on top for Connectionist Temporal Classification (CTC). | Opensource |
XlmRoBertaEmbeddings | XlmRoBerta is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl | Opensource |
XlmRoBertaForQuestionAnswering | XlmRoBertaForQuestionAnswering can load XLM-RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
XlmRoBertaForSequenceClassification | XlmRoBertaForSequenceClassification can load XLM-RoBERTa Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
XlmRoBertaForTokenClassification | XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
XlmRoBertaForZeroShotClassification | XlmRoBertaForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. | Opensource |
XlmRoBertaSentenceEmbeddings | Sentence-level embeddings using XLM-RoBERTa. | Opensource |
XlnetEmbeddings | XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. | Opensource |
XlnetForTokenClassification | XlnetForTokenClassification can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
XlnetForSequenceClassification | XlnetForSequenceClassification can load XLNet Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
ZeroShotNer | ZeroShotNerModel implements zero shot named entity recognition by utilizing RoBERTa transformer models fine tuned on a question answering task. | Opensource |
AutoGGUFModel
Annotator that uses the llama.cpp library to generate text completions with large language models.
For settable parameters and their explanations, see HasLlamaCppProperties and refer to the llama.cpp documentation of server.cpp for more information. If the parameters are not set, the annotator defaults to the parameters provided by the model.
Pretrained models can be loaded with the pretrained method of the companion object:
val autoGGUFModel = AutoGGUFModel.pretrained()
.setInputCols("document")
.setOutputCol("completions")
The default model is "phi3.5_mini_4k_instruct_q4_gguf"
, if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the AutoGGUFModelTest and the example notebook.
Note: To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method. When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: AutoGGUFModel | Scala API: AutoGGUFModel | Source: AutoGGUFModel |
Show Example
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
... .setInputCol("text") \
... .setOutputCol("document")
>>> autoGGUFModel = AutoGGUFModel.pretrained() \
... .setInputCols(["document"]) \
... .setOutputCol("completions") \
... .setBatchSize(4) \
... .setNPredict(20) \
... .setNGpuLayers(99) \
... .setTemperature(0.4) \
... .setTopK(40) \
... .setTopP(0.9) \
... .setPenalizeNl(True)
>>> pipeline = Pipeline().setStages([document, autoGGUFModel])
>>> data = spark.createDataFrame([["Hello, I am a"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("completions").show(truncate = False)
+-----------------------------------------------------------------------------------------------------------------------------------+
|completions |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 78, new user. I am currently working on a project and I need to create a list of , {prompt -> Hello, I am a}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------+
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val autoGGUFModel = AutoGGUFModel
.pretrained()
.setInputCols("document")
.setOutputCol("completions")
.setBatchSize(4)
.setNPredict(20)
.setNGpuLayers(99)
.setTemperature(0.4f)
.setTopK(40)
.setTopP(0.9f)
.setPenalizeNl(true)
val pipeline = new Pipeline().setStages(Array(document, autoGGUFModel))
val data = Seq("Hello, I am a").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("completions").show(truncate = false)
+-----------------------------------------------------------------------------------------------------------------------------------+
|completions |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 78, new user. I am currently working on a project and I need to create a list of , {prompt -> Hello, I am a}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------+
BGEEmbeddings
Sentence embeddings using BGE.
BGE, or BAAI General Embeddings, a model that can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
Note that this annotator is only supported for Spark Versions 3.4 and up.
Pretrained models can be loaded with the pretrained method of the companion object:
val embeddings = BGEEmbeddings.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
The default model is "bge_base"
, if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see BGEEmbeddingsTestSpec.
Sources:
C-Pack: Packaged Resources To Advance General Chinese Embedding
Paper abstract
We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.
Input Annotator Types: DOCUMENT
Output Annotator Type: SENTENCE_EMBEDDINGS
Python API: BGEEmbeddings | Scala API: BGEEmbeddings | Source: BGEEmbeddings |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BGEEmbeddings.pretrained() \
.setInputCols(["document"]) \
.setOutputCol("bge_embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["bge_embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True)
pipeline = Pipeline().setStages([
documentAssembler,
embeddings,
embeddingsFinisher
])
data = spark.createDataFrame([["query: how much protein should a female eat",
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day." + \
"But, as you can see from this chart, you'll need to increase that if you're expecting or training for a" + \
"marathon. Check out the chart below to see how much protein you should be eating each day.",
]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BGEEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = BGEEmbeddings.pretrained("bge_base", "en")
.setInputCols("document")
.setOutputCol("bge_embeddings")
val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("bge_embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
embeddings,
embeddingsFinisher
))
val data = Seq("query: how much protein should a female eat",
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day." +
But, as you can see from this chart, you'll need to increase that if you're expecting or training for a" +
marathon. Check out the chart below to see how much protein you should be eating each day."
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
BigTextMatcher
Annotator to match exact phrases (by token) provided in a file against a Document.
A text file of predefined phrases must be provided with setStoragePath.
In contrast to the normal TextMatcher, the BigTextMatcher is designed for large corpora.
For extended examples of usage, see the BigTextMatcherTestSpec.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: BigTextMatcher | Scala API: BigTextMatcher | Source: BigTextMatcher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import ReadAs
from pyspark.ml import Pipeline
# In this example, the entities file is of the form
#
# ...
# dolore magna aliqua
# lorem ipsum dolor. sit
# laborum
# ...
#
# where each line represents an entity phrase to be extracted.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
entityExtractor = BigTextMatcher() \
.setInputCols("document", "token") \
.setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
.setOutputCol("entity") \
.setCaseSensitive(False)
pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity)").show(truncate=False)
+--------------------------------------------------------------------+
|col |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []] |
+--------------------------------------------------------------------+
// In this example, the entities file is of the form
//
// ...
// dolore magna aliqua
// lorem ipsum dolor. sit
// laborum
// ...
//
// where each line represents an entity phrase to be extracted.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.BigTextMatcher
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
val entityExtractor = new BigTextMatcher()
.setInputCols("document", "token")
.setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
.setOutputCol("entity")
.setCaseSensitive(false)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity)").show(false)
+--------------------------------------------------------------------+
|col |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []] |
+--------------------------------------------------------------------+
Chunk2Doc
Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.
Input Annotator Types: CHUNK
Output Annotator Type: DOCUMENT
Python API: Chunk2Doc | Scala API: Chunk2Doc | Source: Chunk2Doc |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
# Location entities are extracted and converted back into `DOCUMENT` type for further processing
data = spark.createDataFrame([[1, "New York and New Jersey aren't that far apart actually."]]).toDF("id", "text")
# Extracts Named Entities amongst other things
pipeline = PretrainedPipeline("explain_document_dl")
chunkToDoc = Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
explainResult = pipeline.transform(data)
result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(truncate=False)
+------------------------------------------------------------------------------+
|col |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []] |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
// Location entities are extracted and converted back into `DOCUMENT` type for further processing
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Chunk2Doc
val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")
// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")
val chunkToDoc = new Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
val explainResult = pipeline.transform(data)
val result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(false)
+------------------------------------------------------------------------------+
|col |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []] |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
ChunkEmbeddings
This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs.
For extended examples of usage, see the Examples and the ChunkEmbeddingsTestSpec.
Input Annotator Types: CHUNK, WORD_EMBEDDINGS
Output Annotator Type: WORD_EMBEDDINGS
Python API: ChunkEmbeddings | Scala API: ChunkEmbeddings | Source: ChunkEmbeddings |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# Extract the Embeddings from the NGrams
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
nGrams = NGramGenerator() \
.setInputCols(["token"]) \
.setOutputCol("chunk") \
.setN(2)
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
# Convert the NGram chunks into Word Embeddings
chunkEmbeddings = ChunkEmbeddings() \
.setInputCols(["chunk", "embeddings"]) \
.setOutputCol("chunk_embeddings") \
.setPoolingStrategy("AVERAGE")
pipeline = Pipeline() \
.setStages([
documentAssembler,
sentence,
tokenizer,
nGrams,
embeddings,
chunkEmbeddings
])
data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk_embeddings) as result") \
.select("result.annotatorType", "result.result", "result.embeddings") \
.show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
| annotatorType| result| embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings| This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings| is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.{NGramGenerator, Tokenizer}
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.embeddings.ChunkEmbeddings
import org.apache.spark.ml.Pipeline
// Extract the Embeddings from the NGrams
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val nGrams = new NGramGenerator()
.setInputCols("token")
.setOutputCol("chunk")
.setN(2)
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
.setCaseSensitive(false)
// Convert the NGram chunks into Word Embeddings
val chunkEmbeddings = new ChunkEmbeddings()
.setInputCols("chunk", "embeddings")
.setOutputCol("chunk_embeddings")
.setPoolingStrategy("AVERAGE")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentence,
tokenizer,
nGrams,
embeddings,
chunkEmbeddings
))
val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk_embeddings) as result")
.select("result.annotatorType", "result.result", "result.embeddings")
.show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
| annotatorType| result| embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings| This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings| is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+
ChunkTokenizer
Tokenizes and flattens extracted NER chunks.
The ChunkTokenizer will split the extracted NER CHUNK type Annotations and will create TOKEN type Annotations. The result is then flattened, resulting in a single array.
For extended examples of usage, see the ChunkTokenizerTestSpec.
Input Annotator Types: CHUNK
Output Annotator Type: TOKEN
Python API: ChunkTokenizer | Scala API: ChunkTokenizer | Source: ChunkTokenizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import ReadAs
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
entityExtractor = TextMatcher() \
.setInputCols(["sentence", "token"]) \
.setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT) \
.setOutputCol("entity")
chunkTokenizer = ChunkTokenizer() \
.setInputCols(["entity"]) \
.setOutputCol("chunk_token")
pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
tokenizer,
entityExtractor,
chunkTokenizer
])
data = spark.createDataFrame([
    ["Hello world, my name is Michael, I am an artist and I work at Benezar"],
    ["Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."]
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("entity.result as entity" , "chunk_token.result as chunk_token").show(truncate=False)
+-----------------------------------------------+---------------------------------------------------+
|entity |chunk_token |
+-----------------------------------------------+---------------------------------------------------+
|[world, Michael, work at Benezar] |[world, Michael, work, at, Benezar] |
|[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
+-----------------------------------------------+---------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{ChunkTokenizer, TextMatcher, Tokenizer}
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val entityExtractor = new TextMatcher()
.setInputCols("sentence", "token")
.setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT)
.setOutputCol("entity")
val chunkTokenizer = new ChunkTokenizer()
.setInputCols("entity")
.setOutputCol("chunk_token")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
entityExtractor,
chunkTokenizer
))
val data = Seq(
"Hello world, my name is Michael, I am an artist and I work at Benezar",
"Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("entity.result as entity" , "chunk_token.result as chunk_token").show(false)
+-----------------------------------------------+---------------------------------------------------+
|entity |chunk_token |
+-----------------------------------------------+---------------------------------------------------+
|[world, Michael, work at Benezar] |[world, Michael, work, at, Benezar] |
|[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
+-----------------------------------------------+---------------------------------------------------+
Chunker
This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document.
Extracted part-of-speech tags are mapped onto the sentence, which can then be parsed by regular expressions. The part-of-speech tags are wrapped by angle brackets <> so they are easily distinguishable in the text itself.
This example sentence will result in the form:
"Peter Pipers employees are picking pecks of pickled peppers."
"<NNP><NNP><NNS><VBP><VBG><NNS><IN><JJ><NNS><.>"
To then extract these tags, the regexParsers need to be set, e.g.:
val chunker = new Chunker()
.setInputCols("sentence", "pos")
.setOutputCol("chunk")
.setRegexParsers(Array("<NNP>+", "<NNS>+"))
When defining the regular expressions, tags enclosed in angle brackets are treated as groups, so here specifically "<NNP>+" means one or more nouns in succession. Additional patterns can also be set with addRegexParsers.
For more extended examples, see the Examples and the ChunkerTestSpec.
Input Annotator Types: DOCUMENT, POS
Output Annotator Type: CHUNK
Python API: Chunker | Scala API: Chunker | Source: Chunker |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols("document") \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
POSTag = PerceptronModel.pretrained() \
.setInputCols("document", "token") \
.setOutputCol("pos")
chunker = Chunker() \
.setInputCols("sentence", "pos") \
.setOutputCol("chunk") \
.setRegexParsers(["<NNP>+", "<NNS>+"])
pipeline = Pipeline() \
.setStages([
documentAssembler,
sentence,
tokenizer,
POSTag,
chunker
])
data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk) as result").show(truncate=False)
+-------------------------------------------------------------+
|result |
+-------------------------------------------------------------+
|[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
|[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []] |
|[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []] |
|[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []] |
+-------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Chunker, Tokenizer}
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val POSTag = PerceptronModel.pretrained()
.setInputCols("document", "token")
.setOutputCol("pos")
val chunker = new Chunker()
.setInputCols("sentence", "pos")
.setOutputCol("chunk")
.setRegexParsers(Array("<NNP>+", "<NNS>+"))
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentence,
tokenizer,
POSTag,
chunker
))
val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk) as result").show(false)
+-------------------------------------------------------------+
|result |
+-------------------------------------------------------------+
|[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
|[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []] |
|[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []] |
|[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []] |
+-------------------------------------------------------------+
ClassifierDL
Trains a ClassifierDL for generic Multi-class Text Classification.
ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 100 classes.
For instantiated/pretrained models, see ClassifierDLModel.
Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example shows how to create the test dataset:
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
.fit(test)
.transform(test)
.write
.mode("overwrite")
.parquet("test_data")
val classifier = new ClassifierDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("category")
.setLabelColumn("label")
.setTestDataset("test_data")
For extended examples of usage, see the Examples [1] [2] and the ClassifierDLTestSpec.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: CATEGORY
Note: This annotator accepts a label column of a single item of type String, Int, Float, or Double. UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol.
Python API: ClassifierDLApproach | Scala API: ClassifierDLApproach | Source: ClassifierDLApproach |
Show Example
# In this example, the training data `"sentiment.csv"` has the form of
#
# text,label
# This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
# This was a terrible movie! The acting was bad really bad!,1
# ...
#
# Then training can be done like so:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
smallCorpus = spark.read.option("header","True").csv("src/test/resources/classifier/sentiment.csv")
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
useEmbeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = ClassifierDLApproach() \
.setInputCols("sentence_embeddings") \
.setOutputCol("category") \
.setLabelColumn("label") \
.setBatchSize(64) \
.setMaxEpochs(20) \
.setLr(5e-3) \
.setDropout(0.5)
pipeline = Pipeline() \
.setStages(
[
documentAssembler,
useEmbeddings,
docClassifier
]
)
pipelineModel = pipeline.fit(smallCorpus)
// In this example, the training data `"sentiment.csv"` has the form of
//
// text,label
// This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
// This was a terrible movie! The acting was bad really bad!,1
// ...
//
// Then training can be done like so:
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach
import org.apache.spark.ml.Pipeline
val smallCorpus = spark.read.option("header","true").csv("src/test/resources/classifier/sentiment.csv")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val useEmbeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val docClassifier = new ClassifierDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("category")
.setLabelColumn("label")
.setBatchSize(64)
.setMaxEpochs(20)
.setLr(5e-3f)
.setDropout(0.5f)
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
useEmbeddings,
docClassifier
)
)
val pipelineModel = pipeline.fit(smallCorpus)
ContextSpellChecker
Trains a deep-learning based Noisy Channel Model Spell Algorithm. Correction candidates are extracted combining context information and word information.
For instantiated/pretrained models, see ContextSpellCheckerModel.
Spell checking is a sequence-to-sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:
- Different correction candidates for each word — word level.
- The surrounding text of each word, i.e. its context — sentence level.
- The relative cost of different correction candidates according to the edit operations at the character level it requires — subword level.
For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.
For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language, the Examples and the ContextSpellCheckerTestSpec.
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: ContextSpellCheckerApproach | Scala API: ContextSpellCheckerApproach | Source: ContextSpellCheckerApproach |
Show Example
# For this example, we use the first Sherlock Holmes book as the training dataset.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
spellChecker = ContextSpellCheckerApproach() \
.setInputCols("token") \
.setOutputCol("corrected") \
.setWordMaxDistance(3) \
.setBatchSize(24) \
.setEpochs(8) \
.setLanguageModelClasses(1650) # dependent on vocabulary size
# .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
path = "sherlockholmes.txt"
dataset = spark.read.text(path) \
.toDF("text")
pipelineModel = pipeline.fit(dataset)
// For this example, we use the first Sherlock Holmes book as the training dataset.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val spellChecker = new ContextSpellCheckerApproach()
.setInputCols("token")
.setOutputCol("corrected")
.setWordMaxDistance(3)
.setBatchSize(24)
.setEpochs(8)
.setLanguageModelClasses(1650) // dependent on vocabulary size
// .addVocabClass("_NAME_", names) // Extra classes for correction could be added like this
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
spellChecker
))
val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
.toDF("text")
val pipelineModel = pipeline.fit(dataset)
Date2Chunk
Converts DATE type Annotations to CHUNK type.
This can be useful if annotators following DateMatcher and MultiDateMatcher require CHUNK types. The entity name in the metadata can be changed with setEntityName.
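For example, a minimal Python sketch of overriding the metadata entity name (the name "my_date" is only illustrative):
from sparknlp.annotator import Date2Chunk

date2Chunk = Date2Chunk() \
    .setInputCols(["date"]) \
    .setOutputCol("date_chunk") \
    .setEntityName("my_date")  # written to the chunk metadata instead of the default entity name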
Input Annotator Types: DATE
Output Annotator Type: CHUNK
Python API: Date2Chunk | Scala API: Date2Chunk | Source: Date2Chunk |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
date = DateMatcher() \
.setInputCols(["document"]) \
.setOutputCol("date")
date2Chunk = Date2Chunk() \
.setInputCols(["date"]) \
.setOutputCol("date_chunk")
pipeline = Pipeline().setStages([
documentAssembler,
date,
date2Chunk
])
data = spark.createDataFrame([["Omicron is a new variant of COVID-19, which the World Health Organization designated a variant of concern on Nov. 26, 2021/26/11."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("date_chunk").show(1, truncate=False)
+----------------------------------------------------+
|date_chunk |
+----------------------------------------------------+
|[{chunk, 118, 121, 2021/01/01, {sentence -> 0}, []}]|
+----------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val inputFormats = Array("yyyy", "yyyy/dd/MM", "MM/yyyy", "yyyy")
val outputFormat = "yyyy/MM/dd"
val date = new DateMatcher()
.setInputCols("document")
.setOutputCol("date")
val date2Chunk = new Date2Chunk()
.setInputCols("date")
.setOutputCol("date_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
date,
date2Chunk
))
val data = Seq(
"""Omicron is a new variant of COVID-19, which the World Health Organization designated a variant of concern on Nov. 26, 2021/26/11.""",
"""Neighbouring Austria has already locked down its population this week for at until 2021/10/12, becoming the first to reimpose such restrictions."""
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("date_chunk").show(false)
+----------------------------------------------------+
|date_chunk                                          |
+----------------------------------------------------+
|[{chunk, 118, 121, 2021/01/01, {sentence -> 0}, []}]|
|[{chunk, 83, 86, 2021/01/01, {sentence -> 0}, []}]  |
+----------------------------------------------------+
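As noted above, the entity name stored in the chunk metadata can be customized with setEntityName. A minimal Python sketch; the value "DATE_ENTITY" is illustrative:
# Hedged sketch: override the entity name written into the CHUNK metadata.
date2Chunk = Date2Chunk() \
    .setInputCols(["date"]) \
    .setOutputCol("date_chunk") \
    .setEntityName("DATE_ENTITY")  # illustrative value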
DateMatcher
Matches standard date formats into a provided format.
Reads from different forms of date and time expressions and converts them to a provided date format.
Extracts only one date per document. Use with sentence detector to find matches in each sentence. To extract multiple dates from a document, please use the MultiDateMatcher.
Reads the following kind of dates:
"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
"Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
"last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
"next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
"at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"
For example "The 31st of April in the year 2008"
will be converted into 2008/04/31
.
Pretrained pipelines are available for this module, see Pipelines.
For extended examples of usage, see the Examples and the DateMatcherTestSpec.
Input Annotator Types: DOCUMENT
Output Annotator Type: DATE
Python API: DateMatcher | Scala API: DateMatcher | Source: DateMatcher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
date = DateMatcher() \
.setInputCols("document") \
.setOutputCol("date") \
.setAnchorDateYear(2020) \
.setAnchorDateMonth(1) \
.setAnchorDateDay(11) \
.setDateFormat("yyyy/MM/dd")
pipeline = Pipeline().setStages([
documentAssembler,
date
])
data = spark.createDataFrame([["Fri, 21 Nov 1997"], ["next week at 7.30"], ["see you a day after"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("date").show(truncate=False)
+-------------------------------------------------+
|date |
+-------------------------------------------------+
|[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
|[[date, 0, 8, 2020/01/18, [sentence -> 0], []]] |
|[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
+-------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.DateMatcher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val date = new DateMatcher()
.setInputCols("document")
.setOutputCol("date")
.setAnchorDateYear(2020)
.setAnchorDateMonth(1)
.setAnchorDateDay(11)
.setDateFormat("yyyy/MM/dd")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
date
))
val data = Seq("Fri, 21 Nov 1997", "next week at 7.30", "see you a day after").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("date").show(false)
+-------------------------------------------------+
|date |
+-------------------------------------------------+
|[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
|[[date, 0, 8, 2020/01/18, [sentence -> 0], []]] |
|[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
+-------------------------------------------------+
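Since DateMatcher returns at most one date per document, MultiDateMatcher is the drop-in choice when several dates per document are expected. A minimal Python sketch of the swap, reusing the pipeline stages from the Python example above:
# Hedged sketch: MultiDateMatcher consumes DOCUMENT and produces DATE annotations
# just like DateMatcher, but keeps every date found in the document.
multiDate = MultiDateMatcher() \
    .setInputCols(["document"]) \
    .setOutputCol("date")
pipeline = Pipeline().setStages([documentAssembler, multiDate])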
DependencyParser
Trains an unlabeled parser that finds grammatical relations between two words in a sentence.
For instantiated/pretrained models, see DependencyParserModel.
Dependency parser provides information about word relationships. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions.
The required training data can be set in two different ways (only one can be chosen for a particular model):
- Dependency treebank in the Penn Treebank format set with setDependencyTreeBank
- Dataset in the CoNLL-U format set with setConllU (see the sketch after the example below)
Apart from that, no additional training data is needed.
See DependencyParserApproachTestSpec for further reference on how to use this API.
Input Annotator Types: DOCUMENT, POS, TOKEN
Output Annotator Type: DEPENDENCY
Python API: DependencyParserApproach | Scala API: DependencyParserApproach | Source: DependencyParserApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols("document") \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
posTagger = PerceptronModel.pretrained() \
.setInputCols("sentence", "token") \
.setOutputCol("pos")
dependencyParserApproach = DependencyParserApproach() \
.setInputCols("sentence", "pos", "token") \
.setOutputCol("dependency") \
.setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParserApproach
])
# Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("pos")
val dependencyParserApproach = new DependencyParserApproach()
.setInputCols("sentence", "pos", "token")
.setOutputCol("dependency")
.setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParserApproach
))
// Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
val emptyDataSet = Seq.empty[String].toDF("text")
val pipelineModel = pipeline.fit(emptyDataSet)
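As listed above, the training data can also be supplied in the CoNLL-U format through setConllU instead of setDependencyTreeBank (only one of the two may be set). A minimal Python sketch; the file path is illustrative:
# Hedged sketch: train the unlabeled parser from a CoNLL-U file.
dependencyParserApproach = DependencyParserApproach() \
    .setInputCols(["sentence", "pos", "token"]) \
    .setOutputCol("dependency") \
    .setConllU("path/to/train.conllu")  # illustrative path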
Doc2Chunk
Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.
Chunk text must be contained within input DOCUMENT. May be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.
Input Annotator Types: DOCUMENT
Output Annotator Type: CHUNK
Python API: Doc2Chunk | Scala API: Doc2Chunk | Source: Doc2Chunk |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
chunkAssembler = Doc2Chunk() \
.setInputCols("document") \
.setChunkCol("target") \
.setOutputCol("chunk") \
.setIsArray(True)
data = spark.createDataFrame([[
"Spark NLP is an open-source text processing library for advanced natural language processing.",
["Spark NLP", "text processing library", "natural language processing"]
]]).toDF("text", "target")
pipeline = Pipeline().setStages([documentAssembler, chunkAssembler]).fit(data)
result = pipeline.transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val chunkAssembler = new Doc2Chunk()
.setInputCols("document")
.setChunkCol("target")
.setOutputCol("chunk")
.setIsArray(true)
val data = Seq(
("Spark NLP is an open-source text processing library for advanced natural language processing.",
Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")
val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
val result = pipeline.transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
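When the chunk column holds a single string rather than an array, setIsArray is simply left disabled. A minimal Python sketch with a StringType target column:
# Hedged sketch: chunkCol given as a plain StringType column.
data = spark.createDataFrame([[
    "Spark NLP is an open-source text processing library.",
    "Spark NLP"
]]).toDF("text", "target")
chunkAssembler = Doc2Chunk() \
    .setInputCols("document") \
    .setChunkCol("target") \
    .setOutputCol("chunk") \
    .setIsArray(False)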
Doc2Vec
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
We use the Word2Vec implementation from Spark ML. It uses the skip-gram model and hierarchical softmax to train the model. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Doc2VecModel.
Sources :
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Input Annotator Types: TOKEN
Output Annotator Type: SENTENCE_EMBEDDINGS
Python API: Doc2VecApproach | Scala API: Doc2VecApproach | Source: Doc2VecApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = Doc2VecApproach() \
.setInputCols(["token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings
])
path = "sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")
pipelineModel = pipeline.fit(dataset)
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Doc2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = new Doc2VecApproach()
.setInputCols("token")
.setOutputCol("embeddings")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings
))
val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
.toDF("text")
val pipelineModel = pipeline.fit(dataset)
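Training can be tuned with the Word2Vec-style parameters exposed by Doc2VecApproach. A minimal Python sketch; the values are illustrative, not recommendations:
# Hedged sketch: commonly adjusted Doc2VecApproach parameters (illustrative values).
embeddings = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings") \
    .setVectorSize(100) \
    .setWindowSize(5) \
    .setMinCount(1) \
    .setMaxIter(1)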
DocumentAssembler
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline.
The DocumentAssembler reads String columns. Additionally, setCleanupMode can be used to pre-process the text (default: disabled). For possible options please refer to the Parameters section.
For more extended examples on document pre-processing see the Examples.
Input Annotator Types: NONE
Output Annotator Type: DOCUMENT
Python API: DocumentAssembler | Scala API: DocumentAssembler | Source: DocumentAssembler |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
result = documentAssembler.transform(data)
result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
result.select("document").printSchema
root
|-- document: array (nullable = True)
| |-- element: struct (containsNull = True)
| | |-- annotatorType: string (nullable = True)
| | |-- begin: integer (nullable = False)
| | |-- end: integer (nullable = False)
| | |-- result: string (nullable = True)
| | |-- metadata: map (nullable = True)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = True)
| | |-- embeddings: array (nullable = True)
| | | |-- element: float (containsNull = False)
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val result = documentAssembler.transform(data)
result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
result.select("document").printSchema
root
|-- document: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
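The setCleanupMode option mentioned above can pre-process the text while assembling; for instance, the "shrink" mode collapses extra whitespace and newlines. A minimal Python sketch, assuming the "shrink" mode is available in your version:
# Hedged sketch: enable cleanup while assembling documents ("shrink" collapses extra whitespace).
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")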
DocumentCharacterTextSplitter
Annotator which splits large documents into chunks of roughly given size.
DocumentCharacterTextSplitter takes a list of separators. It takes the separators in order and splits subtexts if they are over the chunk length, considering optional overlap of the chunks.
For example, given chunk size 20 and overlap 5:
"He was, I take it, the most perfect reasoning and observing machine that the world has seen."
["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]
Additionally, you can set
- custom patterns with setSplitPatterns
- whether patterns should be interpreted as regex with setPatternsAreRegex
- whether to keep the separators with setKeepSeparators
- whether to trim whitespaces with setTrimWhitespace
- whether to explode the splits to individual rows with setExplodeSplits
For extended examples of usage, see the DocumentCharacterTextSplitterTest.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: DocumentCharacterTextSplitter | Scala API: DocumentCharacterTextSplitter | Source: DocumentCharacterTextSplitter |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
textDF = spark.read.text(
"sherlockholmes.txt",
wholetext=True
).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text")
textSplitter = DocumentCharacterTextSplitter() \
.setInputCols(["document"]) \
.setOutputCol("splits") \
.setChunkSize(20000) \
.setChunkOverlap(200) \
.setExplodeSplits(True)
pipeline = Pipeline().setStages([documentAssembler, textSplitter])
result = pipeline.fit(textDF).transform(textDF)
result.selectExpr(
"splits.result",
"splits[0].begin",
"splits[0].end",
"splits[0].end - splits[0].begin as length") \
.show(8, truncate = 80)
+--------------------------------------------------------------------------------+---------------+-------------+------+
| result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...| 0| 19994| 19994|
|["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...| 19798| 39395| 19597|
|["How did that help you?"\n\n"It was all-important. When a woman thinks that ...| 39371| 59242| 19871|
|["'But,' said I, 'there would be millions of red-headed men who\nwould apply....| 59166| 77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a\nvery capab...| 77835| 97769| 19934|
|["And yet I am not convinced of it," I answered. "The cases which\ncome to li...| 97771| 117248| 19477|
|["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...| 117250| 137242| 19992|
|["That sounds a little paradoxical."\n\n"But it is profoundly True. Singulari...| 137244| 157171| 19927|
+--------------------------------------------------------------------------------+---------------+-------------+------+
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.DocumentAssembler
import org.apache.spark.ml.Pipeline
val textDF =
spark.read
.option("wholetext", "true")
.text("src/test/resources/spell/sherlockholmes.txt")
.toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text")
val textSplitter = new DocumentCharacterTextSplitter()
.setInputCols("document")
.setOutputCol("splits")
.setChunkSize(20000)
.setChunkOverlap(200)
.setExplodeSplits(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
val result = pipeline.fit(textDF).transform(textDF)
result
.selectExpr(
"splits.result",
"splits[0].begin",
"splits[0].end",
"splits[0].end - splits[0].begin as length")
.show(8, truncate = 80)
+--------------------------------------------------------------------------------+---------------+-------------+------+
| result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...| 0| 19994| 19994|
|["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...| 19798| 39395| 19597|
|["How did that help you?"\n\n"It was all-important. When a woman thinks that ...| 39371| 59242| 19871|
|["'But,' said I, 'there would be millions of red-headed men who\nwould apply....| 59166| 77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a\nvery capab...| 77835| 97769| 19934|
|["And yet I am not convinced of it," I answered. "The cases which\ncome to li...| 97771| 117248| 19477|
|["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...| 117250| 137242| 19992|
|["That sounds a little paradoxical."\n\n"But it is profoundly true. Singulari...| 137244| 157171| 19927|
+--------------------------------------------------------------------------------+---------------+-------------+------+
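The splitting behaviour can be adjusted with the setters listed in the description above. A minimal Python sketch using literal separator patterns and keeping the separators in the output; all values are illustrative:
# Hedged sketch: literal (non-regex) separators, separators kept, whitespace trimmed.
textSplitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(1000) \
    .setChunkOverlap(100) \
    .setSplitPatterns(["\n\n", "\n", " "]) \
    .setPatternsAreRegex(False) \
    .setKeepSeparators(True) \
    .setTrimWhitespace(True)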
DocumentNormalizer
Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Unwanted character removal can be applied with a specific policy, and lowercase normalization can optionally be applied.
For extended examples of usage, see the Examples.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: DocumentNormalizer | Scala API: DocumentNormalizer | Source: DocumentNormalizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
cleanUpPatterns = ["<[^>]*>"]
documentNormalizer = DocumentNormalizer() \
.setInputCols("document") \
.setOutputCol("normalizedDocument") \
.setAction("clean") \
.setPatterns(cleanUpPatterns) \
.setReplacement(" ") \
.setPolicy("pretty_all") \
.setLowercase(True)
pipeline = Pipeline().setStages([
documentAssembler,
documentNormalizer
])
text = """
<div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
THE WORLD'S LARGEST WEB DEVELOPER SITE
<h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
<p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
</div>
</div>"""
data = spark.createDataFrame([[text]]).toDF("text")
pipelineModel = pipeline.fit(data)
result = pipelineModel.transform(data)
result.selectExpr("normalizedDocument.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.DocumentNormalizer
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val cleanUpPatterns = Array("<[^>]*>")
val documentNormalizer = new DocumentNormalizer()
.setInputCols("document")
.setOutputCol("normalizedDocument")
.setAction("clean")
.setPatterns(cleanUpPatterns)
.setReplacement(" ")
.setPolicy("pretty_all")
.setLowercase(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
documentNormalizer
))
val text =
"""
<div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
THE WORLD'S LARGEST WEB DEVELOPER SITE
<h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
<p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
</div>
</div>"""
val data = Seq(text).toDF("text")
val pipelineModel = pipeline.fit(data)
val result = pipelineModel.transform(data)
result.selectExpr("normalizedDocument.result").show(truncate=false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
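The same pattern-and-policy mechanism works for other kinds of unwanted content, e.g. stripping e-mail addresses instead of markup. A minimal Python sketch; the regex is illustrative:
# Hedged sketch: remove e-mail addresses with the "clean" action, keeping the rest of the text.
emailPatterns = [r"[\w.+-]+@[\w-]+\.[\w.-]+"]
documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(emailPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(False)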
DocumentSimilarityRanker
Annotator that uses LSH techniques present in Spark ML lib to execute approximate nearest neighbors search on top of sentence embeddings.
It aims to capture the semantic meaning of a document in a dense, continuous vector space and return it to the ranker search.
For instantiated/pretrained models, see DocumentSimilarityRankerModel.
For extended examples of usage, see the jupyter notebook Document Similarity Ranker for Spark NLP.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: DOC_SIMILARITY_RANKINGS
Python API: DocumentSimilarityRankerApproach | Scala API: DocumentSimilarityRankerApproach | Source: DocumentSimilarityRankerApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.annotator.similarity.document_similarity_ranker import *
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_embeddings = E5Embeddings.pretrained() \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
document_similarity_ranker = DocumentSimilarityRankerApproach() \
.setInputCols("sentence_embeddings") \
.setOutputCol("doc_similarity_rankings") \
.setSimilarityMethod("brp") \
.setNumberOfNeighbours(1) \
.setBucketLength(2.0) \
.setNumHashTables(3) \
.setVisibleDistances(True) \
.setIdentityRanking(False)
document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
.setInputCols("doc_similarity_rankings") \
.setOutputCols(
"finished_doc_similarity_rankings_id",
"finished_doc_similarity_rankings_neighbors") \
.setExtractNearestNeighbor(True)
pipeline = Pipeline(stages=[
document_assembler,
sentence_embeddings,
document_similarity_ranker,
document_similarity_ranker_finisher
])
# Documents are coupled (1-2, 3-4, 5-6, 7-8) and were created to be similar on purpose.
data = spark.createDataFrame([
    ["First document, this is my first sentence. This is my second sentence."],
    ["Second document, this is my second sentence. This is my second sentence."],
    ["Third document, climate change is arguably one of the most pressing problems of our time."],
    ["Fourth document, climate change is definitely one of the most pressing problems of our time."],
    ["Fifth document, Florence in Italy, is among the most beautiful cities in Europe."],
    ["Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France."],
    ["Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France."],
    ["Eighth document, the warmest place in France is the French Riviera coast in Southern France."]
]).toDF("text")
docSimRankerPipeline = pipeline.fit(data).transform(data)
(
docSimRankerPipeline
.select(
"finished_doc_similarity_rankings_id",
"finished_doc_similarity_rankings_neighbors"
).show(10, False)
)
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612 |[(1634839239,0.12448559591306324)] |
|1634839239 |[(1510101612,0.12448559591306324)] |
|-612640902 |[(1274183715,0.1220122862046063)] |
|1274183715 |[(-612640902,0.1220122862046063)] |
|-1320876223 |[(1293373212,0.17848855164122393)] |
|1293373212 |[(-1320876223,0.17848855164122393)] |
|-1548374770 |[(-1719102856,0.23297156732534166)] |
|-1719102856 |[(-1548374770,0.23297156732534166)] |
+-----------------------------------+------------------------------------------+
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.similarity.DocumentSimilarityRankerApproach
import com.johnsnowlabs.nlp.finisher.DocumentSimilarityRankerFinisher
import org.apache.spark.ml.Pipeline
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceEmbeddings = RoBertaSentenceEmbeddings
.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val documentSimilarityRanker = new DocumentSimilarityRankerApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("doc_similarity_rankings")
.setSimilarityMethod("brp")
.setNumberOfNeighbours(1)
.setBucketLength(2.0)
.setNumHashTables(3)
.setVisibleDistances(true)
.setIdentityRanking(false)
val documentSimilarityRankerFinisher = new DocumentSimilarityRankerFinisher()
.setInputCols("doc_similarity_rankings")
.setOutputCols(
"finished_doc_similarity_rankings_id",
"finished_doc_similarity_rankings_neighbors")
.setExtractNearestNeighbor(true)
// Let's use a dataset where we can visually control similarity
// Documents are coupled, as 1-2, 3-4, 5-6, 7-8, and they were created to be similar on purpose
val data = Seq(
"First document, this is my first sentence. This is my second sentence.",
"Second document, this is my second sentence. This is my second sentence.",
"Third document, climate change is arguably one of the most pressing problems of our time.",
"Fourth document, climate change is definitely one of the most pressing problems of our time.",
"Fifth document, Florence in Italy, is among the most beautiful cities in Europe.",
"Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France.",
"Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France.",
"Eighth document, the warmest place in France is the French Riviera coast in Southern France.")
.toDF("text")
val pipeline = new Pipeline().setStages(
Array(
documentAssembler,
sentenceEmbeddings,
documentSimilarityRanker,
documentSimilarityRankerFinisher))
val result = pipeline.fit(data).transform(data)
result
.select("finished_doc_similarity_rankings_id", "finished_doc_similarity_rankings_neighbors")
.show(10, truncate = false)
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612 |[(1634839239,0.12448559591306324)] |
|1634839239 |[(1510101612,0.12448559591306324)] |
|-612640902 |[(1274183715,0.1220122862046063)] |
|1274183715 |[(-612640902,0.1220122862046063)] |
|-1320876223 |[(1293373212,0.17848855164122393)] |
|1293373212 |[(-1320876223,0.17848855164122393)] |
|-1548374770 |[(-1719102856,0.23297156732534166)] |
|-1719102856 |[(-1548374770,0.23297156732534166)] |
+-----------------------------------+------------------------------------------+
DocumentTokenSplitter
Annotator that splits large documents into smaller documents based on the number of tokens in the text.
Currently, DocumentTokenSplitter splits the text by whitespaces to create the tokens. The number of these tokens will then be used as a measure of the text length. In the future, other tokenization techniques will be supported.
For example, given 3 tokens and overlap 1:
"He was, I take it, the most perfect reasoning and observing machine that the world has seen."
["He was, I", "I take it,", "it, the most", "most perfect reasoning", "reasoning and observing", "observing machine that", "that the world", "world has seen."]
Additionally, you can set
- whether to trim whitespaces with setTrimWhitespace
- whether to explode the splits to individual rows with setExplodeSplits
For extended examples of usage, see the DocumentTokenSplitterTest.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: DocumentTokenSplitter | Scala API: DocumentTokenSplitter | Source: DocumentTokenSplitter |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
textDF = spark.read.text(
"sherlockholmes.txt",
wholetext=True
).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text")
textSplitter = DocumentTokenSplitter() \
.setInputCols(["document"]) \
.setOutputCol("splits") \
.setNumTokens(512) \
.setTokenOverlap(10) \
.setExplodeSplits(True)
pipeline = Pipeline().setStages([documentAssembler, textSplitter])
result = pipeline.fit(textDF).transform(textDF)
result.selectExpr(
"splits.result as result",
"splits[0].begin as begin",
"splits[0].end as end",
"splits[0].end - splits[0].begin as length",
"splits[0].metadata.numTokens as tokens") \
.show(8, truncate = 80)
+--------------------------------------------------------------------------------+-----+-----+------+------+
| result|begin| end|length|tokens|
+--------------------------------------------------------------------------------+-----+-----+------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...| 0| 3018| 3018| 512|
|[study of crime, and occupied his\nimmense faculties and extraordinary powers...| 2950| 5707| 2757| 512|
|[but as I have changed my clothes I can't imagine how you\ndeduce it. As to M...| 5659| 8483| 2824| 512|
|[quarters received. Be in your chamber then at that hour, and do\nnot take it...| 8427|11241| 2814| 512|
|[a pity\nto miss it."\n\n"But your client--"\n\n"Never mind him. I may want y...|11188|13970| 2782| 512|
|[person who employs me wishes his agent to be unknown to\nyou, and I may conf...|13918|16898| 2980| 512|
|[letters back."\n\n"Precisely so. But how--"\n\n"Was there a secret marriage?...|16836|19744| 2908| 512|
|[seven hundred in\nnotes," he said.\n\nHolmes scribbled a receipt upon a shee...|19683|22551| 2868| 512|
+--------------------------------------------------------------------------------+-----+-----+------+------+
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.DocumentAssembler
import org.apache.spark.ml.Pipeline
val textDF =
spark.read
.option("wholetext", "true")
.text("src/test/resources/spell/sherlockholmes.txt")
.toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text")
val textSplitter = new DocumentTokenSplitter()
.setInputCols("document")
.setOutputCol("splits")
.setNumTokens(512)
.setTokenOverlap(10)
.setExplodeSplits(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
val result = pipeline.fit(textDF).transform(textDF)
result
.selectExpr(
"splits.result as result",
"splits[0].begin as begin",
"splits[0].end as end",
"splits[0].end - splits[0].begin as length",
"splits[0].metadata.numTokens as tokens")
.show(8, truncate = 80)
+--------------------------------------------------------------------------------+-----+-----+------+------+
| result|begin| end|length|tokens|
+--------------------------------------------------------------------------------+-----+-----+------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...| 0| 3018| 3018| 512|
|[study of crime, and occupied his\nimmense faculties and extraordinary powers...| 2950| 5707| 2757| 512|
|[but as I have changed my clothes I can't imagine how you\ndeduce it. As to M...| 5659| 8483| 2824| 512|
|[quarters received. Be in your chamber then at that hour, and do\nnot take it...| 8427|11241| 2814| 512|
|[a pity\nto miss it."\n\n"But your client--"\n\n"Never mind him. I may want y...|11188|13970| 2782| 512|
|[person who employs me wishes his agent to be unknown to\nyou, and I may conf...|13918|16898| 2980| 512|
|[letters back."\n\n"Precisely so. But how--"\n\n"Was there a secret marriage?...|16836|19744| 2908| 512|
|[seven hundred in\nnotes," he said.\n\nHolmes scribbled a receipt upon a shee...|19683|22551| 2868| 512|
+--------------------------------------------------------------------------------+-----+-----+------+------+
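As with the character-based splitter, whitespace trimming and row explosion can be toggled. A minimal Python sketch that keeps all splits in a single array column; the sizes are illustrative:
# Hedged sketch: keep splits in one row instead of exploding them.
textSplitter = DocumentTokenSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setNumTokens(256) \
    .setTokenOverlap(16) \
    .setTrimWhitespace(True) \
    .setExplodeSplits(False)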
EmbeddingsFinisher
Extracts embeddings from Annotations into a more easily usable form.
This is useful, for example, for WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.
By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or vectors which are compatible with Spark ML functions such as LDA, K-means, Random Forest classifier or any other function that requires a featuresCol.
For more extended examples see the Examples.
Input Annotator Types: EMBEDDINGS
Output Annotator Type: NONE
Python API: EmbeddingsFinisher | Scala API: EmbeddingsFinisher | Source: EmbeddingsFinisher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
normalizer = Normalizer() \
.setInputCols("token") \
.setOutputCol("normalized")
stopwordsCleaner = StopWordsCleaner() \
.setInputCols("normalized") \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
gloveEmbeddings = WordEmbeddingsModel.pretrained() \
.setInputCols("document", "cleanTokens") \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols("embeddings") \
.setOutputCols("finished_sentence_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)
data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]) \
.toDF("text")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
normalizer,
stopwordsCleaner,
gloveEmbeddings,
embeddingsFinisher
]).fit(data)
result = pipeline.transform(data)
resultWithSize = result.selectExpr("explode(finished_sentence_embeddings) as embeddings")
resultWithSize.show(5, 80)
+--------------------------------------------------------------------------------+
| embeddings|
+--------------------------------------------------------------------------------+
|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel}
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
val stopwordsCleaner = new StopWordsCleaner()
.setInputCols("normalized")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val gloveEmbeddings = WordEmbeddingsModel.pretrained()
.setInputCols("document", "cleanTokens")
.setOutputCol("embeddings")
.setCaseSensitive(false)
val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_sentence_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)
val data = Seq("Spark NLP is an open-source text processing library.")
.toDF("text")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
normalizer,
stopwordsCleaner,
gloveEmbeddings,
embeddingsFinisher
)).fit(data)
val result = pipeline.transform(data)
val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)")
.map { row =>
val vector = row.getAs[org.apache.spark.ml.linalg.DenseVector](0)
(vector.size, vector)
}.toDF("size", "vector")
resultWithSize.show(5, 80)
+----+--------------------------------------------------------------------------------+
|size| vector|
+----+--------------------------------------------------------------------------------+
| 100|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
| 100|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
| 100|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
| 100|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
| 100|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+----+--------------------------------------------------------------------------------+
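Because setOutputAsVector(true) yields Spark ML vectors, the finished column can be passed directly to Spark ML estimators such as K-means. A minimal Python sketch continuing the Python example above; clustering a handful of token vectors is purely illustrative:
# Hedged sketch: feed the finished vectors into a Spark ML estimator.
from pyspark.ml.clustering import KMeans

features = result.selectExpr("explode(finished_sentence_embeddings) as features")
kmeans = KMeans(k=2, featuresCol="features", predictionCol="cluster")
kmeansModel = kmeans.fit(features)
kmeansModel.transform(features).select("cluster").show()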
EntityRuler
Fits an Annotator to match exact strings or regex patterns provided in a file against a Document and assigns them a named entity. The definitions can contain any number of named entities.
There are multiple ways and formats to set the extraction resource. It is possible to set it either as a “JSON”, “JSONL” or “CSV” file. A path to the file needs to be provided to setPatternsResource. The file format needs to be set as the “format” field in the option parameter map and, depending on the file type, additional parameters might need to be set.
If the file is in a JSON format, then the rule definitions need to be given in a list with the fields “id”, “label” and “patterns”:
[
{
"id": "person-regex",
"label": "PERSON",
"patterns": ["\\w+\\s\\w+", "\\w+-\\w+"]
},
{
"id": "locations-words",
"label": "LOCATION",
"patterns": ["Winterfell"]
}
]
The same fields also apply to a file in the JSONL format:
{"id": "names-with-j", "label": "PERSON", "patterns": ["Jon", "John", "John Snow"]}
{"id": "names-with-s", "label": "PERSON", "patterns": ["Stark", "Snow"]}
{"id": "names-with-e", "label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]}
In order to use a CSV file, an additional parameter “delimiter” needs to be set. In this case, the delimiter might be
set by using .setPatternsResource("patterns.csv", ReadAs.TEXT, Map("format"->"csv", "delimiter" -> "\\|"))
PERSON|Jon
PERSON|John
PERSON|John Snow
LOCATION|Winterfell
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: EntityRulerApproach | Scala API: EntityRulerApproach | Source: EntityRulerApproach |
Show Example
# In this example, the entities file has the form of
#
# PERSON|Jon
# PERSON|John
# PERSON|John Snow
# LOCATION|Winterfell
#
# where each line represents an entity and the associated string delimited by "|".
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
entityRuler = EntityRulerApproach() \
.setInputCols(["document", "token"]) \
.setOutputCol("entities") \
.setPatternsResource(
"patterns.csv",
ReadAs.TEXT,
{"format": "csv", "delimiter": "\\|"}
)
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
entityRuler
])
data = spark.createDataFrame([["Jon Snow wants to be lord of Winterfell."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(entities)").show(truncate=False)
+--------------------------------------------------------------------+
|col |
+--------------------------------------------------------------------+
|[chunk, 0, 2, Jon, [entity -> PERSON, sentence -> 0], []] |
|[chunk, 29, 38, Winterfell, [entity -> LOCATION, sentence -> 0], []]|
+--------------------------------------------------------------------+
// In this example, the entities file has the form of
//
// PERSON|Jon
// PERSON|John
// PERSON|John Snow
// LOCATION|Winterfell
//
// where each line represents an entity and the associated string delimited by "|".
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.er.EntityRulerApproach
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val entityRuler = new EntityRulerApproach()
.setInputCols("document", "token")
.setOutputCol("entities")
.setPatternsResource(
"src/test/resources/entity-ruler/patterns.csv",
ReadAs.TEXT,
{"format": "csv", "delimiter": "|")}
)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
entityRuler
))
val data = Seq("Jon Snow wants to be lord of Winterfell.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(entities)").show(false)
+--------------------------------------------------------------------+
|col |
+--------------------------------------------------------------------+
|[chunk, 0, 2, Jon, [entity -> PERSON, sentence -> 0], []] |
|[chunk, 29, 38, Winterfell, [entity -> LOCATION, sentence -> 0], []]|
+--------------------------------------------------------------------+
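The same pipeline works with the JSON and JSONL formats described above; only the setPatternsResource call changes. A minimal Python sketch, assuming a patterns.json file with the structure shown earlier (the path is illustrative):
# Hedged sketch: read the rules from a JSON file instead of a CSV file.
entityRuler = EntityRulerApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entities") \
    .setPatternsResource(
        "patterns.json",  # illustrative path
        ReadAs.TEXT,
        {"format": "json"}
    )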
Finisher
Converts annotation results into a format that is easier to use. It is useful for extracting the results from Spark NLP Pipelines. The Finisher outputs annotation values into Strings.
For more extended examples on document pre-processing see the Examples.
Input Annotator Types: ANY
Output Annotator Type: NONE
Python API: Finisher | Scala API: Finisher | Source: Finisher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
data = spark.createDataFrame([[1, "New York and New Jersey aren't that far apart actually."]]).toDF("id", "text")
# Extracts Named Entities amongst other things
pipeline = PretrainedPipeline("explain_document_dl")
finisher = Finisher().setInputCols("entities").setOutputCols("output")
explainResult = pipeline.transform(data)
explainResult.selectExpr("explode(entities)").show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------+
result = finisher.transform(explainResult)
result.select("output").show(truncate=False)
+----------------------+
|output |
+----------------------+
|[New York, New Jersey]|
+----------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Finisher
val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")
// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")
val finisher = new Finisher().setInputCols("entities").setOutputCols("output")
val explainResult = pipeline.transform(data)
explainResult.selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------+
val result = finisher.transform(explainResult)
result.select("output").show(false)
+----------------------+
|output |
+----------------------+
|[New York, New Jersey]|
+----------------------+
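The shape of the finished output can be adjusted with the Finisher parameters, for example keeping array output and attaching the annotation metadata. A minimal Python sketch; treat the combination of options as illustrative:
# Hedged sketch: array output with annotation metadata, original annotation columns kept.
finisher = Finisher() \
    .setInputCols(["entities"]) \
    .setOutputCols(["output"]) \
    .setOutputAsArray(True) \
    .setIncludeMetadata(True) \
    .setCleanAnnotations(False)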
GraphExtraction
Extracts a dependency graph between entities.
The GraphExtraction class takes e.g. extracted entities from a NerDLModel and creates a dependency tree which describes how the entities relate to each other. For that a triple store format is used. Nodes represent the entities and the edges represent the relations between those entities. The graph can then be used to find relevant relationships between words.
Both the DependencyParserModel and TypedDependencyParserModel need to be present in the pipeline. There are two ways to set them:
- Both Annotators are present in the pipeline already. The dependencies are taken implicitly from these two Annotators.
- Setting setMergeEntities to true will download the default pretrained models for those two Annotators automatically. The specific models can also be set with setDependencyParserModel and setTypedDependencyParserModel:
val graph_extraction = new GraphExtraction()
  .setInputCols("document", "token", "ner")
  .setOutputCol("graph")
  .setRelationshipTypes(Array("prefer-LOC"))
  .setMergeEntities(true)
  //.setDependencyParserModel(Array("dependency_conllu", "en", "public/models"))
  //.setTypedDependencyParserModel(Array("dependency_typed_conllu", "en", "public/models"))
To transform the resulting graph into a more generic form such as RDF, see the GraphFinisher.
Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY
Output Annotator Type: NODE
Python API: GraphExtraction | Scala API: GraphExtraction | Source: GraphExtraction |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
nerTagger = NerDLModel.pretrained() \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
posTagger = PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
dependencyParser = DependencyParserModel.pretrained() \
.setInputCols(["sentence", "pos", "token"]) \
.setOutputCol("dependency")
typedDependencyParser = TypedDependencyParserModel.pretrained() \
.setInputCols(["dependency", "pos", "token"]) \
.setOutputCol("dependency_type")
graph_extraction = GraphExtraction() \
.setInputCols(["document", "token", "ner"]) \
.setOutputCol("graph") \
.setRelationshipTypes(["prefer-LOC"])
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger,
posTagger,
dependencyParser,
typedDependencyParser,
graph_extraction
])
data = spark.createDataFrame([["You and John prefer the morning flight through Denver"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("graph").show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------+
|graph |
+-----------------------------------------------------------------------------------------------------------------+
|[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]|
+-----------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserModel
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.annotators.GraphExtraction
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val nerTagger = NerDLModel.pretrained()
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
val posTagger = PerceptronModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("pos")
val dependencyParser = DependencyParserModel.pretrained()
.setInputCols("sentence", "pos", "token")
.setOutputCol("dependency")
val typedDependencyParser = TypedDependencyParserModel.pretrained()
.setInputCols("dependency", "pos", "token")
.setOutputCol("dependency_type")
val graph_extraction = new GraphExtraction()
.setInputCols("document", "token", "ner")
.setOutputCol("graph")
.setRelationshipTypes(Array("prefer-LOC"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger,
posTagger,
dependencyParser,
typedDependencyParser,
graph_extraction
))
val data = Seq("You and John prefer the morning flight through Denver").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("graph").show(false)
+-----------------------------------------------------------------------------------------------------------------+
|graph |
+-----------------------------------------------------------------------------------------------------------------+
|[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]|
+-----------------------------------------------------------------------------------------------------------------+
GraphFinisher
Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF.
Input Annotator Types: NONE
Output Annotator Type: NONE
Python API: GraphFinisher | Scala API: GraphFinisher | Source: GraphFinisher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# This is a continuation of the example of
# GraphExtraction. To see how the graph is extracted, see the
# documentation of that class.
graphFinisher = GraphFinisher() \
.setInputCol("graph") \
.setOutputCol("graph_finished")
.setOutputAs[False]
finishedResult = graphFinisher.transform(result)
finishedResult.select("text", "graph_finished").show(truncate=False)
+-----------------------------------------------------+-----------------------------------------------------------------------+
|text |graph_finished |
+-----------------------------------------------------+-----------------------------------------------------------------------+
|You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
+-----------------------------------------------------+-----------------------------------------------------------------------+
// This is a continuation of the example of
// [[com.johnsnowlabs.nlp.annotators.GraphExtraction GraphExtraction]]. To see how the graph is extracted, see the
// documentation of that class.
import com.johnsnowlabs.nlp.GraphFinisher
val graphFinisher = new GraphFinisher()
.setInputCol("graph")
.setOutputCol("graph_finished")
.setOutputAsArray(false)
val finishedResult = graphFinisher.transform(result)
finishedResult.select("text", "graph_finished").show(false)
+-----------------------------------------------------+-----------------------------------------------------------------------+
|text |graph_finished |
+-----------------------------------------------------+-----------------------------------------------------------------------+
|You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
+-----------------------------------------------------+-----------------------------------------------------------------------+
ImageAssembler
Prepares images read by Spark into a format that is processable by Spark NLP. This component is needed to process images.
Input Annotator Types: NONE
Output Annotator Type: IMAGE
Python API: ImageAssembler | Scala API: ImageAssembler | Source: ImageAssembler |
Show Example
import sparknlp
from sparknlp.base import *
from pyspark.ml import Pipeline
data = spark.read.format("image").load("./tmp/images/").toDF("image")
imageAssembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")
result = imageAssembler.transform(data)
result.select("image_assembler").show()
result.select("image_assembler").printSchema()
root
|-- image_assembler: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- origin: string (nullable = true)
| | |-- height: integer (nullable = true)
| | |-- width: integer (nullable = true)
| | |-- nChannels: integer (nullable = true)
| | |-- mode: integer (nullable = true)
| | |-- result: binary (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
import com.johnsnowlabs.nlp.ImageAssembler
import org.apache.spark.ml.Pipeline
val imageDF: DataFrame = spark.read
.format("image")
.option("dropInvalid", value = true)
.load("src/test/resources/image/")
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val pipeline = new Pipeline().setStages(Array(imageAssembler))
val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
pipelineDF.printSchema()
root
|-- image_assembler: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- origin: string (nullable = true)
| | |-- height: integer (nullable = false)
| | |-- width: integer (nullable = false)
| | |-- nChannels: integer (nullable = false)
| | |-- mode: integer (nullable = false)
| | |-- result: binary (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
LanguageDetectorDL
Language Identification and Detection by using CNN and RNN architectures in TensorFlow.
LanguageDetectorDL
is an annotator that detects the language of documents or sentences depending on the inputCols.
The models are trained on large datasets such as Wikipedia and Tatoeba.
Depending on the language (how similar the characters are), the LanguageDetectorDL works
best with text longer than 140 characters.
The output is a language code in Wiki Code style.
Pretrained models can be loaded with pretrained
of the companion object:
val languageDetector = LanguageDetectorDL.pretrained()
.setInputCols("sentence")
.setOutputCol("language")
The default model is "ld_wiki_tatoeba_cnn_21"
, default language is "xx"
(meaning multi-lingual),
if no values are provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and the LanguageDetectorDLTestSpec.
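As a rough sketch, predictions can also be tuned with a confidence threshold and by coalescing sentence-level results (the exact parameter set may differ between versions, so treat the setters below as an assumption and check the API reference):
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
// Sketch only: setThreshold and setCoalesceSentences are assumed to be available here.
val tunedLanguageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx")
  .setInputCols("document")
  .setOutputCol("language")
  .setThreshold(0.3f)          // drop predictions below this confidence
  .setCoalesceSentences(true)  // aggregate sentence-level predictions into one result per document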
Input Annotator Types: DOCUMENT
Output Annotator Type: LANGUAGE
Python API: LanguageDetectorDL | Scala API: LanguageDetectorDL | Source: LanguageDetectorDL |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
languageDetector = LanguageDetectorDL.pretrained() \
.setInputCols("document") \
.setOutputCol("language")
pipeline = Pipeline() \
.setStages([
documentAssembler,
languageDetector
])
data = spark.createDataFrame([
["Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages."],
["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."],
["Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."]
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("language.result").show(truncate=False)
+------+
|result|
+------+
|[en] |
|[fr] |
|[de] |
+------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val languageDetector = LanguageDetectorDL.pretrained()
.setInputCols("document")
.setOutputCol("language")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
languageDetector
))
val data = Seq(
"Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.",
"Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.",
"Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("language.result").show(false)
+------+
|result|
+------+
|[en] |
|[fr] |
|[de] |
+------+
Lemmatizer
Class to find lemmas out of words with the objective of returning a base dictionary word.
Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary
.
The dictionary can be set as a delimited text file.
Pretrained models can be loaded with LemmatizerModel.pretrained
.
For available pretrained models please see the Models Hub. For extended examples of usage, see the Examples.
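For reference, loading a pretrained model instead of training with a dictionary might look like the sketch below (the model name "lemma_antbnc" is given only as an example; pick a model from the Models Hub):
import com.johnsnowlabs.nlp.annotator.LemmatizerModel
// Sketch: a pretrained LemmatizerModel replaces the dictionary-trained Lemmatizer.
// "lemma_antbnc" is an assumed example model name.
val pretrainedLemmatizer = LemmatizerModel.pretrained("lemma_antbnc")
  .setInputCols("token")
  .setOutputCol("lemma")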
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: Lemmatizer | Scala API: Lemmatizer | Source: Lemmatizer |
Show Example
# In this example, the lemma dictionary `lemmas_small.txt` has the form of
#
# ...
# pick -> pick picks picking picked
# peck -> peck pecking pecked pecks
# pickle -> pickle pickles pickled pickling
# pepper -> pepper peppers peppered peppering
# ...
#
# where each key is delimited by `->` and values are delimited by `\t`
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
lemmatizer = Lemmatizer() \
.setInputCols(["token"]) \
.setOutputCol("lemma") \
.setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
pipeline = Pipeline() \
.setStages([
documentAssembler,
sentenceDetector,
tokenizer,
lemmatizer
])
data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
.toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(truncate=False)
+------------------------------------------------------------------+
|result |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
// In this example, the lemma dictionary `lemmas_small.txt` has the form of
//
// ...
// pick -> pick picks picking picked
// peck -> peck pecking pecked pecks
// pickle -> pickle pickles pickled pickling
// pepper -> pepper peppers peppered peppering
// ...
//
// where each key is delimited by `->` and values are delimited by `\t`
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Lemmatizer
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val lemmatizer = new Lemmatizer()
.setInputCols(Array("token"))
.setOutputCol("lemma")
.setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
lemmatizer
))
val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(false)
+------------------------------------------------------------------+
|result |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
MultiClassifierDL
Trains a MultiClassifierDL for Multi-label Text Classification.
MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes.
For instantiated/pretrained models, see MultiClassifierDLModel.
The input to MultiClassifierDL
are Sentence Embeddings such as the state-of-the-art
UniversalSentenceEncoder,
BertSentenceEmbeddings or
SentenceEmbeddings.
In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
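As a small illustration of the binary-vector formulation (plain Scala, not part of the Spark NLP API): given a fixed label vocabulary, the label set of one instance becomes a 0/1 vector.
// Illustrative sketch with hypothetical labels.
val labelVocabulary = Seq("toxic", "obscene", "insult")
val instanceLabels = Set("obscene", "insult")
val y = labelVocabulary.map(label => if (instanceLabels.contains(label)) 1 else 0)
// y == Seq(0, 1, 1)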
Setting a test dataset to monitor model metrics can be done with .setTestDataset
. The method
expects a path to a parquet file containing a dataframe that has the same required columns as
the training dataframe. The pre-processing steps for the training dataframe should also be
applied to the test dataframe. The following example will show how to create the test dataset:
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
.fit(test)
.transform(test)
.write
.mode("overwrite")
.parquet("test_data")
val multiClassifier = new MultiClassifierDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("category")
.setLabelColumn("label")
.setTestDataset("test_data")
For extended examples of usage, see the Examples and the MultiClassifierDLTestSpec.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: CATEGORY
Note: This annotator expects the label column to contain an array of String labels for each row (see the examples below). UniversalSentenceEncoder, BertSentenceEmbeddings, SentenceEmbeddings or other sentence based embeddings can be used for the inputCol
Python API: MultiClassifierDLApproach | Scala API: MultiClassifierDLApproach | Source: MultiClassifierDLApproach |
Show Example
# In this example, the training data has the form
#
# +----------------+--------------------+--------------------+
# | id| text| labels|
# +----------------+--------------------+--------------------+
# |ed58abb40640f983|PN NewsYou mean ... | [toxic]|
# |a1237f726b5f5d89|Dude. Place the ...| [obscene, insult]|
# |24b0d6c8733c2abe|Thanks - thanks ...| [insult]|
# |8c4478fb239bcfc0|" Gee, 5 minutes ...|[toxic, obscene, ...|
# +----------------+--------------------+--------------------+
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# Process training data to create text with associated array of labels
trainDataset.printSchema()
# root
# |-- id: string (nullable = true)
# |-- text: string (nullable = true)
# |-- labels: array (nullable = true)
# | |-- element: string (containsNull = true)
# Then create pipeline for training
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document") \
.setCleanupMode("shrink")
embeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols("document") \
.setOutputCol("embeddings")
docClassifier = MultiClassifierDLApproach() \
.setInputCols("embeddings") \
.setOutputCol("category") \
.setLabelColumn("labels") \
.setBatchSize(128) \
.setMaxEpochs(10) \
.setLr(1e-3) \
.setThreshold(0.5) \
.setValidationSplit(0.1)
pipeline = Pipeline() \
.setStages(
[
documentAssembler,
embeddings,
docClassifier
]
)
pipelineModel = pipeline.fit(trainDataset)
// In this example, the training data has the form (Note: labels can be arbitrary)
//
// mr,ref
// "name[Alimentum], area[city centre], familyFriendly[no], near[Burger King]",Alimentum is an adult establish found in the city centre area near Burger King.
// "name[Alimentum], area[city centre], familyFriendly[yes]",Alimentum is a family-friendly place in the city centre.
// ...
//
// It needs some pre-processing first, so the labels are of type `Array[String]`. This can be done like so:
import spark.implicits._
import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLApproach
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.{col, udf}
// Process training data to create text with associated array of labels
def splitAndTrim = udf { labels: String =>
labels.split(", ").map(x=>x.trim)
}
val smallCorpus = spark.read
.option("header", true)
.option("inferSchema", true)
.option("mode", "DROPMALFORMED")
.csv("src/test/resources/classifier/e2e.csv")
.withColumn("labels", splitAndTrim(col("mr")))
.withColumn("text", col("ref"))
.drop("mr")
smallCorpus.printSchema()
// root
// |-- ref: string (nullable = true)
// |-- labels: array (nullable = true)
// | |-- element: string (containsNull = true)
// Then create pipeline for training
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
.setCleanupMode("shrink")
val embeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
val docClassifier = new MultiClassifierDLApproach()
.setInputCols("embeddings")
.setOutputCol("category")
.setLabelColumn("labels")
.setBatchSize(128)
.setMaxEpochs(10)
.setLr(1e-3f)
.setThreshold(0.5f)
.setValidationSplit(0.1f)
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
embeddings,
docClassifier
)
)
val pipelineModel = pipeline.fit(smallCorpus)
MultiDateMatcher
Matches standard date formats into a provided format.
Reads the following kind of dates:
"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
"Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
"last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
"next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
"at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"
For example "The 31st of April in the year 2008"
will be converted into 2008/04/31
.
For extended examples of usage, see the Examples and the MultiDateMatcherTestSpec.
Input Annotator Types: DOCUMENT
Output Annotator Type: DATE
Python API: MultiDateMatcher | Scala API: MultiDateMatcher | Source: MultiDateMatcher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
date = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("date") \
.setAnchorDateYear(2020) \
.setAnchorDateMonth(1) \
.setAnchorDateDay(11) \
.setDateFormat("yyyy/MM/dd")
pipeline = Pipeline().setStages([
documentAssembler,
date
])
data = spark.createDataFrame([["I saw him yesterday and he told me that he will visit us next week"]]) \
.toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(date) as dates").show(truncate=False)
+-----------------------------------------------+
|dates |
+-----------------------------------------------+
|[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
|[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
+-----------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.MultiDateMatcher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val date = new MultiDateMatcher()
.setInputCols("document")
.setOutputCol("date")
.setAnchorDateYear(2020)
.setAnchorDateMonth(1)
.setAnchorDateDay(11)
.setDateFormat("yyyy/MM/dd")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
date
))
val data = Seq("I saw him yesterday and he told me that he will visit us next week")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(date) as dates").show(false)
+-----------------------------------------------+
|dates |
+-----------------------------------------------+
|[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
|[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
+-----------------------------------------------+
MultiDocumentAssembler
Prepares data into a format that is processable by Spark NLP. This is the entry point for
every Spark NLP pipeline. The MultiDocumentAssembler
can read either a String
column or an
Array[String]
. Additionally, MultiDocumentAssembler.setCleanupMode can be used to
pre-process the text (Default: disabled
). For possible options please refer to the parameters
section.
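For instance, a cleanup mode could be applied to both columns roughly like this (the "shrink" mode is used as an example; see the parameters section for the full list):
import com.johnsnowlabs.nlp.MultiDocumentAssembler
// Sketch: assemble two text columns and pre-process them with a cleanup mode.
val assemblerWithCleanup = new MultiDocumentAssembler()
  .setInputCols("text", "text2")
  .setOutputCols("document1", "document2")
  .setCleanupMode("shrink")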
For more extended examples on document pre-processing see the Examples.
Input Annotator Types: NONE
Output Annotator Type: DOCUMENT
Python API: MultiDocumentAssembler | Scala API: MultiDocumentAssembler | Source: MultiDocumentAssembler |
Show Example
import sparknlp
from sparknlp.base import *
from pyspark.ml import Pipeline
data = spark.createDataFrame([["Spark NLP is an open-source text processing library."], ["Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark"]]).toDF("text", "text2")
documentAssembler = MultiDocumentAssembler().setInputCols(["text", "text2"]).setOutputCols(["document1", "document2"])
result = documentAssembler.transform(data)
result.select("document1").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document1 |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
result.select("document1").printSchema()
root
|-- document1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
import spark.implicits._
import com.johnsnowlabs.nlp.MultiDocumentAssembler
val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val multiDocumentAssembler = new MultiDocumentAssembler().setInputCols("text").setOutputCols("document")
val result = multiDocumentAssembler.transform(data)
result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
result.select("document").printSchema
root
|-- document: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
NGramGenerator
A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
For more extended examples see the Examples and the NGramGeneratorTestSpec.
Input Annotator Types: TOKEN
Output Annotator Type: CHUNK
Python API: NGramGenerator | Scala API: NGramGenerator | Source: NGramGenerator |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
nGrams = NGramGenerator() \
.setInputCols(["token"]) \
.setOutputCol("ngrams") \
.setN(2)
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
nGrams
])
data = spark.createDataFrame([["This is my sentence."]]).toDF("text")
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(ngrams) as result").show(truncate=False)
+------------------------------------------------------------+
|result |
+------------------------------------------------------------+
|[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []] |
|[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []] |
|[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
|[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
+------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.NGramGenerator
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val nGrams = new NGramGenerator()
.setInputCols("token")
.setOutputCol("ngrams")
.setN(2)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
nGrams
))
val data = Seq("This is my sentence.").toDF("text")
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(ngrams) as result").show(false)
+------------------------------------------------------------+
|result |
+------------------------------------------------------------+
|[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []] |
|[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []] |
|[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
|[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
+------------------------------------------------------------+
NerConverter
Converts an IOB or IOB2 representation of NER to a user-friendly one,
by associating the tokens of recognized entities and their label. Results in CHUNK
Annotation type.
NER chunks can then be filtered by setting a whitelist with setWhiteList
.
Chunks with no associated entity (tagged “O”) are filtered.
See also Inside–outside–beginning (tagging) for more information.
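For example, keeping only LOC chunks with the whitelist mentioned above could look like this sketch:
import com.johnsnowlabs.nlp.annotators.ner.NerConverter
// Sketch: only chunks whose entity label is in the whitelist are kept ("LOC" is an example).
val locOnlyConverter = new NerConverter()
  .setInputCols("sentence", "token", "ner")
  .setOutputCol("entities")
  .setWhiteList("LOC")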
Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY
Output Annotator Type: CHUNK
Python API: NerConverter | Scala API: NerConverter | Source: NerConverter |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# This is a continuation of the example of the NerDLModel. See that class
# on how to extract the entities.
# The output of the NerDLModel follows the Annotator schema and can be converted like so:
#
# result.selectExpr("explode(ner)").show(truncate=False)
# +----------------------------------------------------+
# |col |
# +----------------------------------------------------+
# |[named_entity, 0, 2, B-ORG, [word -> U.N], []] |
# |[named_entity, 3, 3, O, [word -> .], []] |
# |[named_entity, 5, 12, O, [word -> official], []] |
# |[named_entity, 14, 18, B-PER, [word -> Ekeus], []] |
# |[named_entity, 20, 24, O, [word -> heads], []] |
# |[named_entity, 26, 28, O, [word -> for], []] |
# |[named_entity, 30, 36, B-LOC, [word -> Baghdad], []]|
# |[named_entity, 37, 37, O, [word -> .], []] |
# +----------------------------------------------------+
#
# After the converter is used:
converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("entities")
converter.transform(result).selectExpr("explode(entities)").show(truncate=False)
+------------------------------------------------------------------------+
|col |
+------------------------------------------------------------------------+
|[chunk, 0, 2, U.N, [entity -> ORG, sentence -> 0, chunk -> 0], []] |
|[chunk, 14, 18, Ekeus, [entity -> PER, sentence -> 0, chunk -> 1], []] |
|[chunk, 30, 36, Baghdad, [entity -> LOC, sentence -> 0, chunk -> 2], []]|
+------------------------------------------------------------------------+
// This is a continuation of the example of the [[com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel NerDLModel]]. See that class
// on how to extract the entities.
// The output of the NerDLModel follows the Annotator schema and can be converted like so:
//
// result.selectExpr("explode(ner)").show(false)
// +----------------------------------------------------+
// |col |
// +----------------------------------------------------+
// |[named_entity, 0, 2, B-ORG, [word -> U.N], []] |
// |[named_entity, 3, 3, O, [word -> .], []] |
// |[named_entity, 5, 12, O, [word -> official], []] |
// |[named_entity, 14, 18, B-PER, [word -> Ekeus], []] |
// |[named_entity, 20, 24, O, [word -> heads], []] |
// |[named_entity, 26, 28, O, [word -> for], []] |
// |[named_entity, 30, 36, B-LOC, [word -> Baghdad], []]|
// |[named_entity, 37, 37, O, [word -> .], []] |
// +----------------------------------------------------+
//
// After the converter is used:
val converter = new NerConverter()
.setInputCols("sentence", "token", "ner")
.setOutputCol("entities")
.setPreservePosition(false)
converter.transform(result).selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------+
|col |
+------------------------------------------------------------------------+
|[chunk, 0, 2, U.N, [entity -> ORG, sentence -> 0, chunk -> 0], []] |
|[chunk, 14, 18, Ekeus, [entity -> PER, sentence -> 0, chunk -> 1], []] |
|[chunk, 30, 36, Baghdad, [entity -> LOC, sentence -> 0, chunk -> 2], []]|
+------------------------------------------------------------------------+
NerCrf
Algorithm for training a Named Entity Recognition Model
For instantiated/pretrained models, see NerCrfModel.
This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning
algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB with
Annotation
type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS
and an
additional label column of annotator type NAMED_ENTITY
.
Excluding the label, these columns can be produced with, for example,
- a SentenceDetector,
- a Tokenizer,
- a PerceptronModel and
- a WordEmbeddingsModel (any word embeddings can be chosen, e.g. BertEmbeddings for BERT-based embeddings).
Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.
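A minimal sketch of adding such a dictionary (the file path and the "," delimiter are assumptions for illustration):
import com.johnsnowlabs.nlp.annotator.NerCrfApproach
// Sketch: provide an external features dictionary to the approach for better accuracy.
val nerTaggerWithFeatures = new NerCrfApproach()
  .setInputCols("sentence", "token", "pos", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setExternalFeatures("src/test/resources/ner-corpus/dict.txt", ",")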
For extended examples of usage, see the Examples and the NerCrfApproachTestSpec.
Input Annotator Types: DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS
Output Annotator Type: NAMED_ENTITY
Python API: NerCrfApproach | Scala API: NerCrfApproach | Source: NerCrfApproach |
Show Example
# This CoNLL dataset already includes the sentence, token, pos and label column with their respective annotator types.
# If a custom dataset is used, these need to be defined.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
nerTagger = NerCrfApproach() \
.setInputCols(["sentence", "token", "pos", "embeddings"]) \
.setLabelColumn("label") \
.setMinEpochs(1) \
.setMaxEpochs(3) \
.setC0(34) \
.setL2(3.0) \
.setOutputCol("ner")
pipeline = Pipeline().setStages([
documentAssembler,
embeddings,
nerTagger
])
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
pipelineModel = pipeline.fit(trainingData)
// This CoNLL dataset already includes the sentence, token, pos and label column with their respective annotator types.
// If a custom dataset is used, these need to be defined.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotator.NerCrfApproach
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
.setCaseSensitive(false)
val nerTagger = new NerCrfApproach()
.setInputCols("sentence", "token", "pos", "embeddings")
.setLabelColumn("label")
.setMinEpochs(1)
.setMaxEpochs(3)
.setC0(34)
.setL2(3.0)
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
embeddings,
nerTagger
))
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
val pipelineModel = pipeline.fit(trainingData)
NerDL
This Named Entity Recognition annotator allows training a generic NER model based on neural networks.
The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art in most datasets.
For instantiated/pretrained models, see NerDLModel.
The training data should be a labeled Spark Dataset, in the format of CoNLL
2003 IOB with Annotation
type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS
and an
additional label column of annotator type NAMED_ENTITY
.
Excluding the label, these columns can be produced with, for example,
- a SentenceDetector,
- a Tokenizer,
- a PerceptronModel and
- a WordEmbeddingsModel (any word embeddings can be chosen, e.g. BertEmbeddings for BERT-based embeddings).
Setting a test dataset to monitor model metrics can be done with .setTestDataset
. The method
expects a path to a parquet file containing a dataframe that has the same required columns as
the training dataframe. The pre-processing steps for the training dataframe should also be
applied to the test dataframe. The following example will show how to create the test dataset
with a CoNLL dataset:
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = WordEmbeddingsModel
.pretrained()
.setInputCols("document", "token")
.setOutputCol("embeddings")
val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
val conll = CoNLL()
val Array(train, test) = conll
.readDataset(spark, "src/test/resources/conll2003/eng.train")
.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
.fit(test)
.transform(test)
.write
.mode("overwrite")
.parquet("test_data")
val nerTagger = new NerDLApproach()
.setInputCols("document", "token", "embeddings")
.setLabelColumn("label")
.setOutputCol("ner")
.setTestDataset("test_data")
For extended examples of usage, see the Examples and the NerDLSpec.
Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS
Output Annotator Type: NAMED_ENTITY
Python API: NerDLApproach | Scala API: NerDLApproach | Source: NerDLApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# First extract the prerequisites for the NerDLApproach
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
# Then the training can start
nerTagger = NerDLApproach() \
.setInputCols(["sentence", "token", "embeddings"]) \
.setLabelColumn("label") \
.setOutputCol("ner") \
.setMaxEpochs(1) \
.setRandomSeed(0) \
.setVerbose(0)
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
])
# We use the text and labels from the CoNLL dataset
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
pipelineModel = pipeline.fit(trainingData)
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline
// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
// Then the training can start
val nerTagger = new NerDLApproach()
.setInputCols("sentence", "token", "embeddings")
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(1)
.setRandomSeed(0)
.setVerbose(0)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
))
// We use the text and labels from the CoNLL dataset
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
val pipelineModel = pipeline.fit(trainingData)
NerOverwriter
Overwrites entities of specified strings.
The input for this Annotator has to be entities that are already extracted, of Annotator type NAMED_ENTITY
.
The strings specified with setStopWords
will have new entities assigned to them, as specified with setNewResult
.
Input Annotator Types: NAMED_ENTITY
Output Annotator Type: NAMED_ENTITY
Python API: NerOverwriter | Scala API: NerOverwriter | Source: NerOverwriter |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# First extract the prerequisite Entities
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("bert")
nerTagger = NerDLModel.pretrained() \
.setInputCols(["sentence", "token", "bert"]) \
.setOutputCol("ner")
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
])
data = spark.createDataFrame([["Spark NLP Crosses Five Million Downloads, John Snow Labs Announces."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(ner)").show(truncate=False)
# +------------------------------------------------------+
# |col |
# +------------------------------------------------------+
# |[named_entity, 0, 4, B-ORG, [word -> Spark], []] |
# |[named_entity, 6, 8, I-ORG, [word -> NLP], []] |
# |[named_entity, 10, 16, O, [word -> Crosses], []] |
# |[named_entity, 18, 21, O, [word -> Five], []] |
# |[named_entity, 23, 29, O, [word -> Million], []] |
# |[named_entity, 31, 39, O, [word -> Downloads], []] |
# |[named_entity, 40, 40, O, [word -> ,], []] |
# |[named_entity, 42, 45, B-ORG, [word -> John], []] |
# |[named_entity, 47, 50, I-ORG, [word -> Snow], []] |
# |[named_entity, 52, 55, I-ORG, [word -> Labs], []] |
# |[named_entity, 57, 65, I-ORG, [word -> Announces], []]|
# |[named_entity, 66, 66, O, [word -> .], []] |
# +------------------------------------------------------+
# The recognized entities can then be overwritten
nerOverwriter = NerOverwriter() \
.setInputCols(["ner"]) \
.setOutputCol("ner_overwritten") \
.setStopWords(["Million"]) \
.setNewResult("B-CARDINAL")
nerOverwriter.transform(result).selectExpr("explode(ner_overwritten)").show(truncate=False)
+---------------------------------------------------------+
|col |
+---------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []] |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []] |
|[named_entity, 10, 16, O, [word -> Crosses], []] |
|[named_entity, 18, 21, O, [word -> Five], []] |
|[named_entity, 23, 29, B-CARDINAL, [word -> Million], []]|
|[named_entity, 31, 39, O, [word -> Downloads], []] |
|[named_entity, 40, 40, O, [word -> ,], []] |
|[named_entity, 42, 45, B-ORG, [word -> John], []] |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []] |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []] |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []] |
|[named_entity, 66, 66, O, [word -> .], []] |
+---------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import com.johnsnowlabs.nlp.annotators.ner.NerOverwriter
import org.apache.spark.ml.Pipeline
// First extract the prerequisite Entities
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("bert")
val nerTagger = NerDLModel.pretrained()
.setInputCols("sentence", "token", "bert")
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
))
val data = Seq("Spark NLP Crosses Five Million Downloads, John Snow Labs Announces.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(ner)").show(false)
/*
+------------------------------------------------------+
|col |
+------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []] |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []] |
|[named_entity, 10, 16, O, [word -> Crosses], []] |
|[named_entity, 18, 21, O, [word -> Five], []] |
|[named_entity, 23, 29, O, [word -> Million], []] |
|[named_entity, 31, 39, O, [word -> Downloads], []] |
|[named_entity, 40, 40, O, [word -> ,], []] |
|[named_entity, 42, 45, B-ORG, [word -> John], []] |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []] |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []] |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []]|
|[named_entity, 66, 66, O, [word -> .], []] |
+------------------------------------------------------+
*/
// The recognized entities can then be overwritten
val nerOverwriter = new NerOverwriter()
.setInputCols("ner")
.setOutputCol("ner_overwritten")
.setStopWords(Array("Million"))
.setNewResult("B-CARDINAL")
nerOverwriter.transform(result).selectExpr("explode(ner_overwritten)").show(false)
+---------------------------------------------------------+
|col |
+---------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []] |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []] |
|[named_entity, 10, 16, O, [word -> Crosses], []] |
|[named_entity, 18, 21, O, [word -> Five], []] |
|[named_entity, 23, 29, B-CARDINAL, [word -> Million], []]|
|[named_entity, 31, 39, O, [word -> Downloads], []] |
|[named_entity, 40, 40, O, [word -> ,], []] |
|[named_entity, 42, 45, B-ORG, [word -> John], []] |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []] |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []] |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []] |
|[named_entity, 66, 66, O, [word -> .], []] |
+---------------------------------------------------------+
Normalizer
Annotator that cleans out tokens. Requires stems, hence tokens. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary
For extended examples of usage, see the Examples.
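A rough sketch of the dictionary-based transformation, assuming a slang/correction dictionary in a comma-delimited file and a setSlangDictionary(path, delimiter) setter (both the file and the setter signature are assumptions here, so verify against the API reference):
import com.johnsnowlabs.nlp.annotator.Normalizer
// Sketch only: "slangs.txt" is a hypothetical file where each line maps a raw form
// to its replacement, e.g. "gr8,great".
val normalizerWithDict = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(true)
  .setSlangDictionary("src/test/resources/spell/slangs.txt", ",")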
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: Normalizer | Scala API: Normalizer | Source: Normalizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized") \
.setLowercase(True) \
.setCleanupPatterns(["""[^\w\d\s]"""]) # remove punctuations (keep alphanumeric chars)
# if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
normalizer
])
data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much."]]) \
.toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = False)
+----------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Normalizer, Tokenizer}
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
.setLowercase(true)
.setCleanupPatterns(Array("""[^\w\d\s]""")) // remove punctuations (keep alphanumeric chars)
// if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
normalizer
))
val data = Seq("John and Peter are brothers. However they don't support each other that much.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = false)
+----------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
NorvigSweeting Spellchecker
Trains an annotator that retrieves tokens and makes corrections automatically if they are not found in an English dictionary.
The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and
dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster
(than the standard approach with deletes + transposes + replaces + inserts) and language independent.
A dictionary of correct spellings must be provided with setDictionary
as a text file, where each word is parsed by a regex pattern.
Inspired by the Norvig model and SymSpell.
For instantiated/pretrained models, see NorvigSweetingModel.
For extended examples of usage, see the NorvigSweetingTestSpec.
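For comparison, the pretrained model can be used directly without providing a dictionary (pretrained() with no arguments downloads the default English model):
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
// Sketch: a pretrained spell checker instead of a dictionary-trained approach.
val pretrainedSpellChecker = NorvigSweetingModel.pretrained()
  .setInputCols("token")
  .setOutputCol("spell")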
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: NorvigSweetingApproach | Scala API: NorvigSweetingApproach | Source: NorvigSweetingApproach |
Show Example
# In this example, the dictionary `"words.txt"` has the form of
#
# ...
# gummy
# gummic
# gummier
# gummiest
# gummiferous
# ...
#
# This dictionary is then set to be the basis of the spell checker.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
spellChecker = NorvigSweetingApproach() \
.setInputCols(["token"]) \
.setOutputCol("spell") \
.setDictionary("src/test/resources/spell/words.txt")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
pipelineModel = pipeline.fit(trainingData)
// In this example, the dictionary `"words.txt"` has the form of
//
// ...
// gummy
// gummic
// gummier
// gummiest
// gummiferous
// ...
//
// This dictionary is then set to be the basis of the spell checker.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val spellChecker = new NorvigSweetingApproach()
.setInputCols("token")
.setOutputCol("spell")
.setDictionary("src/test/resources/spell/words.txt")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
spellChecker
))
val pipelineModel = pipeline.fit(trainingData)
POSTagger (Part of speech tagger)
Trains an averaged Perceptron model to tag words with their part of speech. Sets a POS tag to each word within a sentence.
For pretrained models please see the PerceptronModel.
The training data needs to be in a Spark DataFrame, where the column needs to consist of
Annotations of type POS
. The Annotation
needs to have its member result
set to the POS tag and a "word" key
mapping to the corresponding word inside its member metadata
.
This DataFrame for training can easily be created by the helper class POS.
POS().readDataset(spark, datasetPath).selectExpr("explode(tags) as tags").show(false)
+---------------------------------------------+
|tags |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []] |
|[pos, 7, 12, NNP, [word -> Vinken], []] |
|[pos, 14, 14, ,, [word -> ,], []] |
|[pos, 31, 34, MD, [word -> will], []] |
|[pos, 36, 39, VB, [word -> join], []] |
|[pos, 41, 43, DT, [word -> the], []] |
|[pos, 45, 49, NN, [word -> board], []] |
...
For extended examples of usage, see the Examples and PerceptronApproach tests.
Input Annotator Types: TOKEN, DOCUMENT
Output Annotator Type: POS
Python API: PerceptronApproach | Scala API: PerceptronApproach | Source: PerceptronApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
trainingPerceptronDF = POS().readDataset(spark, datasetPath)
trainedPos = PerceptronApproach() \
.setInputCols(["document", "token"]) \
.setOutputCol("pos") \
.setPosColumn("tags") \
.fit(trainingPerceptronDF)
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
trainedPos
])
data = spark.createDataFrame([["To be or not to be, is this the question?"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("pos.result").show(truncate=False)
+--------------------------------------------------+
|result |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
val trainingPerceptronDF = POS().readDataset(spark, datasetPath)
val trainedPos = new PerceptronApproach()
.setInputCols("document", "token")
.setOutputCol("pos")
.setPosColumn("tags")
.fit(trainingPerceptronDF)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
trainedPos
))
val data = Seq("To be or not to be, is this the question?").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("pos.result").show(false)
+--------------------------------------------------+
|result |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
PromptAssembler
Assembles a sequence of messages into a single string using a template. These strings can then be used as prompts for large language models.
This annotator expects an array of two-tuples as the type of the input column (one array of tuples per row). The first element of the tuples should be the role and the second element is the text of the message. Possible roles are “system”, “user” and “assistant”.
An assistant header can be added to the end of the generated string by using
setAddAssistant(true)
.
At the moment, this annotator uses llama.cpp as a backend to parse and apply the templates. llama.cpp uses basic pattern matching to determine the type of the template, then applies a basic version of the template to the messages. This means that more advanced templates are not supported.
For an extended example see the example notebook.
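A brief sketch of the Scala side (the chat template string is elided here; see the llama3.1 template in the Python example below):
import com.johnsnowlabs.nlp.PromptAssembler
// Sketch: assemble chat messages into a single prompt and append an assistant header.
val template: String = ??? // placeholder for a chat template string
val promptAssembler = new PromptAssembler()
  .setInputCol("messages")
  .setOutputCol("prompt")
  .setChatTemplate(template)
  .setAddAssistant(true)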
Input Annotator Types: NONE
Output Annotator Type: DOCUMENT
Python API: PromptAssembler | Scala API: PromptAssembler | Source: PromptAssembler |
Show Example
from sparknlp.base import *
messages = [
[
("system", "You are a helpful assistant."),
("assistant", "Hello there, how can I help you?"),
("user", "I need help with organizing my room."),
]
]
df = spark.createDataFrame([messages]).toDF("messages")
# llama3.1
template = (
"{{- bos_token }} {%- if custom_tools is defined %} {%- set tools = custom_tools %} {%- "
"endif %} {%- if not tools_in_user_message is defined %} {%- set tools_in_user_message = true %} {%- "
'endif %} {%- if not date_string is defined %} {%- set date_string = "26 Jul 2024" %} {%- endif %} '
"{%- if not tools is defined %} {%- set tools = none %} {%- endif %} {#- This block extracts the "
"system message, so we can slot it into the right place. #} {%- if messages[0]['role'] == 'system' %}"
" {%- set system_message = messages[0]['content']|trim %} {%- set messages = messages[1:] %} {%- else"
' %} {%- set system_message = "" %} {%- endif %} {#- System message + builtin tools #} {{- '
'"<|start_header_id|>system<|end_header_id|>\\n\n" }} {%- if builtin_tools is defined or tools is '
'not none %} {{- "Environment: ipython\\n" }} {%- endif %} {%- if builtin_tools is defined %} {{- '
'"Tools: " + builtin_tools | reject(\'equalto\', \'code_interpreter\') | join(", ") + "\\n\n"}} '
'{%- endif %} {{- "Cutting Knowledge Date: December 2023\\n" }} {{- "Today Date: " + date_string '
'+ "\\n\n" }} {%- if tools is not none and not tools_in_user_message %} {{- "You have access to '
'the following functions. To call a function, please respond with JSON for a function call." }} {{- '
'\'Respond in the format {"name": function name, "parameters": dictionary of argument name and its'
' value}.\' }} {{- "Do not use variables.\\n\n" }} {%- for t in tools %} {{- t | tojson(indent=4) '
'}} {{- "\\n\n" }} {%- endfor %} {%- endif %} {{- system_message }} {{- "<|eot_id|>" }} {#- '
"Custom tools are passed in a user message with some extra guidance #} {%- if tools_in_user_message "
"and not tools is none %} {#- Extract the first user message so we can plug it in here #} {%- if "
"messages | length != 0 %} {%- set first_user_message = messages[0]['content']|trim %} {%- set "
'messages = messages[1:] %} {%- else %} {{- raise_exception("Cannot put tools in the first user '
"message when there's no first user message!\") }} {%- endif %} {{- "
"'<|start_header_id|>user<|end_header_id|>\\n\n' -}} {{- \"Given the following functions, please "
'respond with a JSON for a function call " }} {{- "with its proper arguments that best answers the '
'given prompt.\\n\n" }} {{- \'Respond in the format {"name": function name, "parameters": '
'dictionary of argument name and its value}.\' }} {{- "Do not use variables.\\n\n" }} {%- for t in '
'tools %} {{- t | tojson(indent=4) }} {{- "\\n\n" }} {%- endfor %} {{- first_user_message + '
"\"<|eot_id|>\"}} {%- endif %} {%- for message in messages %} {%- if not (message.role == 'ipython' "
"or message.role == 'tool' or 'tool_calls' in message) %} {{- '<|start_header_id|>' + message['role']"
" + '<|end_header_id|>\\n\n'+ message['content'] | trim + '<|eot_id|>' }} {%- elif 'tool_calls' in "
'message %} {%- if not message.tool_calls|length == 1 %} {{- raise_exception("This model only '
'supports single tool-calls at once!") }} {%- endif %} {%- set tool_call = message.tool_calls[0]'
".function %} {%- if builtin_tools is defined and tool_call.name in builtin_tools %} {{- "
"'<|start_header_id|>assistant<|end_header_id|>\\n\n' -}} {{- \"<|python_tag|>\" + tool_call.name + "
'".call(" }} {%- for arg_name, arg_val in tool_call.arguments | items %} {{- arg_name + \'="\' + '
'arg_val + \'"\' }} {%- if not loop.last %} {{- ", " }} {%- endif %} {%- endfor %} {{- ")" }} {%- '
"else %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\n' -}} {{- '{\"name\": \"' + "
'tool_call.name + \'", \' }} {{- \'"parameters": \' }} {{- tool_call.arguments | tojson }} {{- "}" '
"}} {%- endif %} {%- if builtin_tools is defined %} {#- This means we're in ipython mode #} {{- "
'"<|eom_id|>" }} {%- else %} {{- "<|eot_id|>" }} {%- endif %} {%- elif message.role == "tool" '
'or message.role == "ipython" %} {{- "<|start_header_id|>ipython<|end_header_id|>\\n\n" }} {%- '
"if message.content is mapping or message.content is iterable %} {{- message.content | tojson }} {%- "
'else %} {{- message.content }} {%- endif %} {{- "<|eot_id|>" }} {%- endif %} {%- endfor %} {%- if '
"add_generation_prompt %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\n' }} {%- endif %} "
)
prompt_assembler = (
PromptAssembler()
.setInputCol("messages")
.setOutputCol("prompt")
.setChatTemplate(template)
)
prompt_assembler.transform(df).select("prompt.result").show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello there, how can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI need help with organizing my room.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
// Batches (whole conversations) of arrays of messages
val data: Seq[Seq[(String, String)]] = Seq(
Seq(
("system", "You are a helpful assistant."),
("assistant", "Hello there, how can I help you?"),
("user", "I need help with organizing my room.")))
val dataDF = data.toDF("messages")
// llama3.1
val template =
"{{- bos_token }} {%- if custom_tools is defined %} {%- set tools = custom_tools %} {%- " +
"endif %} {%- if not tools_in_user_message is defined %} {%- set tools_in_user_message = true %} {%- " +
"endif %} {%- if not date_string is defined %} {%- set date_string = \"26 Jul 2024\" %} {%- endif %} " +
"{%- if not tools is defined %} {%- set tools = none %} {%- endif %} {#- This block extracts the " +
"system message, so we can slot it into the right place. #} {%- if messages[0]['role'] == 'system' %}" +
" {%- set system_message = messages[0]['content']|trim %} {%- set messages = messages[1:] %} {%- else" +
" %} {%- set system_message = \"\" %} {%- endif %} {#- System message + builtin tools #} {{- " +
"\"<|start_header_id|>system<|end_header_id|>\\n\\n\" }} {%- if builtin_tools is defined or tools is " +
"not none %} {{- \"Environment: ipython\\n\" }} {%- endif %} {%- if builtin_tools is defined %} {{- " +
"\"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}} " +
"{%- endif %} {{- \"Cutting Knowledge Date: December 2023\\n\" }} {{- \"Today Date: \" + date_string " +
"+ \"\\n\\n\" }} {%- if tools is not none and not tools_in_user_message %} {{- \"You have access to " +
"the following functions. To call a function, please respond with JSON for a function call.\" }} {{- " +
"'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its" +
" value}.' }} {{- \"Do not use variables.\\n\\n\" }} {%- for t in tools %} {{- t | tojson(indent=4) " +
"}} {{- \"\\n\\n\" }} {%- endfor %} {%- endif %} {{- system_message }} {{- \"<|eot_id|>\" }} {#- " +
"Custom tools are passed in a user message with some extra guidance #} {%- if tools_in_user_message " +
"and not tools is none %} {#- Extract the first user message so we can plug it in here #} {%- if " +
"messages | length != 0 %} {%- set first_user_message = messages[0]['content']|trim %} {%- set " +
"messages = messages[1:] %} {%- else %} {{- raise_exception(\"Cannot put tools in the first user " +
"message when there's no first user message!\") }} {%- endif %} {{- " +
"'<|start_header_id|>user<|end_header_id|>\\n\\n' -}} {{- \"Given the following functions, please " +
"respond with a JSON for a function call \" }} {{- \"with its proper arguments that best answers the " +
"given prompt.\\n\\n\" }} {{- 'Respond in the format {\"name\": function name, \"parameters\": " +
"dictionary of argument name and its value}.' }} {{- \"Do not use variables.\\n\\n\" }} {%- for t in " +
"tools %} {{- t | tojson(indent=4) }} {{- \"\\n\\n\" }} {%- endfor %} {{- first_user_message + " +
"\"<|eot_id|>\"}} {%- endif %} {%- for message in messages %} {%- if not (message.role == 'ipython' " +
"or message.role == 'tool' or 'tool_calls' in message) %} {{- '<|start_header_id|>' + message['role']" +
" + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }} {%- elif 'tool_calls' in " +
"message %} {%- if not message.tool_calls|length == 1 %} {{- raise_exception(\"This model only " +
"supports single tool-calls at once!\") }} {%- endif %} {%- set tool_call = message.tool_calls[0]" +
".function %} {%- if builtin_tools is defined and tool_call.name in builtin_tools %} {{- " +
"'<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}} {{- \"<|python_tag|>\" + tool_call.name + " +
"\".call(\" }} {%- for arg_name, arg_val in tool_call.arguments | items %} {{- arg_name + '=\"' + " +
"arg_val + '\"' }} {%- if not loop.last %} {{- \", \" }} {%- endif %} {%- endfor %} {{- \")\" }} {%- " +
"else %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}} {{- '{\"name\": \"' + " +
"tool_call.name + '\", ' }} {{- '\"parameters\": ' }} {{- tool_call.arguments | tojson }} {{- \"}\" " +
"}} {%- endif %} {%- if builtin_tools is defined %} {#- This means we're in ipython mode #} {{- " +
"\"<|eom_id|>\" }} {%- else %} {{- \"<|eot_id|>\" }} {%- endif %} {%- elif message.role == \"tool\" " +
"or message.role == \"ipython\" %} {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }} {%- " +
"if message.content is mapping or message.content is iterable %} {{- message.content | tojson }} {%- " +
"else %} {{- message.content }} {%- endif %} {{- \"<|eot_id|>\" }} {%- endif %} {%- endfor %} {%- if " +
"add_generation_prompt %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }} {%- endif %} "
val promptAssembler = new PromptAssembler()
.setInputCol("messages")
.setOutputCol("prompt")
.setChatTemplate(template)
promptAssembler.transform(dataDF).select("prompt.result").show(truncate = false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello there, how can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI need help with organizing my room.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
RecursiveTokenizer
Tokenizes raw text recursively based on a handful of definable rules.
Unlike the Tokenizer, the RecursiveTokenizer operates based on these array string parameters only (see the sketch below):
- prefixes: Strings that will be split when found at the beginning of a token.
- suffixes: Strings that will be split when found at the end of a token.
- infixes: Strings that will be split when found in the middle of a token.
- whitelist: Whitelist of strings not to split.
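For instance, a minimal Python sketch of setting these parameters (the rule values are hypothetical and should be adapted to your data):
from sparknlp.annotator import RecursiveTokenizer
# Hypothetical rule sets, for illustration only
tokenizer = RecursiveTokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token") \
.setPrefixes(["\"", "(", "[", "\n"]) \
.setSuffixes([".", ",", "?", ")", "]", "\n"]) \
.setInfixes(["\n", "(", ")"]) \
.setWhitelist(["it's", "don't", "won't"])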
For extended examples of usage, see the Examples and the TokenizerTestSpec.
Input Annotator Types: DOCUMENT
Output Annotator Type: TOKEN
Python API: RecursiveTokenizer | Scala API: RecursiveTokenizer | Source: RecursiveTokenizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = RecursiveTokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer
])
data = spark.createDataFrame([["One, after the Other, (and) again. PO, QAM,"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("token.result").show(truncate=False)
+------------------------------------------------------------------+
|result |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.RecursiveTokenizer
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new RecursiveTokenizer()
.setInputCols("document")
.setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer
))
val data = Seq("One, after the Other, (and) again. PO, QAM,").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("token.result").show(false)
+------------------------------------------------------------------+
|result |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+
RegexMatcher
Uses rules to match a set of regular expressions and associate them with a provided identifier.
A rule consists of a regex pattern and an identifier, delimited by a character of choice. For example, the rule "\d{4}\/\d\d\/\d\d,date" (delimited by ",") will match strings like "1970/01/01" and associate them with the identifier "date".
Rules must be provided either with setRules (followed by setDelimiter), as shown below, or from an external file. To use an external file, a dictionary of predefined regular expressions must be provided with setExternalRules. The dictionary can be set as a delimited text file.
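A minimal Python sketch of the in-code variant (the rule strings are illustrative only):
# Each rule is "pattern<delimiter>identifier"
rules = ["\\d{4}/\\d\\d/\\d\\d,date", "\\d{2}/\\d\\d/\\d\\d,short_date"]
regexMatcher = RegexMatcher() \
.setRules(rules) \
.setDelimiter(",") \
.setInputCols(["sentence"]) \
.setOutputCol("regex") \
.setStrategy("MATCH_ALL")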
Pretrained pipelines are available for this module, see Pipelines.
For extended examples of usage, see the Examples and the RegexMatcherTestSpec.
Input Annotator Types: DOCUMENT
Output Annotator Type: CHUNK
Python API: RegexMatcher | Scala API: RegexMatcher | Source: RegexMatcher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, the `rules.txt` has the form of
#
# the\s\w+, followed by 'the'
# ceremonies, ceremony
#
# where each regex is separated by the identifier by `","`
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
regexMatcher = RegexMatcher() \
.setExternalRules("src/test/resources/regex-matcher/rules.txt", ",") \
.setInputCols(["sentence"]) \
.setOutputCol("regex") \
.setStrategy("MATCH_ALL")
pipeline = Pipeline().setStages([documentAssembler, sentence, regexMatcher])
data = spark.createDataFrame([[
"My first sentence with the first rule. This is my second sentence with ceremonies rule."
]]).toDF("text")
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(regex) as result").show(truncate=False)
+--------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []] |
+--------------------------------------------------------------------------------------------+
// In this example, the `rules.txt` has the form of
//
// the\s\w+, followed by 'the'
// ceremonies, ceremony
//
// where each regex is separated by the identifier by `","`
import ResourceHelper.spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.RegexMatcher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val regexMatcher = new RegexMatcher()
.setExternalRules("src/test/resources/regex-matcher/rules.txt", ",")
.setInputCols(Array("sentence"))
.setOutputCol("regex")
.setStrategy("MATCH_ALL")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher))
val data = Seq(
"My first sentence with the first rule. This is my second sentence with ceremonies rule."
).toDF("text")
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(regex) as result").show(false)
+--------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []] |
+--------------------------------------------------------------------------------------------+
RegexTokenizer
A tokenizer that splits text by a regex pattern.
The delimiting pattern, i.e. how the tokens should be split, needs to be set with setPattern. By default this pattern is \s+, which means that tokens are split by one or more whitespace characters.
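Any other delimiting regex can be supplied instead; for example, a sketch that splits on commas as well as whitespace (the pattern is chosen purely for illustration):
regexTokenizer = RegexTokenizer() \
.setInputCols(["document"]) \
.setOutputCol("regexToken") \
.setPattern("[,\\s]+")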
Input Annotator Types: DOCUMENT
Output Annotator Type: TOKEN
Python API: RegexTokenizer | Scala API: RegexTokenizer | Source: RegexTokenizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
regexTokenizer = RegexTokenizer() \
.setInputCols(["document"]) \
.setOutputCol("regexToken") \
.setToLowercase(True) \
.setPattern("\\s+")
pipeline = Pipeline().setStages([
documentAssembler,
regexTokenizer
])
data = spark.createDataFrame([["This is my first sentence.\nThis is my second."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("regexToken.result").show(truncate=False)
+-------------------------------------------------------+
|result |
+-------------------------------------------------------+
|[this, is, my, first, sentence., this, is, my, second.]|
+-------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.RegexTokenizer
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val regexTokenizer = new RegexTokenizer()
.setInputCols("document")
.setOutputCol("regexToken")
.setToLowercase(true)
.setPattern("\\s+")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
regexTokenizer
))
val data = Seq("This is my first sentence.\nThis is my second.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("regexToken.result").show(false)
+-------------------------------------------------------+
|result |
+-------------------------------------------------------+
|[this, is, my, first, sentence., this, is, my, second.]|
+-------------------------------------------------------+
SentenceDetector
Annotator that detects sentence boundaries using regular expressions.
The following characters are checked as sentence boundaries:
- Lists (“(i), (ii)”, “(a), (b)”, “1., 2.”)
- Numbers
- Abbreviations
- Punctuations
- Multiple Periods
- Geo-Locations/Coordinates (“N°. 1026.253.553.”)
- Ellipsis (“…”)
- In-between punctuations
- Quotation marks
- Exclamation Points
- Basic Breakers (“.”, “;”)
For the explicit regular expressions used for detection, refer to source of PragmaticContentFormatter.
To add additional custom bounds, the parameter customBounds
can be set with an array:
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setCustomBounds(Array("\n\n"))
If only the custom bounds should be used, then the parameter useCustomBoundsOnly should be set to true.
Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to true.
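A minimal Python sketch combining these options (column names follow the example below):
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setCustomBounds(["\n\n"]) \
.setUseCustomBoundsOnly(True) \
.setExplodeSentences(True)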
For extended examples of usage, see the Examples.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: SentenceDetector | Scala API: SentenceDetector | Source: SentenceDetector |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setCustomBounds(["\n\n"])
pipeline = Pipeline().setStages([
documentAssembler,
sentence
])
data = spark.createDataFrame([["This is my first sentence. This my second. How about a third?"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(sentence) as sentences").show(truncate=False)
+------------------------------------------------------------------+
|sentences |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []] |
|[document, 43, 60, How about a third?, [sentence -> 2], []] |
+------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setCustomBounds(Array("\n\n"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence
))
val data = Seq("This is my first sentence. This my second. How about a third?").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(sentence) as sentences").show(false)
+------------------------------------------------------------------+
|sentences |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []] |
|[document, 43, 60, How about a third?, [sentence -> 2], []] |
+------------------------------------------------------------------+
SentenceDetectorDL
Trains an annotator that detects sentence boundaries using a deep learning approach.
For pretrained models see SentenceDetectorDLModel.
Currently, only the CNN model is supported for training, but in the future the architecture of the model can be set with setModelArchitecture.
The default model "cnn" is based on the paper Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection (2020, Stefan Schweter, Sajawel Ahmed), using a CNN architecture. We also modified the original implementation slightly to cover broken sentences and some impossible end-of-line characters.
Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to true.
For extended examples of usage, see the Examples and the SentenceDetectorDLSpec.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: SentenceDetectorDLApproach | Scala API: SentenceDetectorDLApproach | Source: SentenceDetectorDLApproach |
Show Example
# The training process needs data, where each data point is a sentence.
# In this example the `train.txt` file has the form of
#
# ...
# Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
# His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
# ...
#
# where each line is one sentence.
# Training can then be started like so:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
trainingData = spark.read.text("train.txt").toDF("text")
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLApproach() \
.setInputCols(["document"]) \
.setOutputCol("sentences") \
.setEpochsNumber(100)
pipeline = Pipeline().setStages([documentAssembler, sentenceDetector])
model = pipeline.fit(trainingData)
// The training process needs data, where each data point is a sentence.
// In this example the `train.txt` file has the form of
//
// ...
// Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
// His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
// ...
//
// where each line is one sentence.
// Training can then be started like so:
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLApproach
import org.apache.spark.ml.Pipeline
val trainingData = spark.read.text("train.txt").toDF("text")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetectorDLApproach()
.setInputCols(Array("document"))
.setOutputCol("sentences")
.setEpochsNumber(100)
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector))
val model = pipeline.fit(trainingData)
SentenceEmbeddings
Converts the results from WordEmbeddings, BertEmbeddings, or ElmoEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).
This can be configured with setPoolingStrategy, which can be either "AVERAGE" or "SUM".
For more extended examples see the Examples and the SentenceEmbeddingsTestSpec.
TIP: Here is how you can explode and convert these embeddings into Vectors, or what’s known as a Feature column, so they can be used in Spark ML regression or clustering functions:
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import explode, udf
# Let's create a UDF to take an array of embeddings and output Vectors
@udf(returnType=VectorUDT())
def convertToVectorUDF(matrix):
    return Vectors.dense([float(x) for x in matrix])
# Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode("sentence_embeddings.embeddings").alias("sentence_embedding")) \
.withColumn("features", convertToVectorUDF("sentence_embedding"))
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{explode, udf}
// Let's create a UDF to take array of embeddings and output Vectors
val convertToVectorUDF = udf((matrix : Seq[Float]) => {
Vectors.dense(matrix.toArray.map(_.toDouble))
})
// Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode($"sentence_embeddings.embeddings").as("sentence_embedding"))
.withColumn("features", convertToVectorUDF($"sentence_embedding"))
Input Annotator Types: DOCUMENT, WORD_EMBEDDINGS
Output Annotator Type: SENTENCE_EMBEDDINGS
Note: If you choose document as your input for Tokenizer, WordEmbeddings/BertEmbeddings, and SentenceEmbeddings, then it averages/sums all the embeddings into one array of embeddings. However, if you choose sentence as inputCols, then SentenceEmbeddings generates one array of embeddings for each sentence.
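For the per-sentence case, a minimal sketch of the relevant pipeline stages (a variation on the example below, not part of the original example):
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
embeddingsSentence = SentenceEmbeddings() \
.setInputCols(["sentence", "embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")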
Python API: SentenceEmbeddings | Scala API: SentenceEmbeddings | Source: SentenceEmbeddings |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
embeddingsSentence = SentenceEmbeddings() \
.setInputCols(["document", "embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["sentence_embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsSentence,
embeddingsFinisher
])
data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.embeddings.SentenceEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("document", "token")
.setOutputCol("embeddings")
val embeddingsSentence = new SentenceEmbeddings()
.setInputCols(Array("document", "embeddings"))
.setOutputCol("sentence_embeddings")
.setPoolingStrategy("AVERAGE")
val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("sentence_embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsSentence,
embeddingsFinisher
))
val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+
SentimentDL
Trains a SentimentDL, an annotator for multi-class sentiment analysis.
In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is classifying whether a product review or a tweet is positive or negative.
For the instantiated/pretrained models, see SentimentDLModel.
Notes:
- This annotator accepts a label column of a single item in either type of String, Int, Float, or Double. So positive sentiment can be expressed as either "positive" or 0, negative sentiment as "negative" or 1.
- UniversalSentenceEncoder, BertSentenceEmbeddings, SentenceEmbeddings or other sentence-based embeddings can be used.
Setting a test dataset to monitor model metrics can be done with setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example shows how to create the test dataset:
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
.fit(test)
.transform(test)
.write
.mode("overwrite")
.parquet("test_data")
val classifier = new SentimentDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("sentiment")
.setLabelColumn("label")
.setTestDataset("test_data")
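A roughly equivalent sketch in Python, assuming the same column names and an input DataFrame named data (not part of the original Scala example):
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
preProcessingPipeline = Pipeline().setStages([documentAssembler, embeddings])
train, test = data.randomSplit([0.8, 0.2])
# Write the pre-processed test split to parquet so the approach can read it during training
preProcessingPipeline.fit(test).transform(test) \
.write.mode("overwrite").parquet("test_data")
classifier = SentimentDLApproach() \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("sentiment") \
.setLabelColumn("label") \
.setTestDataset("test_data")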
For extended examples of usage, see the Examples and the SentimentDLTestSpec.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: CATEGORY
Python API: SentimentDLApproach | Scala API: SentimentDLApproach | Source: SentimentDLApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, `sentiment.csv` is in the form
#
# text,label
# This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
# This was a terrible movie! The acting was bad really bad!,1
#
# The model can then be trained with
smallCorpus = spark.read.option("header", "True").csv("src/test/resources/classifier/sentiment.csv")
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
useEmbeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
docClassifier = SentimentDLApproach() \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("sentiment") \
.setLabelColumn("label") \
.setBatchSize(32) \
.setMaxEpochs(1) \
.setLr(5e-3) \
.setDropout(0.5)
pipeline = Pipeline() \
.setStages(
[
documentAssembler,
useEmbeddings,
docClassifier
]
)
pipelineModel = pipeline.fit(smallCorpus)
// In this example, `sentiment.csv` is in the form
//
// text,label
// This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
// This was a terrible movie! The acting was bad really bad!,1
//
// The model can then be trained with
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.{SentimentDLApproach, SentimentDLModel}
import org.apache.spark.ml.Pipeline
val smallCorpus = spark.read.option("header", "true").csv("src/test/resources/classifier/sentiment.csv")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val useEmbeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val docClassifier = new SentimentDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("sentiment")
.setLabelColumn("label")
.setBatchSize(32)
.setMaxEpochs(1)
.setLr(5e-3f)
.setDropout(0.5f)
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
useEmbeddings,
docClassifier
)
)
val pipelineModel = pipeline.fit(smallCorpus)
SentimentDetector
Trains a rule-based sentiment detector, which calculates a score based on predefined keywords.
A dictionary of predefined sentiment keywords must be provided with setDictionary, where each line is a word delimited to its class (either positive or negative). The dictionary can be set as a delimited text file.
By default, the sentiment score will be assigned the label "positive" if the score is >= 0, else "negative".
To retrieve the raw sentiment scores, enableScore needs to be set to true (see the sketch below).
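For instance, a sketch of the raw-score variant in Python (dictionary path and column names as in the example below):
sentimentDetector = SentimentDetector() \
.setInputCols(["lemma", "document"]) \
.setOutputCol("sentimentScore") \
.setDictionary("default-sentiment-dict.txt", ",", ReadAs.TEXT) \
.setEnableScore(True)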
For extended examples of usage, see the Examples and the SentimentTestSpec.
Input Annotator Types: TOKEN, DOCUMENT
Output Annotator Type: SENTIMENT
Python API: SentimentDetector | Scala API: SentimentDetector | Source: SentimentDetector |
Show Example
# In this example, the dictionary `default-sentiment-dict.txt` has the form of
#
# ...
# cool,positive
# superb,positive
# bad,negative
# uninspired,negative
# ...
#
# where each sentiment keyword is delimited by `","`.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = Lemmatizer() \
.setInputCols(["token"]) \
.setOutputCol("lemma") \
.setDictionary("lemmas_small.txt", "->", "\t")
sentimentDetector = SentimentDetector() \
.setInputCols(["lemma", "document"]) \
.setOutputCol("sentimentScore") \
.setDictionary("default-sentiment-dict.txt", ",", ReadAs.TEXT)
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
lemmatizer,
sentimentDetector,
])
data = spark.createDataFrame([
["The staff of the restaurant is nice"],
["I recommend others to avoid because it is too expensive"]
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("sentimentScore.result").show(truncate=False)
+----------+ # +------+ for enableScore set to True
|result | # |result|
+----------+ # +------+
|[positive]| # |[1.0] |
|[negative]| # |[-2.0]|
+----------+ # +------+
// In this example, the dictionary `default-sentiment-dict.txt` has the form of
//
// ...
// cool,positive
// superb,positive
// bad,negative
// uninspired,negative
// ...
//
// where each sentiment keyword is delimited by `","`.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotators.Lemmatizer
import com.johnsnowlabs.nlp.annotators.sda.pragmatic.SentimentDetector
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val lemmatizer = new Lemmatizer()
.setInputCols("token")
.setOutputCol("lemma")
.setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
val sentimentDetector = new SentimentDetector()
.setInputCols("lemma", "document")
.setOutputCol("sentimentScore")
.setDictionary("src/test/resources/sentiment-corpus/default-sentiment-dict.txt", ",", ReadAs.TEXT)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
lemmatizer,
sentimentDetector,
))
val data = Seq(
"The staff of the restaurant is nice",
"I recommend others to avoid because it is too expensive"
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("sentimentScore.result").show(false)
+----------+ // +------+ for enableScore set to true
|result | // |result|
+----------+ // +------+
|[positive]| // |[1.0] |
|[negative]| // |[-2.0]|
+----------+ // +------+
Stemmer
Returns hard stems of words, with the objective of retrieving the meaningful part of each word. For extended examples of usage, see the Examples.
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: Stemmer | Scala API: Stemmer | Source: Stemmer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stemmer = Stemmer() \
.setInputCols(["token"]) \
.setOutputCol("stem")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
stemmer
])
data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
.toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("stem.result").show(truncate = False)
+-------------------------------------------------------------+
|result |
+-------------------------------------------------------------+
|[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
+-------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Stemmer, Tokenizer}
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val stemmer = new Stemmer()
.setInputCols("token")
.setOutputCol("stem")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
stemmer
))
val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("stem.result").show(truncate = false)
+-------------------------------------------------------------+
|result |
+-------------------------------------------------------------+
|[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
+-------------------------------------------------------------+
StopWordsCleaner
This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences.
By default, it uses stop words from MLlib's StopWordsRemover.
Stop words can also be defined by explicitly setting them with setStopWords(value: Array[String]), or loaded from pretrained models using pretrained of its companion object.
val stopWords = StopWordsCleaner.pretrained()
.setInputCols("token")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
// will load the default pretrained model `"stopwords_en"`.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and StopWordsCleanerTestSpec.
NOTE: If you need to set the stop words from a text file, you can first read and convert it into an array of strings and pass it to setStopWords as follows.
# your stop words text file, each line is one stop word
stopwords = sc.textFile("/tmp/stopwords/english.txt").collect()
# simply use it in StopWordsCleaner
stopWordsCleaner = StopWordsCleaner()\
.setInputCols("token")\
.setOutputCol("cleanTokens")\
.setStopWords(stopwords)\
.setCaseSensitive(False)
# or you can use pretrained models for StopWordsCleaner
stopWordsCleaner = StopWordsCleaner.pretrained() \
.setInputCols("token")\
.setOutputCol("cleanTokens")\
.setCaseSensitive(False)
// your stop words text file, each line is one stop word
val stopwords = sc.textFile("/tmp/stopwords/english.txt").collect()
// simply use it in StopWordsCleaner
val stopWordsCleaner = new StopWordsCleaner()
.setInputCols("token")
.setOutputCol("cleanTokens")
.setStopWords(stopwords)
.setCaseSensitive(false)
// or you can use pretrained models for StopWordsCleaner
val stopWordsCleaner = StopWordsCleaner.pretrained()
.setInputCols("token")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: StopWordsCleaner | Scala API: StopWordsCleaner | Source: StopWordsCleaner |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
stopWords = StopWordsCleaner() \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
tokenizer,
stopWords
])
data = spark.createDataFrame([
["This is my first sentence. This is my second."],
["This is my third sentence. This is my forth."]
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("cleanTokens.result").show(truncate=False)
+-------------------------------+
|result |
+-------------------------------+
|[first, sentence, ., second, .]|
|[third, sentence, ., forth, .] |
+-------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.StopWordsCleaner
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val stopWords = new StopWordsCleaner()
.setInputCols("token")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
stopWords
))
val data = Seq(
"This is my first sentence. This is my second.",
"This is my third sentence. This is my forth."
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("cleanTokens.result").show(false)
+-------------------------------+
|result |
+-------------------------------+
|[first, sentence, ., second, .]|
|[third, sentence, ., forth, .] |
+-------------------------------+
SymmetricDelete Spellchecker
Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.
Inspired by SymSpell.
For instantiated/pretrained models, see SymmetricDeleteModel.
See SymmetricDeleteModelTestSpec for further reference.
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: SymmetricDeleteApproach | Scala API: SymmetricDeleteApproach | Source: SymmetricDeleteApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, the dictionary `"words.txt"` has the form of
#
# ...
# gummy
# gummic
# gummier
# gummiest
# gummiferous
# ...
#
# This dictionary is then set to be the basis of the spell checker.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
spellChecker = SymmetricDeleteApproach() \
.setInputCols(["token"]) \
.setOutputCol("spell") \
.setDictionary("src/test/resources/spell/words.txt")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
pipelineModel = pipeline.fit(trainingData)
// In this example, the dictionary `"words.txt"` has the form of
//
// ...
// gummy
// gummic
// gummier
// gummiest
// gummiferous
// ...
//
// This dictionary is then set to be the basis of the spell checker.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.symmetric.SymmetricDeleteApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val spellChecker = new SymmetricDeleteApproach()
.setInputCols("token")
.setOutputCol("spell")
.setDictionary("src/test/resources/spell/words.txt")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
spellChecker
))
val pipelineModel = pipeline.fit(trainingData)
TextMatcher
Annotator to match exact phrases (by token) provided in a file against a Document.
A text file of predefined phrases must be provided with setEntities.
For extended examples of usage, see the Examples and the TextMatcherTestSpec.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: TextMatcher | Scala API: TextMatcher | Source: TextMatcher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, the entities file is of the form
#
# ...
# dolore magna aliqua
# lorem ipsum dolor. sit
# laborum
# ...
#
# where each line represents an entity phrase to be extracted.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
entityExtractor = TextMatcher() \
.setInputCols(["document", "token"]) \
.setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
.setOutputCol("entity") \
.setCaseSensitive(False)
pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity) as result").show(truncate=False)
+------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []] |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []] |
+------------------------------------------------------------------------------------------+
// In this example, the entities file is of the form
//
// ...
// dolore magna aliqua
// lorem ipsum dolor. sit
// laborum
// ...
//
// where each line represents an entity phrase to be extracted.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.TextMatcher
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
val entityExtractor = new TextMatcher()
.setInputCols("document", "token")
.setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
.setOutputCol("entity")
.setCaseSensitive(false)
.setTokenizer(tokenizer.fit(data))
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity) as result").show(false)
+------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []] |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []] |
+------------------------------------------------------------------------------------------+
Token2Chunk
Converts TOKEN type Annotations to CHUNK type.
This can be useful if entities have already been extracted as TOKEN and following annotators require CHUNK types.
Input Annotator Types: TOKEN
Output Annotator Type: CHUNK
Python API: Token2Chunk | Scala API: Token2Chunk | Source: Token2Chunk |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
token2chunk = Token2Chunk() \
.setInputCols(["token"]) \
.setOutputCol("chunk")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
token2chunk
])
data = spark.createDataFrame([["One Two Three Four"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk) as result").show(truncate=False)
+------------------------------------------+
|result |
+------------------------------------------+
|[chunk, 0, 2, One, [sentence -> 0], []] |
|[chunk, 4, 6, Two, [sentence -> 0], []] |
|[chunk, 8, 12, Three, [sentence -> 0], []]|
|[chunk, 14, 17, Four, [sentence -> 0], []]|
+------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Token2Chunk, Tokenizer}
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token2chunk = new Token2Chunk()
.setInputCols("token")
.setOutputCol("chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
token2chunk
))
val data = Seq("One Two Three Four").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk) as result").show(false)
+------------------------------------------+
|result |
+------------------------------------------+
|[chunk, 0, 2, One, [sentence -> 0], []] |
|[chunk, 4, 6, Two, [sentence -> 0], []] |
|[chunk, 8, 12, Three, [sentence -> 0], []]|
|[chunk, 14, 17, Four, [sentence -> 0], []]|
+------------------------------------------+
TokenAssembler
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.
Requires DOCUMENT and TOKEN type annotations as input.
For more extended examples on document pre-processing see the Examples.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: DOCUMENT
Python API: TokenAssembler | Scala API: TokenAssembler | Source: TokenAssembler |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# First, the text is tokenized and cleaned
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentences")
tokenizer = Tokenizer() \
.setInputCols(["sentences"]) \
.setOutputCol("token")
normalizer = Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized") \
.setLowercase(False)
stopwordsCleaner = StopWordsCleaner() \
.setInputCols(["normalized"]) \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
# Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
tokenAssembler = TokenAssembler() \
.setInputCols(["sentences", "cleanTokens"]) \
.setOutputCol("cleanText")
data = spark.createDataFrame([["Spark NLP is an open-source text processing library for advanced natural language processing."]]) \
.toDF("text")
pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
tokenizer,
normalizer,
stopwordsCleaner,
tokenAssembler
]).fit(data)
result = pipeline.transform(data)
result.select("cleanText").show(truncate=False)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner}
import com.johnsnowlabs.nlp.TokenAssembler
import org.apache.spark.ml.Pipeline
// First, the text is tokenized and cleaned
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
.setLowercase(false)
val stopwordsCleaner = new StopWordsCleaner()
.setInputCols("normalized")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
// Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
val tokenAssembler = new TokenAssembler()
.setInputCols("sentences", "cleanTokens")
.setOutputCol("cleanText")
val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.")
.toDF("text")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
normalizer,
stopwordsCleaner,
tokenAssembler
)).fit(data)
val result = pipeline.transform(data)
result.select("cleanText").show(false)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+
Tokenizer
Tokenizes raw text in document type columns into TokenizedSentence.
This class represents a non-fitted tokenizer. Fitting it will cause the internal RuleFactory to construct the rules for tokenizing from the input configuration.
Identifies tokens using open tokenization standards. A few rules can be customized if the defaults do not fit your needs.
For extended examples of usage, see the Examples and the Tokenizer test class.
Input Annotator Types: DOCUMENT
Output Annotator Type: TOKEN
Note: All these APIs receive regular expressions so please make sure that you escape special characters according to Java conventions.
Python API: Tokenizer | Scala API: Tokenizer | Source: Tokenizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
data = spark.createDataFrame([["I'd like to say we didn't expect that. Jane's boyfriend."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)
pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
result = pipeline.transform(data)
result.selectExpr("token.result").show(truncate=False)
+-----------------------------------------------------------------------+
|output |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
val data = Seq("I'd like to say we didn't expect that. Jane's boyfriend.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val tokenizer = new Tokenizer().setInputCols("document").setOutputCol("token").fit(data)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer)).fit(data)
val result = pipeline.transform(data)
result.selectExpr("token.result").show(false)
+-----------------------------------------------------------------------+
|output |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+
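Since the rule parameters are regular expressions (see the note above), special characters must be escaped following Java conventions. The following is a minimal Python sketch, not part of the original example, showing a few of the customization setters; the parameter values are purely illustrative.
# Illustrative customization of the Tokenizer from the example above (values are assumptions).
# setSplitPattern takes a Java regular expression, so special characters need escaping.
customTokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setSplitPattern("\\s+") \
    .setContextChars(["(", ")", "?", "!", ".", ","]) \
    .setExceptions(["e-mail", "New York"]) \
    .fit(data)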
TypedDependencyParser
Labeled parser that finds a grammatical relation between two words in a sentence. Its input is either a CoNLL2009 or ConllU dataset.
For instantiated/pretrained models, see TypedDependencyParserModel.
Dependency parsers provide information about word relationships. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions.
The parser requires the dependent tokens beforehand, from e.g. the DependencyParser. The required training data can be set in two different ways (only one can be chosen for a particular model):
- Dataset in the CoNLL 2009 format, set with setConll2009
- Dataset in the CoNLL-U format, set with setConllU
Apart from that, no additional training data is needed.
See TypedDependencyParserApproachTestSpec for further reference on this API.
Input Annotator Types: TOKEN, POS, DEPENDENCY
Output Annotator Type: LABELED_DEPENDENCY
Python API: TypedDependencyParserApproach | Scala API: TypedDependencyParserApproach | Source: TypedDependencyParserApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
posTagger = PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
dependencyParser = DependencyParserModel.pretrained() \
.setInputCols(["sentence", "pos", "token"]) \
.setOutputCol("dependency")
typedDependencyParser = TypedDependencyParserApproach() \
.setInputCols(["dependency", "pos", "token"]) \
.setOutputCol("dependency_type") \
.setConllU("src/test/resources/parser/labeled/train_small.conllu.txt") \
.setNumberOfIterations(1)
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParser,
typedDependencyParser
])
# Additional training data is not needed, the dependency parser relies on CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("pos")
val dependencyParser = DependencyParserModel.pretrained()
.setInputCols("sentence", "pos", "token")
.setOutputCol("dependency")
val typedDependencyParser = new TypedDependencyParserApproach()
.setInputCols("dependency", "pos", "token")
.setOutputCol("dependency_type")
.setConllU("src/test/resources/parser/labeled/train_small.conllu.txt")
.setNumberOfIterations(1)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParser,
typedDependencyParser
))
// Additional training data is not needed, the dependency parser relies on CoNLL-U only.
val emptyDataSet = Seq.empty[String].toDF("text")
val pipelineModel = pipeline.fit(emptyDataSet)
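As a follow-up not shown in the original example, the fitted pipelineModel can be applied to real text. Below is a minimal Python sketch, assuming the pipelineModel trained above; the numeric struct field names of arrays_zip ("0", "1", "2") follow the convention used elsewhere on this page and may differ between Spark versions.
# Minimal usage sketch (assumption): pair each token with its head and dependency label.
data = spark.createDataFrame([["Dependencies represent relationships between tokens in a sentence"]]) \
    .toDF("text")
result = pipelineModel.transform(data)
result.selectExpr("explode(arrays_zip(token.result, dependency.result, dependency_type.result)) as cols") \
    .selectExpr("cols['0'] as token", "cols['1'] as head", "cols['2'] as label") \
    .show(truncate=False)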
ViveknSentiment
Trains a sentiment analyser inspired by the algorithm by Vivek Narayanan https://github.com/vivekn/sentiment/.
The algorithm is based on the paper “Fast and accurate sentiment classification using an enhanced Naive Bayes model”.
The analyzer requires sentence boundaries to give a score in context, and tokenization to make sure tokens are within bounds. Transitivity requirements are also needed.
The training data needs to consist of a column of normalized text and a label column (either "positive" or "negative").
For extended examples of usage, see the Examples and the ViveknSentimentTestSpec.
Input Annotator Types: TOKEN, DOCUMENT
Output Annotator Type: SENTIMENT
Python API: ViveknSentimentApproach | Scala API: ViveknSentimentApproach | Source: ViveknSentimentApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
document = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
token = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normal")
vivekn = ViveknSentimentApproach() \
.setInputCols(["document", "normal"]) \
.setSentimentCol("train_sentiment") \
.setOutputCol("result_sentiment")
finisher = Finisher() \
.setInputCols(["result_sentiment"]) \
.setOutputCols("final_sentiment")
pipeline = Pipeline().setStages([document, token, normalizer, vivekn, finisher])
training = spark.createDataFrame([
("I really liked this movie!", "positive"),
("The cast was horrible", "negative"),
("Never going to watch this again or recommend it to anyone", "negative"),
("It's a waste of time", "negative"),
("I loved the protagonist", "positive"),
("The music was really really good", "positive")
]).toDF("text", "train_sentiment")
pipelineModel = pipeline.fit(training)
data = spark.createDataFrame([
["I recommend this movie"],
["Dont waste your time!!!"]
]).toDF("text")
result = pipelineModel.transform(data)
result.select("final_sentiment").show(truncate=False)
+---------------+
|final_sentiment|
+---------------+
|[positive] |
|[negative] |
+---------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.Normalizer
import com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach
import com.johnsnowlabs.nlp.Finisher
import org.apache.spark.ml.Pipeline
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val token = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normal")
val vivekn = new ViveknSentimentApproach()
.setInputCols("document", "normal")
.setSentimentCol("train_sentiment")
.setOutputCol("result_sentiment")
val finisher = new Finisher()
.setInputCols("result_sentiment")
.setOutputCols("final_sentiment")
val pipeline = new Pipeline().setStages(Array(document, token, normalizer, vivekn, finisher))
val training = Seq(
("I really liked this movie!", "positive"),
("The cast was horrible", "negative"),
("Never going to watch this again or recommend it to anyone", "negative"),
("It's a waste of time", "negative"),
("I loved the protagonist", "positive"),
("The music was really really good", "positive")
).toDF("text", "train_sentiment")
val pipelineModel = pipeline.fit(training)
val data = Seq(
"I recommend this movie",
"Dont waste your time!!!"
).toDF("text")
val result = pipelineModel.transform(data)
result.select("final_sentiment").show(false)
+---------------+
|final_sentiment|
+---------------+
|[positive] |
|[negative] |
+---------------+
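A pretrained counterpart is also available as ViveknSentimentModel. The following is a minimal Python sketch (an assumption, not part of the original example), reusing the stages defined in the Python example above and assuming the default English model can be downloaded in your environment.
# Minimal sketch (assumption): swap the trainable approach for the public pretrained model.
pretrainedVivekn = ViveknSentimentModel.pretrained() \
    .setInputCols(["document", "normal"]) \
    .setOutputCol("result_sentiment")
pretrainedPipeline = Pipeline().setStages([document, token, normalizer, pretrainedVivekn, finisher])
pretrainedPipeline.fit(data).transform(data).select("final_sentiment").show(truncate=False)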
Word2Vec
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in the vocabulary. The vector representations can be used as features in natural language processing and machine learning algorithms.
We use the Word2Vec implementation from Spark ML, which trains the model with the skip-gram architecture and hierarchical softmax. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Word2VecModel.
Sources :
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Input Annotator Types: TOKEN
Output Annotator Type: WORD_EMBEDDINGS
Python API: Word2VecApproach | Scala API: Word2VecApproach | Source: Word2VecApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = Word2VecApproach() \
.setInputCols(["token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings
])
path = "sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")
pipelineModel = pipeline.fit(dataset)
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Word2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = new Word2VecApproach()
.setInputCols("token")
.setOutputCol("embeddings")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings
))
val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
.toDF("text")
val pipelineModel = pipeline.fit(dataset)
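A brief usage sketch, not part of the original snippet: once fitted, the pipeline produces WORD_EMBEDDINGS annotations whose vectors can be inspected per token. The Python code below assumes the pipelineModel fitted in the Python example above.
# Minimal usage sketch (assumption): inspect the learned per-token vectors.
sample = spark.createDataFrame([["Sherlock Holmes solved the case"]]).toDF("text")
transformed = pipelineModel.transform(sample)
transformed.selectExpr("explode(embeddings) as embedding") \
    .selectExpr("embedding.result as token", "embedding.embeddings as vector") \
    .show(truncate=False)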
WordEmbeddings
Word Embeddings lookup annotator that maps tokens to vectors.
For instantiated/pretrained models, see WordEmbeddingsModel.
A custom token lookup dictionary for embeddings can be set with setStoragePath. Each line of the provided file needs to contain a token, followed by its vector representation, delimited by spaces.
...
are 0.39658191506190343 0.630968081620067 0.5393722253731201 0.8428180123359783
were 0.7535235923631415 0.9699218875629833 0.10397182122983872 0.11833962569383116
stress 0.0492683418305907 0.9415954572751959 0.47624463167525755 0.16790967216778263
induced 0.1535748762292387 0.33498936903209897 0.9235178224122094 0.1158772920395934
...
If a token is not found in the dictionary, then the result will be a zero vector of the same dimension.
Statistics about the rate of converted tokens can be retrieved with WordEmbeddingsModel.withCoverageColumn and WordEmbeddingsModel.overallCoverage.
For extended examples of usage, see the Examples and the WordEmbeddingsTestSpec.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: WORD_EMBEDDINGS
Python API: WordEmbeddings | Scala API: WordEmbeddings | Source: WordEmbeddings |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, the file `random_embeddings_dim4.txt` has the form of the content above.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = WordEmbeddings() \
.setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT) \
.setStorageRef("glove_4d") \
.setDimension(4) \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])
data = spark.createDataFrame([["The patient was diagnosed with diabetes."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(truncate=False)
+----------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316] |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307] |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048] |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149] |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938] |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863] |
+----------------------------------------------------------------------------------+
// In this example, the file `random_embeddings_dim4.txt` has the form of the content above.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddings
import com.johnsnowlabs.nlp.util.io.ReadAs
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = new WordEmbeddings()
.setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT)
.setStorageRef("glove_4d")
.setDimension(4)
.setInputCols("document", "token")
.setOutputCol("embeddings")
val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
))
val data = Seq("The patient was diagnosed with diabetes.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(false)
+----------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316] |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307] |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048] |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149] |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938] |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863] |
+----------------------------------------------------------------------------------+
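The coverage utilities mentioned above can be applied to the transformed DataFrame. A minimal Python sketch, assuming the result DataFrame from the example; the output column name is illustrative.
# Minimal sketch (assumption): report how many tokens were found in the lookup dictionary.
from sparknlp.annotator import WordEmbeddingsModel

withCoverage = WordEmbeddingsModel.withCoverageColumn(result, "embeddings", "coverage")
withCoverage.select("coverage").show(truncate=False)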
WordSegmenter
Trains a WordSegmenter which tokenizes non-English or non-whitespace-separated texts.
Many languages are not whitespace-separated, and their sentences are a concatenation of many symbols, such as Korean, Japanese or Chinese. Without understanding the language, splitting the words into their corresponding tokens is impossible. The WordSegmenter is trained to understand these languages and split them into semantically correct parts.
This annotator is based on the paper Chinese Word Segmentation as Character Tagging [1]. Word segmentation is treated as a tagging problem. Each character is tagged as one of four labels: LL (left boundary), RR (right boundary), MM (middle) and LR (word by itself). The label depends on the position of the character within the word. Characters tagged LL combine with the character on their right; likewise, characters tagged RR combine with the character on their left. MM-tagged characters sit in the middle of a word and combine with either side. LR-tagged characters form words by themselves.
Example (from [1], Example 3(a) (raw), 3(b) (tagged), 3(c) (translation)):
- 上海 计划 到 本 世纪 末 实现 人均 国内 生产 总值 五千 美元
- 上/LL 海/RR 计/LL 划/RR 到/LR 本/LR 世/LL 纪/RR 末/LR 实/LL 现/RR 人/LL 均/RR 国/LL 内/RR 生/LL 产/RR 总/LL 值/RR 五/LL 千/RR 美/LL 元/RR
- Shanghai plans to reach the goal of 5,000 dollars in per capita GDP by the end of the century.
For instantiated/pretrained models, see WordSegmenterModel.
To train your own model, a training dataset consisting of Part-Of-Speech tags is required. The data has to be loaded into a dataframe, where the column is an Annotation of type "POS". This can be set with setPosColumn.
Tip: The helper class POS might be useful to read training data into data frames.
For extended examples of usage, see the Examples and the WordSegmenterTest.
References:
- [1] Xue, Nianwen. “Chinese Word Segmentation as Character Tagging.” International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing, 2003, pp. 29-48. ACLWeb, https://aclanthology.org/O03-4002.
Input Annotator Types: DOCUMENT
Output Annotator Type: TOKEN
Python API: WordSegmenterApproach | Scala API: WordSegmenterApproach | Source: WordSegmenterApproach |
Show Example
# In this example, `"chinese_train.utf8"` is in the form of
#
# 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
#
# and is loaded with the `POS` class to create a dataframe of `"POS"` type Annotations.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
wordSegmenter = WordSegmenterApproach() \
.setInputCols(["document"]) \
.setOutputCol("token") \
.setPosColumn("tags") \
.setNIterations(5)
pipeline = Pipeline().setStages([
documentAssembler,
wordSegmenter
])
trainingDataSet = POS().readDataset(
spark,
"src/test/resources/word-segmenter/chinese_train.utf8"
)
pipelineModel = pipeline.fit(trainingDataSet)
// In this example, `"chinese_train.utf8"` is in the form of
//
// 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
//
// and is loaded with the `POS` class to create a dataframe of `"POS"` type Annotations.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ws.WordSegmenterApproach
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.util.io.ResourceHelper
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val wordSegmenter = new WordSegmenterApproach()
.setInputCols("document")
.setOutputCol("token")
.setPosColumn("tags")
.setNIterations(5)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
wordSegmenter
))
val trainingDataSet = POS().readDataset(
ResourceHelper.spark,
"src/test/resources/word-segmenter/chinese_train.utf8"
)
val pipelineModel = pipeline.fit(trainingDataSet)
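A minimal usage sketch, not in the original snippet: once fitted, the model segments raw text into tokens. The Python code below assumes the pipelineModel fitted in the Python example above; the segmentation quality depends entirely on the training data.
# Minimal usage sketch (assumption): segment raw Chinese text with the fitted pipeline.
sample = spark.createDataFrame([["十四不是四十"]]).toDF("text")
pipelineModel.transform(sample).selectExpr("token.result").show(truncate=False)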
YakeKeywordExtraction
Yake is an unsupervised, corpus-independent, domain- and language-independent, single-document keyword extraction algorithm.
Extracting keywords from texts has become a challenge for individuals and organizations as the information grows in complexity and size. The need to automate this task so that text can be processed in a timely and adequate manner has led to the emergence of automatic keyword extraction tools. Yake is a novel feature-based system for multi-lingual keyword extraction, which supports texts of different sizes, domains or languages. Unlike other approaches, Yake does not rely on dictionaries or thesauri, nor is it trained on any corpus. Instead, it follows an unsupervised approach which builds upon features extracted from the text, thus making it applicable to documents written in different languages without the need for further knowledge. This can be beneficial for a large number of tasks and a plethora of situations where access to training corpora is either limited or restricted. The algorithm makes use of the position of a sentence and token. Therefore, to use the annotator, the text should first be sent through a Sentence Boundary Detector and then a tokenizer.
Note that each keyword will be given a keyword score greater than 0 (the lower the score, the better the keyword). Therefore, to filter the keywords, an upper bound for the score can be set with setThreshold.
For extended examples of usage, see the Examples and the YakeTestSpec.
Sources :
Paper abstract:
As the amount of generated information grows, reading and summarizing texts of large collections turns into a challenging task. Many documents do not come with descriptive terms, thus requiring humans to generate keywords on-the-fly. The need to automate this kind of task demands the development of keyword extraction systems with the ability to automatically identify keywords within the text. One approach is to resort to machine-learning algorithms. These, however, depend on large annotated text corpora, which are not always available. An alternative solution is to consider an unsupervised approach. In this article, we describe YAKE!, a light-weight unsupervised automatic keyword extraction method which rests on statistical text features extracted from single documents to select the most relevant keywords of a text. Our system does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, text size, language, or domain. To demonstrate the merits and significance of YAKE!, we compare it against ten state-of-the-art unsupervised approaches and one supervised method. Experimental results carried out on top of twenty datasets show that YAKE! significantly outperforms other unsupervised methods on texts of different sizes, languages, and domains.
Input Annotator Types: TOKEN
Output Annotator Type: CHUNK
Python API: YakeKeywordExtraction | Scala API: YakeKeywordExtraction | Source: YakeKeywordExtraction |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
token = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token") \
.setContextChars(["(", "]", "?", "!", ".", ","])
keywords = YakeKeywordExtraction() \
.setInputCols(["token"]) \
.setOutputCol("keywords") \
.setThreshold(0.6) \
.setMinNGrams(2) \
.setNKeywords(10)
pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
token,
keywords
])
data = spark.createDataFrame([[
"Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, NaRavikant, Google chie economist Hal Varian, Khosla Ventures and Yuri Milner"
]]).toDF("text")
result = pipeline.fit(data).transform(data)
# combine the result and score (contained in keywords.metadata)
scores = result \
.selectExpr("explode(arrays_zip(keywords.result, keywords.metadata)) as resultTuples") \
.selectExpr("resultTuples['0'] as keyword", "resultTuples['1'].score as score")
# Order ascending, as lower scores means higher importance
scores.orderBy("score").show(5, truncate = False)
+---------------------+-------------------+
|keyword |score |
+---------------------+-------------------+
|google cloud |0.32051516486864573|
|google cloud platform|0.37786450577630676|
|ceo anthony goldbloom|0.39922830978423146|
|san francisco |0.40224744669493756|
|anthony goldbloom |0.41584827825302534|
+---------------------+-------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Tokenizer}
import com.johnsnowlabs.nlp.annotators.keyword.yake.YakeKeywordExtraction
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val token = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
.setContextChars(Array("(", ")", "?", "!", ".", ","))
val keywords = new YakeKeywordExtraction()
.setInputCols("token")
.setOutputCol("keywords")
.setThreshold(0.6f)
.setMinNGrams(2)
.setNKeywords(10)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
token,
keywords
))
val data = Seq(
"Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, Google chief economist Hal Varian, Khosla Ventures and Yuri Milner"
).toDF("text")
val result = pipeline.fit(data).transform(data)
// combine the result and score (contained in keywords.metadata)
val scores = result
.selectExpr("explode(arrays_zip(keywords.result, keywords.metadata)) as resultTuples")
.select($"resultTuples.0" as "keyword", $"resultTuples.1.score")
// Order ascending, as lower scores means higher importance
scores.orderBy("score").show(5, truncate = false)
+---------------------+-------------------+
|keyword |score |
+---------------------+-------------------+
|google cloud |0.32051516486864573|
|google cloud platform|0.37786450577630676|
|ceo anthony goldbloom|0.39922830978423146|
|san francisco |0.40224744669493756|
|anthony goldbloom |0.41584827825302534|
+---------------------+-------------------+
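Besides setThreshold, the keywords can also be filtered after extraction. A minimal Python sketch using the scores DataFrame computed above; the cutoff value is illustrative, and the cast is needed because scores are stored as strings in the annotation metadata.
# Minimal sketch (assumption): keep only keywords below an illustrative score cutoff.
from pyspark.sql.functions import col

filtered = scores.filter(col("score").cast("double") < 0.45)
filtered.orderBy(col("score").cast("double")).show(truncate=False)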