Spark NLP - Annotators

 

How to read this section

All annotators in Spark NLP share a common interface:

  • Annotation: Annotation(annotatorType, begin, end, result, metadata, embeddings)
  • AnnotatorType: some annotators share a type. This is not only figurative, but also describes the structure of the metadata map in the Annotation. This is the type referred to in the inputs and outputs of annotators.
  • Inputs: Represents how many and which annotator types are expected in setInputCols(). These are the column names of the outputs of other annotators in the DataFrame.
  • Output: Represents the type of the output in the column set by setOutputCol() (see the sketch after this list).
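
A minimal sketch of what this shared schema looks like in practice, assuming a local sparknlp.start() session and using only a DocumentAssembler:

import sparknlp
from sparknlp.base import DocumentAssembler

spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

data = spark.createDataFrame([["Spark NLP annotators share one schema."]]).toDF("text")
result = documentAssembler.transform(data)

# Each element of "document" is an Annotation(annotatorType, begin, end, result, metadata, embeddings)
result.selectExpr("explode(document) as annotation") \
    .selectExpr("annotation.annotatorType", "annotation.begin", "annotation.end",
                "annotation.result", "annotation.metadata") \
    .show(truncate=False)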

There are two types of Annotators:

  • Approach: AnnotatorApproaches extend Spark ML Estimators and are meant to be trained through fit()
  • Model: AnnotatorModels extend Spark ML Transformers and are meant to transform DataFrames through transform()

The Model suffix is explicitly stated when the annotator is the result of a training process. Some annotators, such as Tokenizer, are transformers but do not contain the word Model, since they are not trained annotators.

Model annotators have a pretrained() method on their static (companion) object to retrieve the public pre-trained version of a model.

  • pretrained(name, language, extra_location) -> by default, pretrained() downloads a default model. Sometimes more than one model is available; in that case, you may have to specify the name, language, or extra location to download the one you want (a sketch follows below).
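
For example, a minimal sketch of retrieving the default model versus a specific public model by name and language (WordEmbeddingsModel is used here only as an example of a Model annotator):

from sparknlp.annotator import WordEmbeddingsModel

# Default pretrained model for this annotator
embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# A specific public model, selected by name and language
glove = WordEmbeddingsModel.pretrained("glove_100d", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")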

Available Annotators

Annotator Description Version
AutoGGUFEmbeddings Annotator that uses the llama.cpp library to generate text embeddings with large language models. Opensource
AutoGGUFModel Annotator that uses the llama.cpp library to generate text completions with large language models. Opensource
BGEEmbeddings Sentence embeddings using BGE. Opensource
BigTextMatcher Annotator to match exact phrases (by token) provided in a file against a Document. Opensource
Chunk2Doc Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result. Opensource
ChunkEmbeddings This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs. Opensource
ChunkTokenizer Tokenizes and flattens extracted NER chunks. Opensource
Chunker This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from document. Opensource
ClassifierDL ClassifierDL for generic Multi-class Text Classification. Opensource
ContextSpellChecker Implements a deep-learning based Noisy Channel Model Spell Algorithm. Opensource
Date2Chunk Converts DATE type Annotations to CHUNK type. Opensource
DateMatcher Matches standard date formats into a provided format. Opensource
DependencyParser Unlabeled parser that finds a grammatical relation between two words in a sentence. Opensource
Doc2Chunk Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. Opensource
Doc2Vec Word2Vec model that creates vector representations of words in a text corpus. Opensource
DocumentAssembler Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. Opensource
DocumentCharacterTextSplitter Annotator which splits large documents into chunks of roughly given size. Opensource
DocumentNormalizer Annotator which normalizes raw text from tagged text, e.g. scraped web pages or xml documents, from document type columns into Sentence. Opensource
DocumentSimilarityRanker Annotator that uses LSH techniques present in Spark MLlib to execute approximate nearest neighbors search on top of sentence embeddings. Opensource
DocumentTokenSplitter Annotator that splits large documents into smaller documents based on the number of tokens in the text. Opensource
EntityRuler Fits an Annotator to match exact strings or regex patterns provided in a file against a Document and assigns them a named entity. Opensource
EmbeddingsFinisher Extracts embeddings from Annotations into a more easily usable form. Opensource
Finisher Converts annotation results into a format that is easier to use. It is useful for extracting the results from Spark NLP Pipelines. Opensource
GraphExtraction Extracts a dependency graph between entities. Opensource
GraphFinisher Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF. Opensource
ImageAssembler Prepares images read by Spark into a format that is processable by Spark NLP. Opensource
LanguageDetectorDL Language Identification and Detection by using CNN and RNN architectures in TensorFlow. Opensource
Lemmatizer Finds lemmas out of words with the objective of returning a base dictionary word. Opensource
MultiClassifierDL Multi-label Text Classification. Opensource
MultiDateMatcher Matches standard date formats into a provided format. Opensource
MultiDocumentAssembler Prepares data into a format that is processable by Spark NLP. Opensource
NGramGenerator A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). Opensource
NerConverter Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities with their labels. Opensource
NerCrf Extracts Named Entities based on a CRF Model. Opensource
NerDL This Named Entity recognition annotator is a generic NER model based on Neural Networks. Opensource
NerOverwriter Overwrites entities of specified strings. Opensource
Normalizer Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary. Opensource
NorvigSweeting Spellchecker Retrieves tokens and makes corrections automatically if not found in an English dictionary. Opensource
POSTagger (Part of speech tagger) Averaged Perceptron model to tag words part-of-speech. Opensource
PromptAssembler Assembles a sequence of messages into a single string using a template. Opensource
RecursiveTokenizer Tokenizes raw text recursively based on a handful of definable rules. Opensource
RegexMatcher Uses rules to match a set of regular expressions and associate them with a provided identifier. Opensource
RegexTokenizer A tokenizer that splits text by a regex pattern. Opensource
SentenceDetector Annotator that detects sentence boundaries using regular expressions. Opensource
SentenceDetectorDL Detects sentence boundaries using a deep learning approach. Opensource
SentenceEmbeddings Converts the results from WordEmbeddings, BertEmbeddings, or ElmoEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols). Opensource
SentimentDL Annotator for multi-class sentiment analysis. Opensource
SentimentDetector Rule based sentiment detector, which calculates a score based on predefined keywords. Opensource
Stemmer Returns hard-stems out of words with the objective of retrieving the meaningful part of the word. Opensource
StopWordsCleaner This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences. Opensource
SymmetricDelete Spellchecker Symmetric Delete spelling correction algorithm. Opensource
TextMatcher Matches exact phrases (by token) provided in a file against a Document. Opensource
Token2Chunk Converts TOKEN type Annotations to CHUNK type. Opensource
TokenAssembler This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Opensource
Tokenizer Tokenizes raw text into word pieces, tokens. Identifies tokens with tokenization open standards. A few rules will help customize it if the defaults do not fit the user's needs. Opensource
TypedDependencyParser Labeled parser that finds a grammatical relation between two words in a sentence. Opensource
ViveknSentiment Sentiment analyser inspired by the algorithm by Vivek Narayanan. Opensource
WordEmbeddings Word Embeddings lookup annotator that maps tokens to vectors. Opensource
Word2Vec Word2Vec model that creates vector representations of words in a text corpus. Opensource
WordSegmenter Tokenizes non-English or non-whitespace-separated texts. Opensource
YakeKeywordExtraction Unsupervised, Corpus-Independent, Domain and Language-Independent and Single-Document keyword extraction. Opensource

Available Transformers

Additionally, these transformers are available.

Transformer Description Version
AlbertEmbeddings ALBERT: A Lite BERT for Self-supervised Learning of Language Representations Opensource
AlbertForQuestionAnswering AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD. Opensource
AlbertForTokenClassification AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. Opensource
AlbertForSequenceClassification AlbertForSequenceClassification can load ALBERT Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. Opensource
BartForZeroShotClassification BartForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. Opensource
BartTransformer BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Transformer Opensource
BertForQuestionAnswering BertForQuestionAnswering can load Bert Models with a span classification head on top for extractive question-answering tasks like SQuAD. Opensource
BertForSequenceClassification Bert Models with sequence classification/regression head on top. Opensource
BertForTokenClassification BertForTokenClassification can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. Opensource
BertForZeroShotClassification BertForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. Opensource
BertSentenceEmbeddings Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. Opensource
CamemBertEmbeddings CamemBert is based on Facebook’s RoBERTa model released in 2019. Opensource
CamemBertForQuestionAnswering CamemBertForQuestionAnswering can load CamemBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD Opensource
CamemBertForSequenceClassification CamemBertForSequenceClassification can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. Opensource
CamemBertForTokenClassification CamemBertForTokenClassification can load CamemBERT Models with a token classification head on top Opensource
CLIPForZeroShotClassification Zero Shot Image Classifier based on CLIP Opensource
ConvNextForImageClassification ConvNextForImageClassification is an image classifier based on ConvNet models Opensource
DeBertaEmbeddings DeBERTa builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in RoBERTa. Opensource
DeBertaForQuestionAnswering DeBertaForQuestionAnswering can load DeBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD. Opensource
DeBertaForSequenceClassification DeBertaForSequenceClassification can load DeBerta v2 & v3 Models with sequence classification/regression head on top. Opensource
DeBertaForTokenClassification DeBertaForTokenClassification can load DeBERTA Models v2 and v3 with a token classification head on top. Opensource
DistilBertEmbeddings DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. Opensource
DistilBertForQuestionAnswering DistilBertForQuestionAnswering can load DistilBert Models with a span classification head on top for extractive question-answering tasks like SQuAD. Opensource
DistilBertForSequenceClassification DistilBertForSequenceClassification can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. Opensource
DistilBertForTokenClassification DistilBertForTokenClassification can load DistilBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. Opensource
DistilBertForZeroShotClassification DistilBertForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. Opensource
E5Embeddings Sentence embeddings using E5, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task. Opensource
ElmoEmbeddings Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark. Opensource
GPT2Transformer GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. Opensource
HubertForCTC Hubert Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Opensource
InstructorEmbeddings Sentence embeddings using INSTRUCTOR. Opensource
LongformerEmbeddings Longformer is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. Opensource
LongformerForQuestionAnswering LongformerForQuestionAnswering can load Longformer Models with a span classification head on top for extractive question-answering tasks like SQuAD. Opensource
LongformerForSequenceClassification LongformerForSequenceClassification can load Longformer Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. Opensource
LongformerForTokenClassification LongformerForTokenClassification can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. Opensource
MarianTransformer Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. Opensource
MPNetEmbeddings Sentence embeddings using MPNet. Opensource
MPNetForQuestionAnswering MPNet Models with a span classification head on top for extractive question-answering tasks like SQuAD. Opensource
MPNetForSequenceClassification MPNet Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. Opensource
OpenAICompletion Transformer that makes a request for OpenAI Completion API for each executor. Opensource
RoBertaEmbeddings RoBERTa: A Robustly Optimized BERT Pretraining Approach Opensource
RoBertaForQuestionAnswering RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD. Opensource
RoBertaForSequenceClassification RoBertaForSequenceClassification can load RoBERTa Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. Opensource
RoBertaForTokenClassification RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. Opensource
RoBertaForZeroShotClassification RoBertaForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. Opensource
RoBertaSentenceEmbeddings Sentence-level embeddings using RoBERTa. Opensource
SpanBertCoref A coreference resolution model based on SpanBert. Opensource
SwinForImageClassification SwinForImageClassification is an image classifier based on Swin. Opensource
T5Transformer T5 reconsiders all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Opensource
TapasForQuestionAnswering TapasForQuestionAnswering is an implementation of TaPas - a BERT-based model specifically designed for answering questions about tabular data. Opensource
UAEEmbeddings Sentence embeddings using Universal AnglE Embedding (UAE). Opensource
UniversalSentenceEncoder The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. Opensource
VisionEncoderDecoderForImageCaptioning VisionEncoderDecoder model that converts images into text captions. Opensource
ViTForImageClassification Vision Transformer (ViT) for image classification. Opensource
Wav2Vec2ForCTC Wav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Opensource
WhisperForCTC Whisper Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Opensource
XlmRoBertaEmbeddings XlmRoBerta is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data. Opensource
XlmRoBertaForQuestionAnswering XlmRoBertaForQuestionAnswering can load XLM-RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD. Opensource
XlmRoBertaForSequenceClassification XlmRoBertaForSequenceClassification can load XLM-RoBERTa Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. Opensource
XlmRoBertaForTokenClassification XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. Opensource
XlmRoBertaForZeroShotClassification XlmRoBertaForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. Opensource
XlmRoBertaSentenceEmbeddings Sentence-level embeddings using XLM-RoBERTa. Opensource
XlnetEmbeddings XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Opensource
XlnetForTokenClassification XlnetForTokenClassification can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. Opensource
XlnetForSequenceClassification XlnetForSequenceClassification can load XLNet Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. Opensource
ZeroShotNer ZeroShotNerModel implements zero-shot named entity recognition by utilizing RoBERTa transformer models fine-tuned on a question answering task. Opensource

AutoGGUFEmbeddings

Annotator that uses the llama.cpp library to generate text embeddings with large language models.

The type of embedding pooling can be set with the setPoolingType method. The default is "MEAN". The available options are "NONE", "MEAN", "CLS", and "LAST".

If the parameters are not set, the annotator will default to the parameters provided by the model.

Pretrained models can be loaded with pretrained of the companion object:

val autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")

The default model is "nomic-embed-text-v1.5.Q8_0.gguf", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the AutoGGUFEmbeddingsTest and the example notebook.

Note: To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method.

When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.
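
A minimal sketch of such an adjustment (the values here are placeholders and should be tuned to your hardware):

autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings") \
    .setNCtx(4096) \
    .setNGpuLayers(99)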

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: AutoGGUFEmbeddings Scala API: AutoGGUFEmbeddings Source: AutoGGUFEmbeddings
Show Example
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setBatchSize(4) \
...     .setNGpuLayers(99) \
...     .setPoolingType("MEAN")
>>> pipeline = Pipeline().setStages([document, autoGGUFEmbeddings])
>>> data = spark.createDataFrame([["The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("embeddings.embeddings").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val autoGGUFEmbeddings = AutoGGUFEmbeddings
  .pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")
  .setBatchSize(4)
  .setPoolingType("MEAN")

val pipeline = new Pipeline().setStages(Array(document, autoGGUFEmbeddings))

val data = Seq(
  "The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones.")
  .toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(1, truncate=80)
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+

AutoGGUFModel

Annotator that uses the llama.cpp library to generate text completions with large language models.

For settable parameters, and their explanations, see HasLlamaCppProperties and refer to the llama.cpp documentation of server.cpp for more information.

If the parameters are not set, the annotator will default to the parameters provided by the model.

Pretrained models can be loaded with pretrained of the companion object:

val autoGGUFModel = AutoGGUFModel.pretrained()
  .setInputCols("document")
  .setOutputCol("completions")

The default model is "phi3.5_mini_4k_instruct_q4_gguf", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the AutoGGUFModelTest and the example notebook.

Note: To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method.

When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: AutoGGUFModel Scala API: AutoGGUFModel Source: AutoGGUFModel
Show Example
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> autoGGUFModel = AutoGGUFModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("completions") \
...     .setBatchSize(4) \
...     .setNPredict(20) \
...     .setNGpuLayers(99) \
...     .setTemperature(0.4) \
...     .setTopK(40) \
...     .setTopP(0.9) \
...     .setPenalizeNl(True)
>>> pipeline = Pipeline().setStages([document, autoGGUFModel])
>>> data = spark.createDataFrame([["Hello, I am a"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("completions").show(truncate = False)
+-----------------------------------------------------------------------------------------------------------------------------------+
|completions                                                                                                                        |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 78,  new user.  I am currently working on a project and I need to create a list of , {prompt -> Hello, I am a}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------+
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val autoGGUFModel = AutoGGUFModel
  .pretrained()
  .setInputCols("document")
  .setOutputCol("completions")
  .setBatchSize(4)
  .setNPredict(20)
  .setNGpuLayers(99)
  .setTemperature(0.4f)
  .setTopK(40)
  .setTopP(0.9f)
  .setPenalizeNl(true)

val pipeline = new Pipeline().setStages(Array(document, autoGGUFModel))

val data = Seq("Hello, I am a").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("completions").show(truncate = false)
+-----------------------------------------------------------------------------------------------------------------------------------+
|completions                                                                                                                        |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 78,  new user.  I am currently working on a project and I need to create a list of , {prompt -> Hello, I am a}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------+

BGEEmbeddings

Sentence embeddings using BGE.

BGE, or BAAI General Embeddings, is a model that can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, or semantic search.

Note that this annotator is only supported for Spark Versions 3.4 and up.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = BGEEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")

The default model is "bge_base", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see BGEEmbeddingsTestSpec.

Sources :

C-Pack: Packaged Resources To Advance General Chinese Embedding

BGE Github Repository

Paper abstract

We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: BGEEmbeddings Scala API: BGEEmbeddings Source: BGEEmbeddings
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
embeddings = BGEEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("bge_embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["bge_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    embeddingsFinisher
])
data = spark.createDataFrame([
    ["query: how much protein should a female eat"],
    ["passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " + \
     "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " + \
     "marathon. Check out the chart below to see how much protein you should be eating each day."],
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BGEEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BGEEmbeddings.pretrained("bge_base", "en")
  .setInputCols("document")
  .setOutputCol("bge_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("bge_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq(
  "query: how much protein should a female eat",
  "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " +
    "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " +
    "marathon. Check out the chart below to see how much protein you should be eating each day."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+

BigTextMatcher

Annotator to match exact phrases (by token) provided in a file against a Document.

A text file of predefined phrases must be provided with setStoragePath.

In contrast to the normal TextMatcher, the BigTextMatcher is designed for large corpora.

For extended examples of usage, see the BigTextMatcherTestSpec.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Python API: BigTextMatcher Scala API: BigTextMatcher Source: BigTextMatcher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.common import ReadAs
# In this example, the entities file is of the form
#
# ...
# dolore magna aliqua
# lorem ipsum dolor. sit
# laborum
# ...
#
# where each line represents an entity phrase to be extracted.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
entityExtractor = BigTextMatcher() \
    .setInputCols("document", "token") \
    .setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
    .setOutputCol("entity") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity)").show(truncate=False)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []]           |
+--------------------------------------------------------------------+
// In this example, the entities file is of the form
//
// ...
// dolore magna aliqua
// lorem ipsum dolor. sit
// laborum
// ...
//
// where each line represents an entity phrase to be extracted.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.BigTextMatcher
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
val entityExtractor = new BigTextMatcher()
  .setInputCols("document", "token")
  .setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
  .setOutputCol("entity")
  .setCaseSensitive(false)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity)").show(false)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []]           |
+--------------------------------------------------------------------+

Chunk2Doc

Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.

Input Annotator Types: CHUNK

Output Annotator Type: DOCUMENT

Python API: Chunk2Doc Scala API: Chunk2Doc Source: Chunk2Doc
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
# Location entities are extracted and converted back into `DOCUMENT` type for further processing

data = spark.createDataFrame([[1, "New York and New Jersey aren't that far apart actually."]]).toDF("id", "text")

# Extracts Named Entities amongst other things
pipeline = PretrainedPipeline("explain_document_dl")

chunkToDoc = Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
explainResult = pipeline.transform(data)

result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(truncate=False)
+------------------------------------------------------------------------------+
|col                                                                           |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
// Location entities are extracted and converted back into `DOCUMENT` type for further processing
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Chunk2Doc

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val chunkToDoc = new Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
val explainResult = pipeline.transform(data)

val result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(false)
+------------------------------------------------------------------------------+
|col                                                                           |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+

ChunkEmbeddings

This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs.

For extended examples of usage, see the Examples and the ChunkEmbeddingsTestSpec.

Input Annotator Types: CHUNK, WORD_EMBEDDINGS

Output Annotator Type: WORD_EMBEDDINGS

Python API: ChunkEmbeddings Scala API: ChunkEmbeddings Source: ChunkEmbeddings
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Extract the Embeddings from the NGrams
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

nGrams = NGramGenerator() \
    .setInputCols(["token"]) \
    .setOutputCol("chunk") \
    .setN(2)

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False)

# Convert the NGram chunks into Word Embeddings
chunkEmbeddings = ChunkEmbeddings() \
    .setInputCols(["chunk", "embeddings"]) \
    .setOutputCol("chunk_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      sentence,
      tokenizer,
      nGrams,
      embeddings,
      chunkEmbeddings
    ])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk_embeddings) as result") \
    .select("result.annotatorType", "result.result", "result.embeddings") \
    .show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
|  annotatorType|    result|                                                                      embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings|   This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings|      is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.{NGramGenerator, Tokenizer}
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.embeddings.ChunkEmbeddings
import org.apache.spark.ml.Pipeline

// Extract the Embeddings from the NGrams
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val nGrams = new NGramGenerator()
  .setInputCols("token")
  .setOutputCol("chunk")
  .setN(2)

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

// Convert the NGram chunks into Word Embeddings
val chunkEmbeddings = new ChunkEmbeddings()
  .setInputCols("chunk", "embeddings")
  .setOutputCol("chunk_embeddings")
  .setPoolingStrategy("AVERAGE")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentence,
    tokenizer,
    nGrams,
    embeddings,
    chunkEmbeddings
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk_embeddings) as result")
  .select("result.annotatorType", "result.result", "result.embeddings")
  .show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
|  annotatorType|    result|                                                                      embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings|   This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings|      is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+

ChunkTokenizer

Tokenizes and flattens extracted NER chunks.

The ChunkTokenizer will split the extracted NER CHUNK type Annotations and will create TOKEN type Annotations. The result is then flattened, resulting in a single array.

For extended examples of usage, see the ChunkTokenizerTestSpec.

Input Annotator Types: CHUNK

Output Annotator Type: TOKEN

Python API: ChunkTokenizer Scala API: ChunkTokenizer Source: ChunkTokenizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.common import ReadAs

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

entityExtractor = TextMatcher() \
    .setInputCols(["sentence", "token"]) \
    .setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT) \
    .setOutputCol("entity")

chunkTokenizer = ChunkTokenizer() \
    .setInputCols(["entity"]) \
    .setOutputCol("chunk_token")

pipeline = Pipeline().setStages([
      documentAssembler,
      sentenceDetector,
      tokenizer,
      entityExtractor,
      chunkTokenizer
    ])

data = spark.createDataFrame([
    ["Hello world, my name is Michael, I am an artist and I work at Benezar"],
    ["Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."]
]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("entity.result as entity" , "chunk_token.result as chunk_token").show(truncate=False)
+-----------------------------------------------+---------------------------------------------------+
|entity                                         |chunk_token                                        |
+-----------------------------------------------+---------------------------------------------------+
|[world, Michael, work at Benezar]              |[world, Michael, work, at, Benezar]                |
|[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
+-----------------------------------------------+---------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{ChunkTokenizer, TextMatcher, Tokenizer}
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val entityExtractor = new TextMatcher()
  .setInputCols("sentence", "token")
  .setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT)
  .setOutputCol("entity")

val chunkTokenizer = new ChunkTokenizer()
  .setInputCols("entity")
  .setOutputCol("chunk_token")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    entityExtractor,
    chunkTokenizer
  ))

val data = Seq(
  "Hello world, my name is Michael, I am an artist and I work at Benezar",
  "Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("entity.result as entity" , "chunk_token.result as chunk_token").show(false)
+-----------------------------------------------+---------------------------------------------------+
|entity                                         |chunk_token                                        |
+-----------------------------------------------+---------------------------------------------------+
|[world, Michael, work at Benezar]              |[world, Michael, work, at, Benezar]                |
|[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
+-----------------------------------------------+---------------------------------------------------+

Chunker

This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document. Extracted part-of-speech tags are mapped onto the sentence, which can then be parsed by regular expressions. The part-of-speech tags are wrapped in angle brackets <> so that they are easily distinguishable in the text itself. This example sentence will result in the following form:

"Peter Pipers employees are picking pecks of pickled peppers."
"<NNP><NNP><NNS><VBP><VBG><NNS><IN><JJ><NNS><.>"

To then extract these tags, regexParsers needs to be set, e.g.:

val chunker = new Chunker()
  .setInputCols("sentence", "pos")
  .setOutputCol("chunk")
  .setRegexParsers(Array("<NNP>+", "<NNS>+"))

When defining the regular expressions, tags enclosed in angle brackets are treated as groups, so here specifically "<NNP>+" means 1 or more nouns in succession. Additional patterns can also be set with addRegexParsers.

For more extended examples, see the Examples and the ChunkerTestSpec.

Input Annotator Types: DOCUMENT, POS

Output Annotator Type: CHUNK

Python API: Chunker Scala API: Chunker Source: Chunker
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

POSTag = PerceptronModel.pretrained() \
    .setInputCols("document", "token") \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols("sentence", "pos") \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NNP>+", "<NNS>+"])

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      sentence,
      tokenizer,
      POSTag,
      chunker
    ])

data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk) as result").show(truncate=False)
+-------------------------------------------------------------+
|result                                                       |
+-------------------------------------------------------------+
|[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
|[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []]  |
|[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []]      |
|[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []]    |
+-------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Chunker, Tokenizer}
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val POSTag = PerceptronModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("pos")

val chunker = new Chunker()
  .setInputCols("sentence", "pos")
  .setOutputCol("chunk")
  .setRegexParsers(Array("<NNP>+", "<NNS>+"))

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentence,
    tokenizer,
    POSTag,
    chunker
  ))

val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk) as result").show(false)
+-------------------------------------------------------------+
|result                                                       |
+-------------------------------------------------------------+
|[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
|[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []]  |
|[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []]      |
|[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []]    |
+-------------------------------------------------------------+

ClassifierDL

Trains a ClassifierDL for generic Multi-class Text Classification.

ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 100 classes.

For instantiated/pretrained models, see ClassifierDLModel.

Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example will show how to create the test dataset:

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))

val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
  .fit(test)
  .transform(test)
  .write
  .mode("overwrite")
  .parquet("test_data")

val classifier = new ClassifierDLApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")
  .setLabelColumn("label")
  .setTestDataset("test_data")

For extended examples of usage, see the Examples [1] [2] and the ClassifierDLTestSpec.

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: CATEGORY

Note: This annotator accepts a label column of a single item in either type of String, Int, Float, or Double. UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol.

Python API: ClassifierDLApproach Scala API: ClassifierDLApproach Source: ClassifierDLApproach
Show Example
# In this example, the training data `"sentiment.csv"` has the form of
#
# text,label
# This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
# This was a terrible movie! The acting was bad really bad!,1
# ...
#
# Then training can be done like so:

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

smallCorpus = spark.read.option("header","True").csv("src/test/resources/classifier/sentiment.csv")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("label") \
    .setBatchSize(64) \
    .setMaxEpochs(20) \
    .setLr(5e-3) \
    .setDropout(0.5)

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        docClassifier
      ]
    )

pipelineModel = pipeline.fit(smallCorpus)
// In this example, the training data `"sentiment.csv"` has the form of
//
// text,label
// This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
// This was a terrible movie! The acting was bad really bad!,1
// ...
//
// Then training can be done like so:

import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach
import org.apache.spark.ml.Pipeline

val smallCorpus = spark.read.option("header","true").csv("src/test/resources/classifier/sentiment.csv")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val useEmbeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val docClassifier = new ClassifierDLApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")
  .setLabelColumn("label")
  .setBatchSize(64)
  .setMaxEpochs(20)
  .setLr(5e-3f)
  .setDropout(0.5f)

val pipeline = new Pipeline()
  .setStages(
    Array(
      documentAssembler,
      useEmbeddings,
      docClassifier
    )
  )

val pipelineModel = pipeline.fit(smallCorpus)

ContextSpellChecker

Trains a deep-learning based Noisy Channel Model Spell Algorithm. Correction candidates are extracted combining context information and word information.

For instantiated/pretrained models, see ContextSpellCheckerModel.

Spell Checking is a sequence to sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:

  1. Different correction candidates for each word — word level.
  2. The surrounding text of each word, i.e. its context — sentence level.
  3. The relative cost of different correction candidates according to the edit operations at the character level required — subword level.

For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.

For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language, the Examples and the ContextSpellCheckerTestSpec.

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

Python API: ContextSpellCheckerApproach Scala API: ContextSpellCheckerApproach Source: ContextSpellCheckerApproach
Show Example
# For this example, we use the first Sherlock Holmes book as the training dataset.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")


tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

spellChecker = ContextSpellCheckerApproach() \
    .setInputCols("token") \
    .setOutputCol("corrected") \
    .setWordMaxDistance(3) \
    .setBatchSize(24) \
    .setEpochs(8) \
    .setLanguageModelClasses(1650)  # dependent on vocabulary size
    # .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])

path = "sherlockholmes.txt"
dataset = spark.read.text(path) \
    .toDF("text")
pipelineModel = pipeline.fit(dataset)
// For this example, we use the first Sherlock Holmes book as the training dataset.

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach

import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")


val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val spellChecker = new ContextSpellCheckerApproach()
  .setInputCols("token")
  .setOutputCol("corrected")
  .setWordMaxDistance(3)
  .setBatchSize(24)
  .setEpochs(8)
  .setLanguageModelClasses(1650)  // dependent on vocabulary size
  // .addVocabClass("_NAME_", names) // Extra classes for correction could be added like this

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  spellChecker
))

val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
  .toDF("text")
val pipelineModel = pipeline.fit(dataset)

Date2Chunk

Converts DATE type Annotations to CHUNK type.

This can be useful if annotators that follow DateMatcher or MultiDateMatcher require CHUNK types as input. The entity name in the metadata can be changed with setEntityName.
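For example, a minimal sketch of overriding the entity name in the metadata (the value "my_date" is an arbitrary choice for illustration):

date2Chunk = Date2Chunk() \
    .setInputCols(["date"]) \
    .setOutputCol("date_chunk") \
    .setEntityName("my_date")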

Input Annotator Types: DATE

Output Annotator Type: CHUNK

Python API: Date2Chunk Scala API: Date2Chunk Source: Date2Chunk
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = DateMatcher() \
    .setInputCols(["document"]) \
    .setOutputCol("date")

date2Chunk = Date2Chunk() \
    .setInputCols(["date"]) \
    .setOutputCol("date_chunk")

pipeline = Pipeline().setStages([
    documentAssembler,
    date,
    date2Chunk
])

data = spark.createDataFrame([["Omicron is a new variant of COVID-19, which the World Health Organization designated a variant of concern on Nov. 26, 2021/26/11."]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("date_chunk").show(1, truncate=False)
+----------------------------------------------------+
|date_chunk                                          |
+----------------------------------------------------+
|[{chunk, 118, 121, 2021/01/01, {sentence -> 0}, []}]|
+----------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.annotator._

import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val date = new DateMatcher()
  .setInputCols("document")
  .setOutputCol("date")

val date2Chunk = new Date2Chunk()
  .setInputCols("date")
  .setOutputCol("date_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  date,
  date2Chunk
))

val data = Seq(
"""Omicron is a new variant of COVID-19, which the World Health Organization designated a variant of concern on Nov. 26, 2021/26/11.""",
"""Neighbouring Austria has already locked down its population this week for at until 2021/10/12, becoming the first to reimpose such restrictions."""
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("date_chunk").show(false)
+----------------------------------------------------+
|date_chunk                                          |
+----------------------------------------------------+
|[{chunk, 118, 121, 2021/01/01, {sentence -> 0}, []}]|
|[{chunk, 83, 86, 2021/01/01, {sentence -> 0}, []}]  |
+----------------------------------------------------+

DateMatcher

Matches standard date formats into a provided format.

Reads from different forms of date and time expressions and converts them to a provided date format.

Extracts only one date per document. Use with sentence detector to find matches in each sentence. To extract multiple dates from a document, please use the MultiDateMatcher.
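For example, a minimal sketch of pairing it with a sentence detector (this works because SentenceDetector also outputs DOCUMENT type annotations):

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

date = DateMatcher() \
    .setInputCols("sentence") \
    .setOutputCol("date")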

Reads the following kind of dates:

"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
"Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
"last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
"next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
"at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"

For example "The 31st of April in the year 2008" will be converted into 2008/04/31.

Pretrained pipelines are available for this module, see Pipelines.

For extended examples of usage, see the Examples and the DateMatcherTestSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: DATE

Python API: DateMatcher Scala API: DateMatcher Source: DateMatcher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setAnchorDateYear(2020) \
    .setAnchorDateMonth(1) \
    .setAnchorDateDay(11) \
    .setDateFormat("yyyy/MM/dd")

pipeline = Pipeline().setStages([
    documentAssembler,
    date
])

data = spark.createDataFrame([["Fri, 21 Nov 1997"], ["next week at 7.30"], ["see you a day after"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("date").show(truncate=False)
+-------------------------------------------------+
|date                                             |
+-------------------------------------------------+
|[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
|[[date, 0, 8, 2020/01/18, [sentence -> 0], []]]  |
|[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
+-------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.DateMatcher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val date = new DateMatcher()
  .setInputCols("document")
  .setOutputCol("date")
  .setAnchorDateYear(2020)
  .setAnchorDateMonth(1)
  .setAnchorDateDay(11)
  .setDateFormat("yyyy/MM/dd")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  date
))

val data = Seq("Fri, 21 Nov 1997", "next week at 7.30", "see you a day after").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("date").show(false)
+-------------------------------------------------+
|date                                             |
+-------------------------------------------------+
|[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
|[[date, 0, 8, 2020/01/18, [sentence -> 0], []]]  |
|[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
+-------------------------------------------------+

DependencyParser

Trains an unlabeled parser that finds grammatical relations between two words in a sentence.

For instantiated/pretrained models, see DependencyParserModel.

The dependency parser provides information about word relationships. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions.

The required training data can be set in two different ways (only one can be chosen for a particular model):

  • A dependency treebank in the Penn Treebank format, set with setDependencyTreeBank
  • A dataset in the CoNLL-U format, set with setConllU

Apart from that, no additional training data is needed.
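For example, a minimal sketch of the CoNLL-U variant (the file path here is a placeholder):

dependencyParserApproach = DependencyParserApproach() \
    .setInputCols("sentence", "pos", "token") \
    .setOutputCol("dependency") \
    .setConllU("train.conllu")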

See DependencyParserApproachTestSpec for further reference on how to use this API.

Input Annotator Types: DOCUMENT, POS, TOKEN

Output Annotator Type: DEPENDENCY

Python API: DependencyParserApproach Scala API: DependencyParserApproach Source: DependencyParserApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

posTagger = PerceptronModel.pretrained() \
    .setInputCols("sentence", "token") \
    .setOutputCol("pos")

dependencyParserApproach = DependencyParserApproach() \
    .setInputCols("sentence", "pos", "token") \
    .setOutputCol("dependency") \
    .setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    posTagger,
    dependencyParserApproach
])

# Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

val dependencyParserApproach = new DependencyParserApproach()
  .setInputCols("sentence", "pos", "token")
  .setOutputCol("dependency")
  .setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  posTagger,
  dependencyParserApproach
))

// Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
val emptyDataSet = Seq.empty[String].toDF("text")
val pipelineModel = pipeline.fit(emptyDataSet)

Doc2Chunk

Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. The chunk text must be contained within the input DOCUMENT. The chunk column may be either StringType or ArrayType[StringType] (set with setIsArray). Useful for annotators that require a CHUNK type input.
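For example, a minimal sketch for a single-string chunk column (in contrast to the array variant shown in the example below):

chunkAssembler = Doc2Chunk() \
    .setInputCols("document") \
    .setChunkCol("target") \
    .setOutputCol("chunk") \
    .setIsArray(False)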

Input Annotator Types: DOCUMENT

Output Annotator Type: CHUNK

Python API: Doc2Chunk Scala API: Doc2Chunk Source: Doc2Chunk
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
chunkAssembler = Doc2Chunk() \
    .setInputCols("document") \
    .setChunkCol("target") \
    .setOutputCol("chunk") \
    .setIsArray(True)

data = spark.createDataFrame([[
    "Spark NLP is an open-source text processing library for advanced natural language processing.",
      ["Spark NLP", "text processing library", "natural language processing"]
]]).toDF("text", "target")

pipeline = Pipeline().setStages([documentAssembler, chunkAssembler]).fit(data)
result = pipeline.transform(data)

result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")
  .setIsArray(true)

val data = Seq(
  ("Spark NLP is an open-source text processing library for advanced natural language processing.",
    Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")

val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
val result = pipeline.transform(data)

result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+

Doc2Vec

Trains a Word2Vec model that creates vector representations of words in a text corpus.

The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.

We use the Word2Vec implementation from Spark ML, which uses the skip-gram model and hierarchical softmax to train the model. The variable names in the implementation match the original C implementation.

For instantiated/pretrained models, see Doc2VecModel.

Sources :

For the original C implementation, see https://code.google.com/p/word2vec/

For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
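Since the approach wraps Spark ML's Word2Vec, the usual training parameters can also be tuned; a minimal sketch (the values below are arbitrary):

embeddings = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings") \
    .setVectorSize(100) \
    .setMinCount(5)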

Input Annotator Types: TOKEN

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: Doc2VecApproach Scala API: Doc2VecApproach Source: Doc2VecApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      embeddings
    ])

path = "sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")
pipelineModel = pipeline.fit(dataset)
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Doc2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = new Doc2VecApproach()
  .setInputCols("token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings
  ))

val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
  .toDF("text")
val pipelineModel = pipeline.fit(dataset)

DocumentAssembler

Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The DocumentAssembler reads String columns. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.
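For example, a minimal sketch enabling one of the cleanup modes (here "shrink", which is assumed to collapse new lines, tabs and repeated whitespace):

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")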

For more extended examples on document pre-processing see the Examples.

Input Annotator Types: NONE

Output Annotator Type: DOCUMENT

Python API: DocumentAssembler Scala API: DocumentAssembler Source: DocumentAssembler
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

result = documentAssembler.transform(data)

result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val result = documentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)

DocumentCharacterTextSplitter

Annotator which splits large documents into chunks of roughly given size.

DocumentCharacterTextSplitter takes a list of separators. It takes the separators in order and splits subtexts if they are over the chunk length, considering optional overlap of the chunks.

For example, given chunk size 20 and overlap 5:

"He was, I take it, the most perfect reasoning and observing machine that the world has seen."

["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]

Additionally, you can set

  • custom patterns with setSplitPatterns
  • whether patterns should be interpreted as regex with setPatternsAreRegex
  • whether to keep the separators with setKeepSeparators
  • whether to trim whitespaces with setTrimWhitespace
  • whether to explode the splits to individual rows with setExplodeSplits

For extended examples of usage, see the DocumentCharacterTextSplitterTest.
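For example, a minimal sketch using these optional setters (the separators, sizes and flags below are arbitrary illustration values):

textSplitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(1000) \
    .setChunkOverlap(100) \
    .setSplitPatterns(["\n\n", "\n", " "]) \
    .setPatternsAreRegex(False) \
    .setKeepSeparators(True) \
    .setTrimWhitespace(True)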

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: DocumentCharacterTextSplitter Scala API: DocumentCharacterTextSplitter Source: DocumentCharacterTextSplitter
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

textDF = spark.read.text(
   "sherlockholmes.txt",
   wholetext=True
).toDF("text")

documentAssembler = DocumentAssembler().setInputCol("text")

textSplitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(20000) \
    .setChunkOverlap(200) \
    .setExplodeSplits(True)

pipeline = Pipeline().setStages([documentAssembler, textSplitter])
result = pipeline.fit(textDF).transform(textDF)
result.selectExpr(
      "splits.result",
      "splits[0].begin",
      "splits[0].end",
      "splits[0].end - splits[0].begin as length") \
    .show(8, truncate = 80)
+--------------------------------------------------------------------------------+---------------+-------------+------+
|                                                                          result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|              0|        19994| 19994|
|["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...|          19798|        39395| 19597|
|["How did that help you?"\n\n"It was all-important. When a woman thinks that ...|          39371|        59242| 19871|
|["'But,' said I, 'there would be millions of red-headed men who\nwould apply....|          59166|        77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a\nvery capab...|          77835|        97769| 19934|
|["And yet I am not convinced of it," I answered. "The cases which\ncome to li...|          97771|       117248| 19477|
|["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...|         117250|       137242| 19992|
|["That sounds a little paradoxical."\n\n"But it is profoundly True. Singulari...|         137244|       157171| 19927|
+--------------------------------------------------------------------------------+---------------+-------------+------+
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.DocumentAssembler
import org.apache.spark.ml.Pipeline

val textDF =
  spark.read
    .option("wholetext", "true")
    .text("src/test/resources/spell/sherlockholmes.txt")
    .toDF("text")

val documentAssembler = new DocumentAssembler().setInputCol("text")
val textSplitter = new DocumentCharacterTextSplitter()
  .setInputCols("document")
  .setOutputCol("splits")
  .setChunkSize(20000)
  .setChunkOverlap(200)
  .setExplodeSplits(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
val result = pipeline.fit(textDF).transform(textDF)

result
  .selectExpr(
    "splits.result",
    "splits[0].begin",
    "splits[0].end",
    "splits[0].end - splits[0].begin as length")
  .show(8, truncate = 80)
+--------------------------------------------------------------------------------+---------------+-------------+------+
|                                                                          result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|              0|        19994| 19994|
|["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...|          19798|        39395| 19597|
|["How did that help you?"\n\n"It was all-important. When a woman thinks that ...|          39371|        59242| 19871|
|["'But,' said I, 'there would be millions of red-headed men who\nwould apply....|          59166|        77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a\nvery capab...|          77835|        97769| 19934|
|["And yet I am not convinced of it," I answered. "The cases which\ncome to li...|          97771|       117248| 19477|
|["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...|         117250|       137242| 19992|
|["That sounds a little paradoxical."\n\n"But it is profoundly true. Singulari...|         137244|       157171| 19927|
+--------------------------------------------------------------------------------+---------------+-------------+------+

DocumentNormalizer

Annotator which normalizes raw text from tagged text, e.g. scraped web pages or xml documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Removal of unwanted characters can be applied with a specific policy, and lowercase normalization can also be applied.

For extended examples of usage, see the Examples.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: DocumentNormalizer Scala API: DocumentNormalizer Source: DocumentNormalizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

cleanUpPatterns = ["<[^>]>"]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    documentNormalizer
])

text = """
<div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
    THE WORLD'S LARGEST WEB DEVELOPER SITE
    <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
    <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
</div>

</div>"""
data = spark.createDataFrame([[text]]).toDF("text")
pipelineModel = pipeline.fit(data)

result = pipelineModel.transform(data)
result.selectExpr("normalizedDocument.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.DocumentNormalizer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val cleanUpPatterns = Array("<[^>]*>")

val documentNormalizer = new DocumentNormalizer()
  .setInputCols("document")
  .setOutputCol("normalizedDocument")
  .setAction("clean")
  .setPatterns(cleanUpPatterns)
  .setReplacement(" ")
  .setPolicy("pretty_all")
  .setLowercase(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  documentNormalizer
))

val text =
  """
<div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
  THE WORLD'S LARGEST WEB DEVELOPER SITE
  <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
  <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
</div>

</div>"""
val data = Seq(text).toDF("text")
val pipelineModel = pipeline.fit(data)

val result = pipelineModel.transform(data)
result.selectExpr("normalizedDocument.result").show(truncate=false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

DocumentSimilarityRanker

Annotator that uses LSH techniques present in Spark ML lib to execute approximate nearest neighbors search on top of sentence embeddings.

It aims to capture the semantic meaning of a document in a dense, continuous vector space and return it to the ranker search.

For instantiated/pretrained models, see DocumentSimilarityRankerModel.

For extended examples of usage, see the jupyter notebook Document Similarity Ranker for Spark NLP.

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: DOC_SIMILARITY_RANKINGS

Python API: DocumentSimilarityRankerApproach Scala API: DocumentSimilarityRankerApproach Source: DocumentSimilarityRankerApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.annotator.similarity.document_similarity_ranker import *

document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("document")
sentence_embeddings = E5Embeddings.pretrained() \
            .setInputCols(["document"]) \
            .setOutputCol("sentence_embeddings")
document_similarity_ranker = DocumentSimilarityRankerApproach() \
            .setInputCols("sentence_embeddings") \
            .setOutputCol("doc_similarity_rankings") \
            .setSimilarityMethod("brp") \
            .setNumberOfNeighbours(1) \
            .setBucketLength(2.0) \
            .setNumHashTables(3) \
            .setVisibleDistances(True) \
            .setIdentityRanking(False)
document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
        .setInputCols("doc_similarity_rankings") \
        .setOutputCols(
            "finished_doc_similarity_rankings_id",
            "finished_doc_similarity_rankings_neighbors") \
        .setExtractNearestNeighbor(True)
pipeline = Pipeline(stages=[
            document_assembler,
            sentence_embeddings,
            document_similarity_ranker,
            document_similarity_ranker_finisher
        ])
# Let's use a dataset where we can visually control similarity
# Documents are coupled, as 1-2, 3-4, 5-6, 7-8 and they were created to be similar on purpose
data = spark.createDataFrame([
    ["First document, this is my first sentence. This is my second sentence."],
    ["Second document, this is my second sentence. This is my second sentence."],
    ["Third document, climate change is arguably one of the most pressing problems of our time."],
    ["Fourth document, climate change is definitely one of the most pressing problems of our time."],
    ["Fifth document, Florence in Italy, is among the most beautiful cities in Europe."],
    ["Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France."],
    ["Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France."],
    ["Eighth document, the warmest place in France is the French Riviera coast in Southern France."]
]).toDF("text")

docSimRankerPipeline = pipeline.fit(data).transform(data)

(
    docSimRankerPipeline
        .select(
               "finished_doc_similarity_rankings_id",
               "finished_doc_similarity_rankings_neighbors"
        ).show(10, False)
)
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612                         |[(1634839239,0.12448559591306324)]        |
|1634839239                         |[(1510101612,0.12448559591306324)]        |
|-612640902                         |[(1274183715,0.1220122862046063)]         |
|1274183715                         |[(-612640902,0.1220122862046063)]         |
|-1320876223                        |[(1293373212,0.17848855164122393)]        |
|1293373212                         |[(-1320876223,0.17848855164122393)]       |
|-1548374770                        |[(-1719102856,0.23297156732534166)]       |
|-1719102856                        |[(-1548374770,0.23297156732534166)]       |
+-----------------------------------+------------------------------------------+
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.similarity.DocumentSimilarityRankerApproach
import com.johnsnowlabs.nlp.finisher.DocumentSimilarityRankerFinisher
import org.apache.spark.ml.Pipeline

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceEmbeddings = RoBertaSentenceEmbeddings
  .pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val documentSimilarityRanker = new DocumentSimilarityRankerApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("doc_similarity_rankings")
  .setSimilarityMethod("brp")
  .setNumberOfNeighbours(1)
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setVisibleDistances(true)
  .setIdentityRanking(false)

val documentSimilarityRankerFinisher = new DocumentSimilarityRankerFinisher()
  .setInputCols("doc_similarity_rankings")
  .setOutputCols(
    "finished_doc_similarity_rankings_id",
    "finished_doc_similarity_rankings_neighbors")
  .setExtractNearestNeighbor(true)

// Let's use a dataset where we can visually control similarity
// Documents are coupled, as 1-2, 3-4, 5-6, 7-8 and they were created to be similar on purpose
val data = Seq(
  "First document, this is my first sentence. This is my second sentence.",
  "Second document, this is my second sentence. This is my second sentence.",
  "Third document, climate change is arguably one of the most pressing problems of our time.",
  "Fourth document, climate change is definitely one of the most pressing problems of our time.",
  "Fifth document, Florence in Italy, is among the most beautiful cities in Europe.",
  "Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France.",
  "Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France.",
  "Eighth document, the warmest place in France is the French Riviera coast in Southern France.")
  .toDF("text")

val pipeline = new Pipeline().setStages(
  Array(
    documentAssembler,
    sentenceEmbeddings,
    documentSimilarityRanker,
    documentSimilarityRankerFinisher))

val result = pipeline.fit(data).transform(data)

result
  .select("finished_doc_similarity_rankings_id", "finished_doc_similarity_rankings_neighbors")
  .show(10, truncate = false)
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612                         |[(1634839239,0.12448559591306324)]        |
|1634839239                         |[(1510101612,0.12448559591306324)]        |
|-612640902                         |[(1274183715,0.1220122862046063)]         |
|1274183715                         |[(-612640902,0.1220122862046063)]         |
|-1320876223                        |[(1293373212,0.17848855164122393)]        |
|1293373212                         |[(-1320876223,0.17848855164122393)]       |
|-1548374770                        |[(-1719102856,0.23297156732534166)]       |
|-1719102856                        |[(-1548374770,0.23297156732534166)]       |
+-----------------------------------+------------------------------------------+

DocumentTokenSplitter

Annotator that splits large documents into smaller documents based on the number of tokens in the text.

Currently, DocumentTokenSplitter splits the text by whitespaces to create the tokens. The number of these tokens will then be used as a measure of the text length. In the future, other tokenization techniques will be supported.

For example, given 3 tokens and overlap 1:

"He was, I take it, the most perfect reasoning and observing machine that the world has seen."

["He was, I", "I take it,", "it, the most", "most perfect reasoning", "reasoning and observing", "observing machine that", "that the world", "world has seen."]

Additionally, you can set

  • whether to trim whitespaces with setTrimWhitespace
  • whether to explode the splits to individual rows with setExplodeSplits

For extended examples of usage, see the DocumentTokenSplitterTest.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: DocumentTokenSplitter Scala API: DocumentTokenSplitter Source: DocumentTokenSplitter
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

textDF = spark.read.text(
   "sherlockholmes.txt",
   wholetext=True
).toDF("text")

documentAssembler = DocumentAssembler().setInputCol("text")

textSplitter = DocumentTokenSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setNumTokens(512) \
    .setTokenOverlap(10) \
    .setExplodeSplits(True)

pipeline = Pipeline().setStages([documentAssembler, textSplitter])

result = pipeline.fit(textDF).transform(textDF)
result.selectExpr(
      "splits.result as result",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length",
      "splits[0].metadata.numTokens as tokens") \
    .show(8, truncate = 80)
+--------------------------------------------------------------------------------+-----+-----+------+------+
|                                                                          result|begin|  end|length|tokens|
+--------------------------------------------------------------------------------+-----+-----+------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|    0| 3018|  3018|   512|
|[study of crime, and occupied his\nimmense faculties and extraordinary powers...| 2950| 5707|  2757|   512|
|[but as I have changed my clothes I can't imagine how you\ndeduce it. As to M...| 5659| 8483|  2824|   512|
|[quarters received. Be in your chamber then at that hour, and do\nnot take it...| 8427|11241|  2814|   512|
|[a pity\nto miss it."\n\n"But your client--"\n\n"Never mind him. I may want y...|11188|13970|  2782|   512|
|[person who employs me wishes his agent to be unknown to\nyou, and I may conf...|13918|16898|  2980|   512|
|[letters back."\n\n"Precisely so. But how--"\n\n"Was there a secret marriage?...|16836|19744|  2908|   512|
|[seven hundred in\nnotes," he said.\n\nHolmes scribbled a receipt upon a shee...|19683|22551|  2868|   512|
+--------------------------------------------------------------------------------+-----+-----+------+------+
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.DocumentAssembler
import org.apache.spark.ml.Pipeline

val textDF =
  spark.read
    .option("wholetext", "true")
    .text("src/test/resources/spell/sherlockholmes.txt")
    .toDF("text")

val documentAssembler = new DocumentAssembler().setInputCol("text")
val textSplitter = new DocumentTokenSplitter()
  .setInputCols("document")
  .setOutputCol("splits")
  .setNumTokens(512)
  .setTokenOverlap(10)
  .setExplodeSplits(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
val result = pipeline.fit(textDF).transform(textDF)

result
  .selectExpr(
    "splits.result as result",
    "splits[0].begin as begin",
    "splits[0].end as end",
    "splits[0].end - splits[0].begin as length",
    "splits[0].metadata.numTokens as tokens")
  .show(8, truncate = 80)
+--------------------------------------------------------------------------------+-----+-----+------+------+
|                                                                          result|begin|  end|length|tokens|
+--------------------------------------------------------------------------------+-----+-----+------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|    0| 3018|  3018|   512|
|[study of crime, and occupied his\nimmense faculties and extraordinary powers...| 2950| 5707|  2757|   512|
|[but as I have changed my clothes I can't imagine how you\ndeduce it. As to M...| 5659| 8483|  2824|   512|
|[quarters received. Be in your chamber then at that hour, and do\nnot take it...| 8427|11241|  2814|   512|
|[a pity\nto miss it."\n\n"But your client--"\n\n"Never mind him. I may want y...|11188|13970|  2782|   512|
|[person who employs me wishes his agent to be unknown to\nyou, and I may conf...|13918|16898|  2980|   512|
|[letters back."\n\n"Precisely so. But how--"\n\n"Was there a secret marriage?...|16836|19744|  2908|   512|
|[seven hundred in\nnotes," he said.\n\nHolmes scribbled a receipt upon a shee...|19683|22551|  2868|   512|
+--------------------------------------------------------------------------------+-----+-----+------+------+

EmbeddingsFinisher

Extracts embeddings from Annotations into a more easily usable form.

This is useful, for example, for the outputs of WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.

By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or vectors which are compatible with Spark ML functions such as LDA, K-means, Random Forest classifiers, or any other function that requires a featureCol.

For more extended examples see the Examples.

Input Annotator Types: EMBEDDINGS

Output Annotator Type: NONE

Python API: EmbeddingsFinisher Scala API: EmbeddingsFinisher Source: EmbeddingsFinisher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols("token") \
    .setOutputCol("normalized")

stopwordsCleaner = StopWordsCleaner() \
    .setInputCols("normalized") \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

gloveEmbeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols("document", "cleanTokens") \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False)

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols("embeddings") \
    .setOutputCols("finished_sentence_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]) \
    .toDF("text")
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer,
    stopwordsCleaner,
    gloveEmbeddings,
    embeddingsFinisher
]).fit(data)

result = pipeline.transform(data)
resultWithSize = result.selectExpr("explode(finished_sentence_embeddings) as embeddings")

resultWithSize.show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel}

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val gloveEmbeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("document", "cleanTokens")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_sentence_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val data = Seq("Spark NLP is an open-source text processing library.")
  .toDF("text")
val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  gloveEmbeddings,
  embeddingsFinisher
)).fit(data)

val result = pipeline.transform(data)
val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)")
  .map { row =>
    val vector = row.getAs[org.apache.spark.ml.linalg.DenseVector](0)
    (vector.size, vector)
  }.toDF("size", "vector")

resultWithSize.show(5, 80)
+----+--------------------------------------------------------------------------------+
|size|                                                                          vector|
+----+--------------------------------------------------------------------------------+
| 100|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
| 100|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
| 100|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
| 100|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
| 100|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+----+--------------------------------------------------------------------------------+
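As a follow-up, the finished vector column can be passed straight to Spark ML estimators that expect a features column. A minimal sketch, assuming the result DataFrame produced by the Python example above (k=2 is an arbitrary choice for illustration):

from pyspark.ml.clustering import KMeans

# Each row holds an array of vectors (one per annotation), so explode it into one vector per row
features = result.selectExpr("explode(finished_sentence_embeddings) as features")

# Cluster the embeddings with K-means
kmeansModel = KMeans(k=2, featuresCol="features").fit(features)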

EntityRuler

Fits an Annotator to match exact strings or regex patterns provided in a file against a Document and assigns them a named entity. The definitions can contain any number of named entities.

There are multiple ways and formats to set the extraction resource. It is possible to set it either as a “JSON”, “JSONL” or “CSV” file. A path to the file needs to be provided to setPatternsResource. The file format needs to be set as the “format” field in the option parameter map and depending on the file type, additional parameters might need to be set.

If the file is in a JSON format, then the rule definitions need to be given in a list with the fields “id”, “label” and “patterns”:

 [
  {
    "id": "person-regex",
    "label": "PERSON",
    "patterns": ["\\w+\\s\\w+", "\\w+-\\w+"]
  },
  {
    "id": "locations-words",
    "label": "LOCATION",
    "patterns": ["Winterfell"]
  }
]
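A path to such a file could then be set like this (a minimal sketch; "patterns.json" is a placeholder path):

entityRuler = EntityRulerApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entities") \
    .setPatternsResource("patterns.json", ReadAs.TEXT, {"format": "json"})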

The same fields also apply to a file in the JSONL format:

{"id": "names-with-j", "label": "PERSON", "patterns": ["Jon", "John", "John Snow"]}
{"id": "names-with-s", "label": "PERSON", "patterns": ["Stark", "Snow"]}
{"id": "names-with-e", "label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]}

In order to use a CSV file, an additional parameter “delimiter” needs to be set. In this case, the delimiter might be set by using .setPatternsResource("patterns.csv", ReadAs.TEXT, Map("format"->"csv", "delimiter" -> "\\|"))

PERSON|Jon
PERSON|John
PERSON|John Snow
LOCATION|Winterfell

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Python API: EntityRulerApproach Scala API: EntityRulerApproach Source: EntityRulerApproach
Show Example
# In this example, the entities file has the form of
#
# PERSON|Jon
# PERSON|John
# PERSON|John Snow
# LOCATION|Winterfell
#
# where each line represents an entity and the associated string delimited by "|".

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
entityRuler = EntityRulerApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entities") \
    .setPatternsResource(
      "patterns.csv",
      ReadAs.TEXT,
      {"format": "csv", "delimiter": "\\|"}
    )
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entityRuler
])
data = spark.createDataFrame([["Jon Snow wants to be lord of Winterfell."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(entities)").show(truncate=False)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 0, 2, Jon, [entity -> PERSON, sentence -> 0], []]           |
|[chunk, 29, 38, Winterfell, [entity -> LOCATION, sentence -> 0], []]|
+--------------------------------------------------------------------+
// In this example, the entities file has the form of
//
// PERSON|Jon
// PERSON|John
// PERSON|John Snow
// LOCATION|Winterfell
//
// where each line represents an entity and the associated string delimited by "|".

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.er.EntityRulerApproach
import com.johnsnowlabs.nlp.util.io.ReadAs

import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val entityRuler = new EntityRulerApproach()
  .setInputCols("document", "token")
  .setOutputCol("entities")
  .setPatternsResource(
    "src/test/resources/entity-ruler/patterns.csv",
    ReadAs.TEXT,
    {"format": "csv", "delimiter": "|")}
  )

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  entityRuler
))

val data = Seq("Jon Snow wants to be lord of Winterfell.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(entities)").show(false)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 0, 2, Jon, [entity -> PERSON, sentence -> 0], []]           |
|[chunk, 29, 38, Winterfell, [entity -> LOCATION, sentence -> 0], []]|
+--------------------------------------------------------------------+

Finisher

Converts annotation results into a format that is easier to use. It is useful for extracting the results from Spark NLP pipelines. The Finisher outputs annotation values as strings.

For more extended examples on document pre-processing see the Examples.
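
A few commonly used output options, sketched below (the column names are assumptions):

finisher = Finisher() \
    .setInputCols(["entities"]) \
    .setOutputCols(["finished_entities"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False) \
    .setIncludeMetadata(False)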

Input Annotator Types: ANY

Output Annotator Type: NONE

Python API: Finisher Scala API: Finisher Source: Finisher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline

data = spark.createDataFrame([[1, "New York and New Jersey aren't that far apart actually."]]).toDF("id", "text")

# Extracts Named Entities amongst other things
pipeline = PretrainedPipeline("explain_document_dl")

finisher = Finisher().setInputCols("entities").setOutputCols("output")
explainResult = pipeline.transform(data)

explainResult.selectExpr("explode(entities)").show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------+

result = finisher.transform(explainResult)
result.select("output").show(truncate=False)
+----------------------+
|output                |
+----------------------+
|[New York, New Jersey]|
+----------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Finisher

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val finisher = new Finisher().setInputCols("entities").setOutputCols("output")
val explainResult = pipeline.transform(data)

explainResult.selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------+

val result = finisher.transform(explainResult)
result.select("output").show(false)
+----------------------+
|output                |
+----------------------+
|[New York, New Jersey]|
+----------------------+

GraphExtraction

Extracts a dependency graph between entities.

The GraphExtraction class takes, for example, entities extracted by a NerDLModel and creates a dependency tree that describes how the entities relate to each other. The result is stored in a triple format: nodes represent the entities and the edges represent the relations between those entities. The graph can then be used to find relevant relationships between words.

Both the DependencyParserModel and TypedDependencyParserModel need to be present in the pipeline. There are two ways to set them:

  1. Both Annotators are present in the pipeline already. The dependencies are taken implicitly from these two Annotators.
  2. Setting setMergeEntities to true will download the default pretrained models for those two Annotators automatically. The specific models can also be set with setDependencyParserModel and setTypedDependencyParserModel:
          val graph_extraction = new GraphExtraction()
            .setInputCols("document", "token", "ner")
            .setOutputCol("graph")
            .setRelationshipTypes(Array("prefer-LOC"))
            .setMergeEntities(true)
          //.setDependencyParserModel(Array("dependency_conllu", "en",  "public/models"))
          //.setTypedDependencyParserModel(Array("dependency_typed_conllu", "en",  "public/models"))
    

To transform the resulting graph into a more generic form such as RDF, see the GraphFinisher.
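
In Python, option 2 above could be configured like this (a sketch mirroring the Scala snippet; the model names stay commented out, as in the original):

graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setRelationshipTypes(["prefer-LOC"]) \
    .setMergeEntities(True)
#   .setDependencyParserModel(["dependency_conllu", "en", "public/models"])
#   .setTypedDependencyParserModel(["dependency_typed_conllu", "en", "public/models"])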

Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotator Type: NODE

Python API: GraphExtraction Scala API: GraphExtraction Source: GraphExtraction
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

nerTagger = NerDLModel.pretrained() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

posTagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

dependencyParser = DependencyParserModel.pretrained() \
    .setInputCols(["sentence", "pos", "token"]) \
    .setOutputCol("dependency")

typedDependencyParser = TypedDependencyParserModel.pretrained() \
    .setInputCols(["dependency", "pos", "token"]) \
    .setOutputCol("dependency_type")

graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setRelationshipTypes(["prefer-LOC"])

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger,
    posTagger,
    dependencyParser,
    typedDependencyParser,
    graph_extraction
])

data = spark.createDataFrame([["You and John prefer the morning flight through Denver"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("graph").show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------+
|graph                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------+
|[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]|
+-----------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserModel
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.annotators.GraphExtraction

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

val nerTagger = NerDLModel.pretrained()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

val dependencyParser = DependencyParserModel.pretrained()
  .setInputCols("sentence", "pos", "token")
  .setOutputCol("dependency")

val typedDependencyParser = TypedDependencyParserModel.pretrained()
  .setInputCols("dependency", "pos", "token")
  .setOutputCol("dependency_type")

val graph_extraction = new GraphExtraction()
  .setInputCols("document", "token", "ner")
  .setOutputCol("graph")
  .setRelationshipTypes(Array("prefer-LOC"))

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger,
  posTagger,
  dependencyParser,
  typedDependencyParser,
  graph_extraction
))

val data = Seq("You and John prefer the morning flight through Denver").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("graph").show(false)
+-----------------------------------------------------------------------------------------------------------------+
|graph                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------+
|[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]|
+-----------------------------------------------------------------------------------------------------------------+

GraphFinisher

Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF.

Input Annotator Types: NONE

Output Annotator Type: NONE

Python API: GraphFinisher Scala API: GraphFinisher Source: GraphFinisher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# This is a continuation of the example of
# GraphExtraction. To see how the graph is extracted, see the
# documentation of that class.

graphFinisher = GraphFinisher() \
    .setInputCol("graph") \
    .setOutputCol("graph_finished") \
    .setOutputAsArray(False)

finishedResult = graphFinisher.transform(result)
finishedResult.select("text", "graph_finished").show(truncate=False)
+-----------------------------------------------------+-----------------------------------------------------------------------+
|text                                                 |graph_finished                                                         |
+-----------------------------------------------------+-----------------------------------------------------------------------+
|You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
+-----------------------------------------------------+-----------------------------------------------------------------------+
// This is a continuation of the example of
// [[com.johnsnowlabs.nlp.annotators.GraphExtraction GraphExtraction]]. To see how the graph is extracted, see the
// documentation of that class.
import com.johnsnowlabs.nlp.GraphFinisher

val graphFinisher = new GraphFinisher()
  .setInputCol("graph")
  .setOutputCol("graph_finished")
  .setOutputAsArray(false)

val finishedResult = graphFinisher.transform(result)
finishedResult.select("text", "graph_finished").show(false)
+-----------------------------------------------------+-----------------------------------------------------------------------+
|text                                                 |graph_finished                                                         |
+-----------------------------------------------------+-----------------------------------------------------------------------+
|You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
+-----------------------------------------------------+-----------------------------------------------------------------------+

ImageAssembler

Prepares images read by Spark into a format that is processable by Spark NLP. This component is needed to process images.

Input Annotator Types: NONE

Output Annotator Type: IMAGE

Python API: ImageAssembler Scala API: ImageAssembler Source: ImageAssembler
Show Example
import sparknlp
from sparknlp.base import *
from pyspark.ml import Pipeline

data = spark.read.format("image").load("./tmp/images/").toDF("image")
imageAssembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")

result = imageAssembler.transform(data)
result.select("image_assembler").show()
result.select("image_assembler").printSchema()
root
  |-- image_assembler: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- annotatorType: string (nullable = true)
  |    |    |-- origin: string (nullable = true)
  |    |    |-- height: integer (nullable = true)
  |    |    |-- width: integer (nullable = true)
  |    |    |-- nChannels: integer (nullable = true)
  |    |    |-- mode: integer (nullable = true)
  |    |    |-- result: binary (nullable = true)
  |    |    |-- metadata: map (nullable = true)
  |    |    |    |-- key: string
  |    |    |    |-- value: string (valueContainsNull = true)
import com.johnsnowlabs.nlp.ImageAssembler
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load("src/test/resources/image/")

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val pipeline = new Pipeline().setStages(Array(imageAssembler))
val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
pipelineDF.printSchema()
root
 |-- image_assembler: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- origin: string (nullable = true)
 |    |    |-- height: integer (nullable = false)
 |    |    |-- width: integer (nullable = false)
 |    |    |-- nChannels: integer (nullable = false)
 |    |    |-- mode: integer (nullable = false)
 |    |    |-- result: binary (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)

LanguageDetectorDL

Language Identification and Detection by using CNN and RNN architectures in TensorFlow.

LanguageDetectorDL is an annotator that detects the language of documents or sentences depending on the inputCols. The models are trained on large datasets such as Wikipedia and Tatoeba. Depending on the language (how similar the characters are), the LanguageDetectorDL works best with text longer than 140 characters. The output is a language code in Wiki Code style.

Pretrained models can be loaded with pretrained of the companion object:

val languageDetector = LanguageDetectorDL.pretrained()
  .setInputCols("sentence")
  .setOutputCol("language")

If no values are provided, the default model is "ld_wiki_tatoeba_cnn_21" and the default language is "xx" (meaning multi-lingual). For available pretrained models please see the Models Hub.
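
The defaults can also be requested explicitly, following the pretrained(name, language) pattern (a minimal Python sketch):

languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("language")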

For extended examples of usage, see the Examples And the LanguageDetectorDLTestSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: LANGUAGE

Python API: LanguageDetectorDL Scala API: LanguageDetectorDL Source: LanguageDetectorDL
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

languageDetector = LanguageDetectorDL.pretrained() \
    .setInputCols("document") \
    .setOutputCol("language")

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      languageDetector
    ])

data = spark.createDataFrame([
    ["Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages."],
    ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."],
    ["Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."]
]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("language.result").show(truncate=False)
+------+
|result|
+------+
|[en]  |
|[fr]  |
|[de]  |
+------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val languageDetector = LanguageDetectorDL.pretrained()
  .setInputCols("document")
  .setOutputCol("language")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    languageDetector
  ))

val data = Seq(
  "Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.",
  "Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.",
  "Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("language.result").show(false)
+------+
|result|
+------+
|[en]  |
|[fr]  |
|[de]  |
+------+

Lemmatizer

Class to find lemmas out of words with the objective of returning a base dictionary word. Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary. The dictionary can be set as a delimited text file. Pretrained models can be loaded with LemmatizerModel.pretrained.

For available pretrained models please see the Models Hub. For extended examples of usage, see the Examples.
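
Instead of training with a dictionary, a pretrained model can be dropped into the same pipeline position (a minimal sketch):

lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")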

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

Python API: Lemmatizer Scala API: Lemmatizer Source: Lemmatizer
Show Example
# In this example, the lemma dictionary `lemmas_small.txt` has the form of
#
# ...
# pick	->	pick	picks	picking	picked
# peck	->	peck	pecking	pecked	pecks
# pickle	->	pickle	pickles	pickled	pickling
# pepper	->	pepper	peppers	peppered	peppering
# ...
#
# where each key is delimited by `->` and values are delimited by `\t`

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      sentenceDetector,
      tokenizer,
      lemmatizer
    ])

data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
    .toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
// In this example, the lemma dictionary `lemmas_small.txt` has the form of
//
// ...
// pick	->	pick	picks	picking	picked
// peck	->	peck	pecking	pecked	pecks
// pickle	->	pickle	pickles	pickled	pickling
// pepper	->	pepper	peppers	peppered	peppering
// ...
//
// where each key is delimited by `->` and values are delimited by `\t`
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Lemmatizer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    lemmatizer
  ))

val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
  .toDF("text")

val result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(false)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+

MultiClassifierDL

Trains a MultiClassifierDL for Multi-label Text Classification.

MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes.

For instantiated/pretrained models, see MultiClassifierDLModel.

The input to MultiClassifierDL are Sentence Embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings or SentenceEmbeddings.

In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).

Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example will show how to create the test dataset:

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))

val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
  .fit(test)
  .transform(test)
  .write
  .mode("overwrite")
  .parquet("test_data")

val multiClassifier = new MultiClassifierDLApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")
  .setLabelColumn("label")
  .setTestDataset("test_data")
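
In Python, the test set can be prepared and wired in the same way (a sketch mirroring the Scala snippet above; data is assumed to be a DataFrame with "text" and "label" columns):

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, MultiClassifierDLApproach
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
embeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Hold out 20% of the data and persist it as parquet for metric monitoring
train, test = data.randomSplit([0.8, 0.2])
Pipeline().setStages([documentAssembler, embeddings]) \
    .fit(test).transform(test) \
    .write.mode("overwrite").parquet("test_data")

multiClassifier = MultiClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category") \
    .setLabelColumn("label") \
    .setTestDataset("test_data")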

For extended examples of usage, see the Examples and the MultiClassifierDLTestSpec.

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: CATEGORY

Note: This annotator accepts a label column containing a single item of type String, Int, Float, or Double. UniversalSentenceEncoder, BertSentenceEmbeddings, SentenceEmbeddings or other sentence-based embeddings can be used for the inputCol.

Python API: MultiClassifierDLApproach Scala API: MultiClassifierDLApproach Source: MultiClassifierDLApproach
Show Example
# In this example, the training data has the form
#
# +----------------+--------------------+--------------------+
# |              id|                text|              labels|
# +----------------+--------------------+--------------------+
# |ed58abb40640f983|PN NewsYou mean ... |             [toxic]|
# |a1237f726b5f5d89|Dude.  Place the ...|   [obscene, insult]|
# |24b0d6c8733c2abe|Thanks  - thanks ...|            [insult]|
# |8c4478fb239bcfc0|" Gee, 5 minutes ...|[toxic, obscene, ...|
# +----------------+--------------------+--------------------+

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Process training data to create text with associated array of labels

trainDataset.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- text: string (nullable = true)
#  |-- labels: array (nullable = true)
#  |    |-- element: string (containsNull = true)


# Then create pipeline for training
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

embeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("embeddings")

docClassifier = MultiClassifierDLApproach() \
    .setInputCols("embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("labels") \
    .setBatchSize(128) \
    .setMaxEpochs(10) \
    .setLr(1e-3) \
    .setThreshold(0.5) \
    .setValidationSplit(0.1)

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        embeddings,
        docClassifier
      ]
    )

pipelineModel = pipeline.fit(trainDataset)
// In this example, the training data has the form (Note: labels can be arbitrary)
//
// mr,ref
// "name[Alimentum], area[city centre], familyFriendly[no], near[Burger King]",Alimentum is an adult establish found in the city centre area near Burger King.
// "name[Alimentum], area[city centre], familyFriendly[yes]",Alimentum is a family-friendly place in the city centre.
// ...
//
// It needs some pre-processing first, so the labels are of type `Array[String]`. This can be done like so:

import spark.implicits._
import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLApproach
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.{col, udf}

// Process training data to create text with associated array of labels
def splitAndTrim = udf { labels: String =>
  labels.split(", ").map(x=>x.trim)
}

val smallCorpus = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .option("mode", "DROPMALFORMED")
  .csv("src/test/resources/classifier/e2e.csv")
  .withColumn("labels", splitAndTrim(col("mr")))
  .withColumn("text", col("ref"))
  .drop("mr")

smallCorpus.printSchema()
// root
// |-- ref: string (nullable = true)
// |-- labels: array (nullable = true)
// |    |-- element: string (containsNull = true)

// Then create pipeline for training
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

val embeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")

val docClassifier = new MultiClassifierDLApproach()
  .setInputCols("embeddings")
  .setOutputCol("category")
  .setLabelColumn("labels")
  .setBatchSize(128)
  .setMaxEpochs(10)
  .setLr(1e-3f)
  .setThreshold(0.5f)
  .setValidationSplit(0.1f)

val pipeline = new Pipeline()
  .setStages(
    Array(
      documentAssembler,
      embeddings,
      docClassifier
    )
  )

val pipelineModel = pipeline.fit(smallCorpus)

MultiDateMatcher

Matches standard date formats into a provided format.

Reads the following kinds of dates:

"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
"Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
"last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
"next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
"at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"

For example "The 31st of April in the year 2008" will be converted into 2008/04/31.

For extended examples of usage, see the Examples and the MultiDateMatcherTestSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: DATE

Python API: MultiDateMatcher Scala API: MultiDateMatcher Source: MultiDateMatcher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setAnchorDateYear(2020) \
    .setAnchorDateMonth(1) \
    .setAnchorDateDay(11) \
    .setDateFormat("yyyy/MM/dd")

pipeline = Pipeline().setStages([
    documentAssembler,
    date
])

data = spark.createDataFrame([["I saw him yesterday and he told me that he will visit us next week"]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(date) as dates").show(truncate=False)
+-----------------------------------------------+
|dates                                          |
+-----------------------------------------------+
|[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
|[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
+-----------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.MultiDateMatcher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val date = new MultiDateMatcher()
  .setInputCols("document")
  .setOutputCol("date")
  .setAnchorDateYear(2020)
  .setAnchorDateMonth(1)
  .setAnchorDateDay(11)
  .setDateFormat("yyyy/MM/dd")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  date
))

val data = Seq("I saw him yesterday and he told me that he will visit us next week")
  .toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(date) as dates").show(false)
+-----------------------------------------------+
|dates                                          |
+-----------------------------------------------+
|[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
|[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
+-----------------------------------------------+

MultiDocumentAssembler

Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The MultiDocumentAssembler can read either a String column or an Array[String]. Additionally, MultiDocumentAssembler.setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.
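
For instance, cleanup could be enabled like this (a minimal sketch using the "shrink" mode):

multiDocumentAssembler = MultiDocumentAssembler() \
    .setInputCols(["text", "text2"]) \
    .setOutputCols(["document1", "document2"]) \
    .setCleanupMode("shrink")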

For more extended examples on document pre-processing see the Examples.

Input Annotator Types: NONE

Output Annotator Type: DOCUMENT

Python API: MultiDocumentAssembler Scala API: MultiDocumentAssembler Source: MultiDocumentAssembler
Show Example
import sparknlp

from sparknlp.base import *

from pyspark.ml import Pipeline

data = spark.createDataFrame([["Spark NLP is an open-source text processing library.", "Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark"]]).toDF("text", "text2")

documentAssembler = MultiDocumentAssembler().setInputCols(["text", "text2"]).setOutputCols(["document1", "document2"])

result = documentAssembler.transform(data)

result.select("document1").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document1                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document1").printSchema()
root
 |-- document1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
import spark.implicits._
import com.johnsnowlabs.nlp.MultiDocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val multiDocumentAssembler = new MultiDocumentAssembler().setInputCols("text").setOutputCols("document")

val result = multiDocumentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)

NGramGenerator

A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.

When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.

For more extended examples see the Examples and the NGramGeneratorTestSpec.
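
Besides setN, the generator also exposes enableCumulative and delimiter; a minimal sketch (the chosen values are just illustrative):

nGrams = NGramGenerator() \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") \
    .setN(3) \
    .setEnableCumulative(True) \
    .setDelimiter("_")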

Input Annotator Types: TOKEN

Output Annotator Type: CHUNK

Python API: NGramGenerator Scala API: NGramGenerator Source: NGramGenerator
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

nGrams = NGramGenerator() \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") \
    .setN(2)

pipeline = Pipeline().setStages([
      documentAssembler,
      sentence,
      tokenizer,
      nGrams
    ])

data = spark.createDataFrame([["This is my sentence."]]).toDF("text")
results = pipeline.fit(data).transform(data)

results.selectExpr("explode(ngrams) as result").show(truncate=False)
+------------------------------------------------------------+
|result                                                      |
+------------------------------------------------------------+
|[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []]     |
|[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []]       |
|[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
|[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
+------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.NGramGenerator
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val nGrams = new NGramGenerator()
  .setInputCols("token")
  .setOutputCol("ngrams")
  .setN(2)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentence,
    tokenizer,
    nGrams
  ))

val data = Seq("This is my sentence.").toDF("text")
val results = pipeline.fit(data).transform(data)

results.selectExpr("explode(ngrams) as result").show(false)
+------------------------------------------------------------+
|result                                                      |
+------------------------------------------------------------+
|[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []]     |
|[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []]       |
|[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
|[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
+------------------------------------------------------------+

NerConverter

Converts an IOB or IOB2 representation of NER to a user-friendly one by associating the tokens of recognized entities with their label. The result is of CHUNK annotation type.

NER chunks can then be filtered by setting a whitelist with setWhiteList. Chunks with no associated entity (tagged “O”) are filtered.

See also Inside–outside–beginning (tagging) for more information.
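
For example, keeping only ORG and LOC chunks could look like this (a minimal sketch):

converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities") \
    .setWhiteList(["ORG", "LOC"])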

Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotator Type: CHUNK

Python API: NerConverter Scala API: NerConverter Source: NerConverter
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# This is a continuation of the example of the NerDLModel. See that class
# on how to extract the entities.
# The output of the NerDLModel follows the Annotator schema and can be converted like so:
#
# result.selectExpr("explode(ner)").show(truncate=False)
# +----------------------------------------------------+
# |col                                                 |
# +----------------------------------------------------+
# |[named_entity, 0, 2, B-ORG, [word -> U.N], []]      |
# |[named_entity, 3, 3, O, [word -> .], []]            |
# |[named_entity, 5, 12, O, [word -> official], []]    |
# |[named_entity, 14, 18, B-PER, [word -> Ekeus], []]  |
# |[named_entity, 20, 24, O, [word -> heads], []]      |
# |[named_entity, 26, 28, O, [word -> for], []]        |
# |[named_entity, 30, 36, B-LOC, [word -> Baghdad], []]|
# |[named_entity, 37, 37, O, [word -> .], []]          |
# +----------------------------------------------------+
#
# After the converter is used:
converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities")

converter.transform(result).selectExpr("explode(entities)").show(truncate=False)
+------------------------------------------------------------------------+
|col                                                                     |
+------------------------------------------------------------------------+
|[chunk, 0, 2, U.N, [entity -> ORG, sentence -> 0, chunk -> 0], []]      |
|[chunk, 14, 18, Ekeus, [entity -> PER, sentence -> 0, chunk -> 1], []]  |
|[chunk, 30, 36, Baghdad, [entity -> LOC, sentence -> 0, chunk -> 2], []]|
+------------------------------------------------------------------------+
// This is a continuation of the example of the [[com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel NerDLModel]]. See that class
// on how to extract the entities.
// The output of the NerDLModel follows the Annotator schema and can be converted like so:
//
// result.selectExpr("explode(ner)").show(false)
// +----------------------------------------------------+
// |col                                                 |
// +----------------------------------------------------+
// |[named_entity, 0, 2, B-ORG, [word -> U.N], []]      |
// |[named_entity, 3, 3, O, [word -> .], []]            |
// |[named_entity, 5, 12, O, [word -> official], []]    |
// |[named_entity, 14, 18, B-PER, [word -> Ekeus], []]  |
// |[named_entity, 20, 24, O, [word -> heads], []]      |
// |[named_entity, 26, 28, O, [word -> for], []]        |
// |[named_entity, 30, 36, B-LOC, [word -> Baghdad], []]|
// |[named_entity, 37, 37, O, [word -> .], []]          |
// +----------------------------------------------------+
//
// After the converter is used:
val converter = new NerConverter()
  .setInputCols("sentence", "token", "ner")
  .setOutputCol("entities")
  .setPreservePosition(false)

converter.transform(result).selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------+
|col                                                                     |
+------------------------------------------------------------------------+
|[chunk, 0, 2, U.N, [entity -> ORG, sentence -> 0, chunk -> 0], []]      |
|[chunk, 14, 18, Ekeus, [entity -> PER, sentence -> 0, chunk -> 1], []]  |
|[chunk, 30, 36, Baghdad, [entity -> LOC, sentence -> 0, chunk -> 2], []]|
+------------------------------------------------------------------------+

NerCrf

Algorithm for training a Named Entity Recognition Model

For instantiated/pretrained models, see NerCrfModel.

This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY. Excluding the label, these columns can be produced by, for example, a SentenceDetector, Tokenizer, PerceptronModel and WordEmbeddingsModel.

Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.
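
A sketch of wiring in such a dictionary (the file path and the "," delimiter are assumptions):

nerTagger = NerCrfApproach() \
    .setInputCols(["sentence", "token", "pos", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setExternalFeatures("src/test/resources/ner-corpus/dict.txt", ",")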

For extended examples of usage, see the Examples and the NerCrfApproachTestSpec.

Input Annotator Types: DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS

Output Annotator Type: NAMED_ENTITY

Python API: NerCrfApproach Scala API: NerCrfApproach Source: NerCrfApproach
Show Example
# This CoNLL dataset already includes the sentence, token, pos and label column with their respective annotator types.
# If a custom dataset is used, these need to be defined.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False)

nerTagger = NerCrfApproach() \
    .setInputCols(["sentence", "token", "pos", "embeddings"]) \
    .setLabelColumn("label") \
    .setMinEpochs(1) \
    .setMaxEpochs(3) \
    .setC0(34) \
    .setL2(3.0) \
    .setOutputCol("ner")

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    nerTagger
])


conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

pipelineModel = pipeline.fit(trainingData)
// This CoNLL dataset already includes the sentence, token, pos and label column with their respective annotator types.
// If a custom dataset is used, these need to be defined.

import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotator.NerCrfApproach
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val nerTagger = new NerCrfApproach()
  .setInputCols("sentence", "token", "pos", "embeddings")
  .setLabelColumn("label")
  .setMinEpochs(1)
  .setMaxEpochs(3)
  .setC0(34)
  .setL2(3.0)
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  nerTagger
))


val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

val pipelineModel = pipeline.fit(trainingData)

NerDL

This Named Entity recognition annotator allows training a generic NER model based on neural networks.

The architecture of the neural network is a Char CNN - BiLSTM - CRF that achieves state-of-the-art results on most datasets.

For instantiated/pretrained models, see NerDLModel.

The training data should be a labeled Spark Dataset, in the format of CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY. Excluding the label, these columns can be produced by, for example, a SentenceDetector, Tokenizer and WordEmbeddingsModel.

Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example will show how to create the test dataset with a CoNLL dataset:

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = WordEmbeddingsModel
  .pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))

val conll = CoNLL()
val Array(train, test) = conll
  .readDataset(spark, "src/test/resources/conll2003/eng.train")
  .randomSplit(Array(0.8, 0.2))

preProcessingPipeline
  .fit(test)
  .transform(test)
  .write
  .mode("overwrite")
  .parquet("test_data")

val nerTagger = new NerDLApproach()
  .setInputCols("document", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setTestDataset("test_data")
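
In Python, the same test set could be prepared like this (a sketch mirroring the Scala snippet above):

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
from sparknlp.training import CoNLL
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Split the CoNLL dataset and persist the held-out part as parquet
train, test = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train") \
    .randomSplit([0.8, 0.2])
Pipeline().setStages([documentAssembler, embeddings]) \
    .fit(test).transform(test) \
    .write.mode("overwrite").parquet("test_data")

nerTagger = NerDLApproach() \
    .setInputCols(["document", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setTestDataset("test_data")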

For extended examples of usage, see the Examples and the NerDLSpec.

Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotator Type: NAMED_ENTITY

Python API: NerDLApproach Scala API: NerDLApproach Source: NerDLApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

# First extract the prerequisites for the NerDLApproach
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Then the training can start
nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(1) \
    .setRandomSeed(0) \
    .setVerbose(0)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

# We use the text and labels from the CoNLL dataset
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

pipelineModel = pipeline.fit(trainingData)
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline

// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

// Then the training can start
val nerTagger = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setMaxEpochs(1)
  .setRandomSeed(0)
  .setVerbose(0)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))

// We use the text and labels from the CoNLL dataset
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

val pipelineModel = pipeline.fit(trainingData)

NerOverwriter

Overwrites entities of specified strings.

The input for this Annotator has to be entities that are already extracted, of Annotator type NAMED_ENTITY. The strings specified with setStopWords will have new entities assigned to them, as specified with setNewResult.

Input Annotator Types: NAMED_ENTITY

Output Annotator Type: NAMED_ENTITY

Python API: NerOverwriter Scala API: NerOverwriter Source: NerOverwriter
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# First extract the prerequisite Entities
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert")

nerTagger = NerDLModel.pretrained() \
    .setInputCols(["sentence", "token", "bert"]) \
    .setOutputCol("ner")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

data = spark.createDataFrame([["Spark NLP Crosses Five Million Downloads, John Snow Labs Announces."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(ner)").show(truncate=False)
# +------------------------------------------------------+
# |col                                                   |
# +------------------------------------------------------+
# |[named_entity, 0, 4, B-ORG, [word -> Spark], []]      |
# |[named_entity, 6, 8, I-ORG, [word -> NLP], []]        |
# |[named_entity, 10, 16, O, [word -> Crosses], []]      |
# |[named_entity, 18, 21, O, [word -> Five], []]         |
# |[named_entity, 23, 29, O, [word -> Million], []]      |
# |[named_entity, 31, 39, O, [word -> Downloads], []]    |
# |[named_entity, 40, 40, O, [word -> ,], []]            |
# |[named_entity, 42, 45, B-ORG, [word -> John], []]     |
# |[named_entity, 47, 50, I-ORG, [word -> Snow], []]     |
# |[named_entity, 52, 55, I-ORG, [word -> Labs], []]     |
# |[named_entity, 57, 65, I-ORG, [word -> Announces], []]|
# |[named_entity, 66, 66, O, [word -> .], []]            |
# +------------------------------------------------------+

# The recognized entities can then be overwritten
nerOverwriter = NerOverwriter() \
    .setInputCols(["ner"]) \
    .setOutputCol("ner_overwritten") \
    .setStopWords(["Million"]) \
    .setNewResult("B-CARDINAL")

nerOverwriter.transform(result).selectExpr("explode(ner_overwritten)").show(truncate=False)
+---------------------------------------------------------+
|col                                                      |
+---------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []]         |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []]           |
|[named_entity, 10, 16, O, [word -> Crosses], []]         |
|[named_entity, 18, 21, O, [word -> Five], []]            |
|[named_entity, 23, 29, B-CARDINAL, [word -> Million], []]|
|[named_entity, 31, 39, O, [word -> Downloads], []]       |
|[named_entity, 40, 40, O, [word -> ,], []]               |
|[named_entity, 42, 45, B-ORG, [word -> John], []]        |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []]        |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []]        |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []]   |
|[named_entity, 66, 66, O, [word -> .], []]               |
+---------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import com.johnsnowlabs.nlp.annotators.ner.NerOverwriter
import org.apache.spark.ml.Pipeline

// First extract the prerequisite Entities
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("bert")

val nerTagger = NerDLModel.pretrained()
  .setInputCols("sentence", "token", "bert")
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))

val data = Seq("Spark NLP Crosses Five Million Downloads, John Snow Labs Announces.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(ner)").show(false)
+------------------------------------------------------+
|col                                                   |
+------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []]      |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []]        |
|[named_entity, 10, 16, O, [word -> Crosses], []]      |
|[named_entity, 18, 21, O, [word -> Five], []]         |
|[named_entity, 23, 29, O, [word -> Million], []]      |
|[named_entity, 31, 39, O, [word -> Downloads], []]    |
|[named_entity, 40, 40, O, [word -> ,], []]            |
|[named_entity, 42, 45, B-ORG, [word -> John], []]     |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []]     |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []]     |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []]|
|[named_entity, 66, 66, O, [word -> .], []]            |
+------------------------------------------------------+
*/
// The recognized entities can then be overwritten
val nerOverwriter = new NerOverwriter()
  .setInputCols("ner")
  .setOutputCol("ner_overwritten")
  .setStopWords(Array("Million"))
  .setNewResult("B-CARDINAL")

nerOverwriter.transform(result).selectExpr("explode(ner_overwritten)").show(false)
+---------------------------------------------------------+
|col                                                      |
+---------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []]         |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []]           |
|[named_entity, 10, 16, O, [word -> Crosses], []]         |
|[named_entity, 18, 21, O, [word -> Five], []]            |
|[named_entity, 23, 29, B-CARDINAL, [word -> Million], []]|
|[named_entity, 31, 39, O, [word -> Downloads], []]       |
|[named_entity, 40, 40, O, [word -> ,], []]               |
|[named_entity, 42, 45, B-ORG, [word -> John], []]        |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []]        |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []]        |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []]   |
|[named_entity, 66, 66, O, [word -> .], []]               |
+---------------------------------------------------------+

Normalizer

Annotator that cleans out tokens. It requires token input (e.g. stems), removes all unwanted characters from the text following a regex pattern, and can transform words based on a provided dictionary.
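
The dictionary-based transformation mentioned above can be sketched roughly as follows; the slang dictionary file and its path are hypothetical (assuming a two-column file that maps a raw form to its replacement):

from sparknlp.annotator import Normalizer

# Hypothetical dictionary file, e.g. lines of the form "gr8,great"
normalizerWithDict = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setSlangDictionary("slang_dict.txt", ",")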

For extended examples of usage, see the Examples.

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

Python API: Normalizer Scala API: Normalizer Source: Normalizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setCleanupPatterns(["""[^\w\d\s]"""]) # remove punctuation (keep alphanumeric chars)
# if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much."]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("normalized.result").show(truncate = False)
+----------------------------------------------------------------------------------------+
|result                                                                                  |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Normalizer, Tokenizer}
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(true)
  .setCleanupPatterns(Array("""[^\w\d\s]""")) // remove punctuation (keep alphanumeric chars)
// if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  normalizer
))

val data = Seq("John and Peter are brothers. However they don't support each other that much.")
  .toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("normalized.result").show(truncate = false)
+----------------------------------------------------------------------------------------+
|result                                                                                  |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+

NorvigSweeting Spellchecker

Trains an annotator that retrieves tokens and makes corrections automatically if they are not found in an English dictionary.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent. A dictionary of correct spellings must be provided with setDictionary as a text file, where each word is parsed by a regex pattern.

Inspired by Norvig model and SymSpell.

For instantiated/pretrained models, see NorvigSweetingModel.

For extended examples of usage, see the NorvigSweetingTestSpec.

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

Python API: NorvigSweetingApproach Scala API: NorvigSweetingApproach Source: NorvigSweetingApproach
Show Example
# In this example, the dictionary `"words.txt"` has the form of
#
# ...
# gummy
# gummic
# gummier
# gummiest
# gummiferous
# ...
#
# This dictionary is then set to be the basis of the spell checker.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell") \
    .setDictionary("src/test/resources/spell/words.txt")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])

pipelineModel = pipeline.fit(trainingData)
// In this example, the dictionary `"words.txt"` has the form of
//
// ...
// gummy
// gummic
// gummier
// gummiest
// gummiferous
// ...
//
// This dictionary is then set to be the basis of the spell checker.

import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val spellChecker = new NorvigSweetingApproach()
  .setInputCols("token")
  .setOutputCol("spell")
  .setDictionary("src/test/resources/spell/words.txt")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  spellChecker
))

val pipelineModel = pipeline.fit(trainingData)
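
In both snippets, trainingData is assumed to be a DataFrame with a text column. Once fitted, the pipeline model can be applied to new text; a minimal Python sketch:

data = spark.createDataFrame([["somtimes i wrrite wrods erong."]]).toDF("text")
pipelineModel.transform(data).selectExpr("spell.result").show(truncate=False)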

POSTagger (Part of speech tagger)

Trains an averaged Perceptron model to tag words with their part-of-speech. It assigns a POS tag to each word within a sentence.

For pretrained models please see the PerceptronModel.

The training data needs to be in a Spark DataFrame, where a column contains Annotations of type POS. The Annotation needs to have its member result set to the POS tag and a "word" mapping to its word inside the member metadata. This training DataFrame can easily be created with the helper class POS.

POS().readDataset(spark, datasetPath).selectExpr("explode(tags) as tags").show(false)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
                      ...

For extended examples of usage, see the Examples and PerceptronApproach tests.

Input Annotator Types: TOKEN, DOCUMENT

Output Annotator Type: POS

Python API: PerceptronApproach Scala API: PerceptronApproach Source: PerceptronApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
trainingPerceptronDF = POS().readDataset(spark, datasetPath)

trainedPos = PerceptronApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos") \
    .setPosColumn("tags") \
    .fit(trainingPerceptronDF)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    trainedPos
])

data = spark.createDataFrame([["To be or not to be, is this the question?"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("pos.result").show(truncate=False)
+--------------------------------------------------+
|result                                            |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
val trainingPerceptronDF = POS().readDataset(spark, datasetPath)

val trainedPos = new PerceptronApproach()
  .setInputCols("document", "token")
  .setOutputCol("pos")
  .setPosColumn("tags")
  .fit(trainingPerceptronDF)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  trainedPos
))

val data = Seq("To be or not to be, is this the question?").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("pos.result").show(false)
+--------------------------------------------------+
|result                                            |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+

PromptAssembler

Assembles a sequence of messages into a single string using a template. These strings can then be used as prompts for large language models.

This annotator expects an array of two-tuples as the type of the input column (one array of tuples per row). The first element of the tuples should be the role and the second element is the text of the message. Possible roles are “system”, “user” and “assistant”.

An assistant header can be added to the end of the generated string by using setAddAssistant(true).
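
A minimal sketch of that option, assuming the Python setter mirrors the Scala one and a chat template string is already defined (see the full example below):

from sparknlp.base import PromptAssembler

# setAddAssistant(True) is assumed to append an assistant header to the rendered prompt
prompt_assembler = PromptAssembler() \
    .setInputCol("messages") \
    .setOutputCol("prompt") \
    .setChatTemplate(template) \
    .setAddAssistant(True)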

At the moment, this annotator uses llama.cpp as a backend to parse and apply the templates. llama.cpp uses basic pattern matching to determine the type of the template, then applies a basic version of the template to the messages. This means that more advanced templates are not supported.

For an extended example see the example notebook.

Input Annotator Types: NONE

Output Annotator Type: DOCUMENT

Python API: PromptAssembler Scala API: PromptAssembler Source: PromptAssembler
Show Example
from sparknlp.base import *

messages = [
    [
        ("system", "You are a helpful assistant."),
        ("assistant", "Hello there, how can I help you?"),
        ("user", "I need help with organizing my room."),
    ]
]
df = spark.createDataFrame([messages]).toDF("messages")


# llama3.1
template = (
    "{{- bos_token }} {%- if custom_tools is defined %} {%- set tools = custom_tools %} {%- "
    "endif %} {%- if not tools_in_user_message is defined %} {%- set tools_in_user_message = true %} {%- "
    'endif %} {%- if not date_string is defined %} {%- set date_string = "26 Jul 2024" %} {%- endif %} '
    "{%- if not tools is defined %} {%- set tools = none %} {%- endif %} {#- This block extracts the "
    "system message, so we can slot it into the right place. #} {%- if messages[0]['role'] == 'system' %}"
    " {%- set system_message = messages[0]['content']|trim %} {%- set messages = messages[1:] %} {%- else"
    ' %} {%- set system_message = "" %} {%- endif %} {#- System message + builtin tools #} {{- '
    '"<|start_header_id|>system<|end_header_id|>\\n\n" }} {%- if builtin_tools is defined or tools is '
    'not none %} {{- "Environment: ipython\\n" }} {%- endif %} {%- if builtin_tools is defined %} {{- '
    '"Tools: " + builtin_tools | reject(\'equalto\', \'code_interpreter\') | join(", ") + "\\n\n"}} '
    '{%- endif %} {{- "Cutting Knowledge Date: December 2023\\n" }} {{- "Today Date: " + date_string '
    '+ "\\n\n" }} {%- if tools is not none and not tools_in_user_message %} {{- "You have access to '
    'the following functions. To call a function, please respond with JSON for a function call." }} {{- '
    '\'Respond in the format {"name": function name, "parameters": dictionary of argument name and its'
    ' value}.\' }} {{- "Do not use variables.\\n\n" }} {%- for t in tools %} {{- t | tojson(indent=4) '
    '}} {{- "\\n\n" }} {%- endfor %} {%- endif %} {{- system_message }} {{- "<|eot_id|>" }} {#- '
    "Custom tools are passed in a user message with some extra guidance #} {%- if tools_in_user_message "
    "and not tools is none %} {#- Extract the first user message so we can plug it in here #} {%- if "
    "messages | length != 0 %} {%- set first_user_message = messages[0]['content']|trim %} {%- set "
    'messages = messages[1:] %} {%- else %} {{- raise_exception("Cannot put tools in the first user '
    "message when there's no first user message!\") }} {%- endif %} {{- "
    "'<|start_header_id|>user<|end_header_id|>\\n\n' -}} {{- \"Given the following functions, please "
    'respond with a JSON for a function call " }} {{- "with its proper arguments that best answers the '
    'given prompt.\\n\n" }} {{- \'Respond in the format {"name": function name, "parameters": '
    'dictionary of argument name and its value}.\' }} {{- "Do not use variables.\\n\n" }} {%- for t in '
    'tools %} {{- t | tojson(indent=4) }} {{- "\\n\n" }} {%- endfor %} {{- first_user_message + '
    "\"<|eot_id|>\"}} {%- endif %} {%- for message in messages %} {%- if not (message.role == 'ipython' "
    "or message.role == 'tool' or 'tool_calls' in message) %} {{- '<|start_header_id|>' + message['role']"
    " + '<|end_header_id|>\\n\n'+ message['content'] | trim + '<|eot_id|>' }} {%- elif 'tool_calls' in "
    'message %} {%- if not message.tool_calls|length == 1 %} {{- raise_exception("This model only '
    'supports single tool-calls at once!") }} {%- endif %} {%- set tool_call = message.tool_calls[0]'
    ".function %} {%- if builtin_tools is defined and tool_call.name in builtin_tools %} {{- "
    "'<|start_header_id|>assistant<|end_header_id|>\\n\n' -}} {{- \"<|python_tag|>\" + tool_call.name + "
    '".call(" }} {%- for arg_name, arg_val in tool_call.arguments | items %} {{- arg_name + \'="\' + '
    'arg_val + \'"\' }} {%- if not loop.last %} {{- ", " }} {%- endif %} {%- endfor %} {{- ")" }} {%- '
    "else %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\n' -}} {{- '{\"name\": \"' + "
    'tool_call.name + \'", \' }} {{- \'"parameters": \' }} {{- tool_call.arguments | tojson }} {{- "}" '
    "}} {%- endif %} {%- if builtin_tools is defined %} {#- This means we're in ipython mode #} {{- "
    '"<|eom_id|>" }} {%- else %} {{- "<|eot_id|>" }} {%- endif %} {%- elif message.role == "tool" '
    'or message.role == "ipython" %} {{- "<|start_header_id|>ipython<|end_header_id|>\\n\n" }} {%- '
    "if message.content is mapping or message.content is iterable %} {{- message.content | tojson }} {%- "
    'else %} {{- message.content }} {%- endif %} {{- "<|eot_id|>" }} {%- endif %} {%- endfor %} {%- if '
    "add_generation_prompt %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\n' }} {%- endif %} "
)


prompt_assembler = (
    PromptAssembler()
    .setInputCol("messages")
    .setOutputCol("prompt")
    .setChatTemplate(template)
)

prompt_assembler.transform(df).select("prompt.result").show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                      |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello there, how can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI need help with organizing my room.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
// Batches (whole conversations) of arrays of messages
val data: Seq[Seq[(String, String)]] = Seq(
  Seq(
    ("system", "You are a helpful assistant."),
    ("assistant", "Hello there, how can I help you?"),
    ("user", "I need help with organizing my room.")))

val dataDF = data.toDF("messages")


// llama3.1
val template =
  "{{- bos_token }} {%- if custom_tools is defined %} {%- set tools = custom_tools %} {%- " +
    "endif %} {%- if not tools_in_user_message is defined %} {%- set tools_in_user_message = true %} {%- " +
    "endif %} {%- if not date_string is defined %} {%- set date_string = \"26 Jul 2024\" %} {%- endif %} " +
    "{%- if not tools is defined %} {%- set tools = none %} {%- endif %} {#- This block extracts the " +
    "system message, so we can slot it into the right place. #} {%- if messages[0]['role'] == 'system' %}" +
    " {%- set system_message = messages[0]['content']|trim %} {%- set messages = messages[1:] %} {%- else" +
    " %} {%- set system_message = \"\" %} {%- endif %} {#- System message + builtin tools #} {{- " +
    "\"<|start_header_id|>system<|end_header_id|>\\n\\n\" }} {%- if builtin_tools is defined or tools is " +
    "not none %} {{- \"Environment: ipython\\n\" }} {%- endif %} {%- if builtin_tools is defined %} {{- " +
    "\"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}} " +
    "{%- endif %} {{- \"Cutting Knowledge Date: December 2023\\n\" }} {{- \"Today Date: \" + date_string " +
    "+ \"\\n\\n\" }} {%- if tools is not none and not tools_in_user_message %} {{- \"You have access to " +
    "the following functions. To call a function, please respond with JSON for a function call.\" }} {{- " +
    "'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its" +
    " value}.' }} {{- \"Do not use variables.\\n\\n\" }} {%- for t in tools %} {{- t | tojson(indent=4) " +
    "}} {{- \"\\n\\n\" }} {%- endfor %} {%- endif %} {{- system_message }} {{- \"<|eot_id|>\" }} {#- " +
    "Custom tools are passed in a user message with some extra guidance #} {%- if tools_in_user_message " +
    "and not tools is none %} {#- Extract the first user message so we can plug it in here #} {%- if " +
    "messages | length != 0 %} {%- set first_user_message = messages[0]['content']|trim %} {%- set " +
    "messages = messages[1:] %} {%- else %} {{- raise_exception(\"Cannot put tools in the first user " +
    "message when there's no first user message!\") }} {%- endif %} {{- " +
    "'<|start_header_id|>user<|end_header_id|>\\n\\n' -}} {{- \"Given the following functions, please " +
    "respond with a JSON for a function call \" }} {{- \"with its proper arguments that best answers the " +
    "given prompt.\\n\\n\" }} {{- 'Respond in the format {\"name\": function name, \"parameters\": " +
    "dictionary of argument name and its value}.' }} {{- \"Do not use variables.\\n\\n\" }} {%- for t in " +
    "tools %} {{- t | tojson(indent=4) }} {{- \"\\n\\n\" }} {%- endfor %} {{- first_user_message + " +
    "\"<|eot_id|>\"}} {%- endif %} {%- for message in messages %} {%- if not (message.role == 'ipython' " +
    "or message.role == 'tool' or 'tool_calls' in message) %} {{- '<|start_header_id|>' + message['role']" +
    " + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }} {%- elif 'tool_calls' in " +
    "message %} {%- if not message.tool_calls|length == 1 %} {{- raise_exception(\"This model only " +
    "supports single tool-calls at once!\") }} {%- endif %} {%- set tool_call = message.tool_calls[0]" +
    ".function %} {%- if builtin_tools is defined and tool_call.name in builtin_tools %} {{- " +
    "'<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}} {{- \"<|python_tag|>\" + tool_call.name + " +
    "\".call(\" }} {%- for arg_name, arg_val in tool_call.arguments | items %} {{- arg_name + '=\"' + " +
    "arg_val + '\"' }} {%- if not loop.last %} {{- \", \" }} {%- endif %} {%- endfor %} {{- \")\" }} {%- " +
    "else %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}} {{- '{\"name\": \"' + " +
    "tool_call.name + '\", ' }} {{- '\"parameters\": ' }} {{- tool_call.arguments | tojson }} {{- \"}\" " +
    "}} {%- endif %} {%- if builtin_tools is defined %} {#- This means we're in ipython mode #} {{- " +
    "\"<|eom_id|>\" }} {%- else %} {{- \"<|eot_id|>\" }} {%- endif %} {%- elif message.role == \"tool\" " +
    "or message.role == \"ipython\" %} {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }} {%- " +
    "if message.content is mapping or message.content is iterable %} {{- message.content | tojson }} {%- " +
    "else %} {{- message.content }} {%- endif %} {{- \"<|eot_id|>\" }} {%- endif %} {%- endfor %} {%- if " +
    "add_generation_prompt %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }} {%- endif %} "


val promptAssembler = new PromptAssembler()
  .setInputCol("messages")
  .setOutputCol("prompt")
  .setChatTemplate(template)

promptAssembler.transform(dataDF).select("prompt.result").show(truncate = false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                      |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello there, how can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI need help with organizing my room.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

RecursiveTokenizer

Tokenizes raw text recursively based on a handful of definable rules.

Unlike the Tokenizer, the RecursiveTokenizer operates based only on the following array string parameters (a short configuration sketch follows the list):

  • prefixes: Strings that will be split off when found at the beginning of a token.
  • suffixes: Strings that will be split off when found at the end of a token.
  • infixes: Strings that will be split off when found in the middle of a token.
  • whitelist: Strings that should not be split.
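
A minimal configuration sketch (the values below are illustrative, not the library defaults):

from sparknlp.annotator import RecursiveTokenizer

# Illustrative values only; check the API reference for the actual defaults
tokenizer = RecursiveTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPrefixes(["\"", "(", "[", "'"]) \
    .setSuffixes([".", ",", "?", ")", "]", "\""]) \
    .setInfixes(["\n", "(", ")"]) \
    .setWhitelist(["it's", "don't", "won't"])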

For extended examples of usage, see the Examples and the TokenizerTestSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: TOKEN

Python API: RecursiveTokenizer Scala API: RecursiveTokenizer Source: RecursiveTokenizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = RecursiveTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer
])

data = spark.createDataFrame([["One, after the Other, (and) again. PO, QAM,"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("token.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.RecursiveTokenizer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new RecursiveTokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer
))

val data = Seq("One, after the Other, (and) again. PO, QAM,").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("token.result").show(false)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+

RegexMatcher

Uses rules to match a set of regular expressions and associate them with a provided identifier.

A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be "\d{4}\/\d\d\/\d\d,date" which will match strings like "1970/01/01" to the identifier "date".

Rules must be provided by either setRules (followed by setDelimiter) or an external file.
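
If your Spark NLP version supports inline rules, a rough sketch would look like this (the rule strings are illustrative):

from sparknlp.annotator import RegexMatcher

# Each rule is "<regex><delimiter><identifier>"; here the delimiter is ","
regexMatcherInline = RegexMatcher() \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL") \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
    .setDelimiter(",")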

To use an external file, a dictionary of predefined regular expressions must be provided with setExternalRules. The dictionary can be set as a delimited text file.

Pretrained pipelines are available for this module, see Pipelines.

For extended examples of usage, see the Examples and the RegexMatcherTestSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: CHUNK

Python API: RegexMatcher Scala API: RegexMatcher Source: RegexMatcher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, the `rules.txt` has the form of
#
# the\s\w+, followed by 'the'
# ceremonies, ceremony
#
# where each regex is separated from its identifier by `","`

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

regexMatcher = RegexMatcher() \
    .setExternalRules("src/test/resources/regex-matcher/rules.txt",  ",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")

pipeline = Pipeline().setStages([documentAssembler, sentence, regexMatcher])

data = spark.createDataFrame([[
    "My first sentence with the first rule. This is my second sentence with ceremonies rule."
]]).toDF("text")
results = pipeline.fit(data).transform(data)

results.selectExpr("explode(regex) as result").show(truncate=False)
+--------------------------------------------------------------------------------------------+
|result                                                                                      |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
+--------------------------------------------------------------------------------------------+
// In this example, the `rules.txt` has the form of
//
// the\s\w+, followed by 'the'
// ceremonies, ceremony
//
// where each regex is separated from its identifier by `","`
import ResourceHelper.spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.RegexMatcher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")

val regexMatcher = new RegexMatcher()
  .setExternalRules("src/test/resources/regex-matcher/rules.txt",  ",")
  .setInputCols(Array("sentence"))
  .setOutputCol("regex")
  .setStrategy("MATCH_ALL")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher))

val data = Seq(
  "My first sentence with the first rule. This is my second sentence with ceremonies rule."
).toDF("text")
val results = pipeline.fit(data).transform(data)

results.selectExpr("explode(regex) as result").show(false)
+--------------------------------------------------------------------------------------------+
|result                                                                                      |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
+--------------------------------------------------------------------------------------------+

RegexTokenizer

A tokenizer that splits text by a regex pattern.

The pattern needs to be set with setPattern; it defines the delimiting pattern, i.e. how the tokens should be split. By default this pattern is \s+, which means that tokens are split by one or more whitespace characters.
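
For example, a different pattern can also split off punctuation; a minimal sketch (the pattern choice is illustrative):

from sparknlp.annotator import RegexTokenizer

# Split on any run of non-word characters instead of only whitespace
regexTokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("regexToken") \
    .setToLowercase(True) \
    .setPattern("\\W+")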

Input Annotator Types: DOCUMENT

Output Annotator Type: TOKEN

Python API: RegexTokenizer Scala API: RegexTokenizer Source: RegexTokenizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

regexTokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("regexToken") \
    .setToLowercase(True) \
    .setPattern("\\s+")

pipeline = Pipeline().setStages([
      documentAssembler,
      regexTokenizer
    ])

data = spark.createDataFrame([["This is my first sentence.\nThis is my second."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("regexToken.result").show(truncate=False)
+-------------------------------------------------------+
|result                                                 |
+-------------------------------------------------------+
|[this, is, my, first, sentence., this, is, my, second.]|
+-------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.RegexTokenizer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val regexTokenizer = new RegexTokenizer()
  .setInputCols("document")
  .setOutputCol("regexToken")
  .setToLowercase(true)
  .setPattern("\\s+")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    regexTokenizer
  ))

val data = Seq("This is my first sentence.\nThis is my second.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("regexToken.result").show(false)
+-------------------------------------------------------+
|result                                                 |
+-------------------------------------------------------+
|[this, is, my, first, sentence., this, is, my, second.]|
+-------------------------------------------------------+

SentenceDetector

Annotator that detects sentence boundaries using regular expressions.

The following characters are checked as sentence boundaries:

  1. Lists (“(i), (ii)”, “(a), (b)”, “1., 2.”)
  2. Numbers
  3. Abbreviations
  4. Punctuations
  5. Multiple Periods
  6. Geo-Locations/Coordinates (“N°. 1026.253.553.”)
  7. Ellipsis (“…”)
  8. In-between punctuations
  9. Quotation marks
  10. Exclamation Points
  11. Basic Breakers (“.”, “;”)

For the explicit regular expressions used for detection, refer to source of PragmaticContentFormatter.

To add additional custom bounds, the parameter customBounds can be set with an array:

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setCustomBounds(Array("\n\n"))

If only the custom bounds should be used, then the parameter useCustomBoundsOnly should be set to true.

Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to true.
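
A small sketch combining both options (use only the custom bounds and return one sentence per row):

from sparknlp.annotator import SentenceDetector

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setCustomBounds(["\n\n"]) \
    .setUseCustomBoundsOnly(True) \
    .setExplodeSentences(True)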

For extended examples of usage, see the Examples.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: SentenceDetector Scala API: SentenceDetector Source: SentenceDetector
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setCustomBounds(["\n\n"])

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence
])

data = spark.createDataFrame([["This is my first sentence. This my second. How about a third?"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(sentence) as sentences").show(truncate=False)
+------------------------------------------------------------------+
|sentences                                                         |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []]          |
|[document, 43, 60, How about a third?, [sentence -> 2], []]       |
+------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setCustomBounds(Array("\n\n"))

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence
))

val data = Seq("This is my first sentence. This my second. How about a third?").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(sentence) as sentences").show(false)
+------------------------------------------------------------------+
|sentences                                                         |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []]          |
|[document, 43, 60, How about a third?, [sentence -> 2], []]       |
+------------------------------------------------------------------+

SentenceDetectorDL

Trains an annotator that detects sentence boundaries using a deep learning approach.

For pretrained models see SentenceDetectorDLModel.

Currently, only the CNN model is supported for training; in the future, the architecture of the model will be selectable with setModelArchitecture.

The default model "cnn" is based on the paper Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection (2020, Stefan Schweter, Sajawel Ahmed) using a CNN architecture. We also modified the original implementation a little bit to cover broken sentences and some impossible end of line chars.

Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to true.

For extended examples of usage, see the Examples and the SentenceDetectorDLSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: SentenceDetectorDLApproach Scala API: SentenceDetectorDLApproach Source: SentenceDetectorDLApproach
Show Example
# The training process needs data, where each data point is a sentence.
# In this example the `train.txt` file has the form of
#
# ...
# Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
# His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
# ...
#
# where each line is one sentence.
# Training can then be started like so:

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

trainingData = spark.read.text("train.txt").toDF("text")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLApproach() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences") \
    .setEpochsNumber(100)

pipeline = Pipeline().setStages([documentAssembler, sentenceDetector])

model = pipeline.fit(trainingData)
// The training process needs data, where each data point is a sentence.
// In this example the `train.txt` file has the form of
//
// ...
// Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
// His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
// ...
//
// where each line is one sentence.
// Training can then be started like so:
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLApproach
import org.apache.spark.ml.Pipeline

val trainingData = spark.read.text("train.txt").toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetectorDLApproach()
  .setInputCols(Array("document"))
  .setOutputCol("sentences")
  .setEpochsNumber(100)

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector))

val model = pipeline.fit(trainingData)

SentenceEmbeddings

Converts the results from WordEmbeddings, BertEmbeddings, or ElmoEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).

This can be configured with setPoolingStrategy, which can be either "AVERAGE" or "SUM".

For more extended examples see the Examples and the SentenceEmbeddingsTestSpec.

TIP: Here is how you can explode and convert these embeddings into Vectors, or what is known as a feature column, so they can be used in Spark ML regression or clustering functions:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import explode, udf

# Let's create a UDF to take an array of embeddings and output Spark ML Vectors
@udf(returnType=VectorUDT())
def convertToVectorUDF(matrix):
    return Vectors.dense([float(x) for x in matrix])


# Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode("sentence_embeddings.embeddings").alias("sentence_embedding")) \
    .withColumn("features", convertToVectorUDF("sentence_embedding"))
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

// Let's create a UDF to take an array of embeddings and output Spark ML Vectors
val convertToVectorUDF = udf((matrix : Seq[Float]) => {
    Vectors.dense(matrix.toArray.map(_.toDouble))
})

// Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode($"sentence_embeddings.embeddings").as("sentence_embedding"))
  .withColumn("features", convertToVectorUDF($"sentence_embedding"))

Input Annotator Types: DOCUMENT, WORD_EMBEDDINGS

Output Annotator Type: SENTENCE_EMBEDDINGS

Note: If you choose document as your input for Tokenizer, WordEmbeddings/BertEmbeddings, and SentenceEmbeddings then it averages/sums all the embeddings into one array of embeddings. However, if you choose sentence as inputCols then for each sentence SentenceEmbeddings generates one array of embeddings.

Python API: SentenceEmbeddings Scala API: SentenceEmbeddings Source: SentenceEmbeddings
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      embeddings,
      embeddingsSentence,
      embeddingsFinisher
    ])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.embeddings.SentenceEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val embeddingsSentence = new SentenceEmbeddings()
  .setInputCols(Array("document", "embeddings"))
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("sentence_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsSentence,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+

SentimentDL

Trains a SentimentDL, an annotator for multi-class sentiment analysis.

In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is if either a product review or tweet can be interpreted positively or negatively.

For the instantiated/pretrained models, see SentimentDLModel.

Notes:

  • This annotator accepts a label column holding a single item per row, of type String, Int, Float, or Double. So positive sentiment can be expressed as either "positive" or 0, negative sentiment as "negative" or 1.
  • UniversalSentenceEncoder, BertSentenceEmbeddings, SentenceEmbeddings or other sentence-based embeddings can be used.

Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example shows how to create such a test dataset:

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))

val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
  .fit(test)
  .transform(test)
  .write
  .mode("overwrite")
  .parquet("test_data")

val classifier = new SentimentDLApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("sentiment")
  .setLabelColumn("label")
  .setTestDataset("test_data")
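
A rough Python equivalent of the same pre-processing and test-dataset setup (data, the paths and the split ratio are illustrative; setTestDataset is assumed to accept a parquet path as in the Scala API):

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLApproach
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

preProcessingPipeline = Pipeline().setStages([documentAssembler, embeddings])

(train, test) = data.randomSplit([0.8, 0.2])
preProcessingPipeline \
    .fit(test) \
    .transform(test) \
    .write \
    .mode("overwrite") \
    .parquet("test_data")

classifier = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setTestDataset("test_data")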

For extended examples of usage, see the Examples and the SentimentDLTestSpec.

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: CATEGORY

Python API: SentimentDLApproach Scala API: SentimentDLApproach Source: SentimentDLApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, `sentiment.csv` is in the form
#
# text,label
# This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
# This was a terrible movie! The acting was bad really bad!,1
#
# The model can then be trained with

smallCorpus = spark.read.option("header", "True").csv("src/test/resources/classifier/sentiment.csv")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setBatchSize(32) \
    .setMaxEpochs(1) \
    .setLr(5e-3) \
    .setDropout(0.5)

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        docClassifier
      ]
    )

pipelineModel = pipeline.fit(smallCorpus)
// In this example, `sentiment.csv` is in the form
//
// text,label
// This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
// This was a terrible movie! The acting was bad really bad!,1
//
// The model can then be trained with
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.{SentimentDLApproach, SentimentDLModel}
import org.apache.spark.ml.Pipeline

val smallCorpus = spark.read.option("header", "true").csv("src/test/resources/classifier/sentiment.csv")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val useEmbeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val docClassifier = new SentimentDLApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("sentiment")
  .setLabelColumn("label")
  .setBatchSize(32)
  .setMaxEpochs(1)
  .setLr(5e-3f)
  .setDropout(0.5f)

val pipeline = new Pipeline()
  .setStages(
    Array(
      documentAssembler,
      useEmbeddings,
      docClassifier
    )
  )

val pipelineModel = pipeline.fit(smallCorpus)

SentimentDetector

Trains a rule-based sentiment detector, which calculates a score based on predefined keywords.

A dictionary of predefined sentiment keywords must be provided with setDictionary, where each line contains a word and its class (either positive or negative), separated by a delimiter. The dictionary can be set as a delimited text file.

By default, the sentiment score is assigned the label "positive" if the score is >= 0 and "negative" otherwise. To retrieve the raw sentiment scores, enableScore needs to be set to true.
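
A minimal sketch of that option (assuming the Python setter follows the usual naming and the dictionary file from the example below):

from sparknlp.annotator import SentimentDetector
from sparknlp.common import ReadAs

# Output raw scores (e.g. 1.0, -2.0) instead of "positive"/"negative" labels
sentimentDetector = SentimentDetector() \
    .setInputCols(["lemma", "document"]) \
    .setOutputCol("sentimentScore") \
    .setDictionary("default-sentiment-dict.txt", ",", ReadAs.TEXT) \
    .setEnableScore(True)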

For extended examples of usage, see the Examples and the SentimentTestSpec.

Input Annotator Types: TOKEN, DOCUMENT

Output Annotator Type: SENTIMENT

Python API: SentimentDetector Scala API: SentimentDetector Source: SentimentDetector
Show Example
# In this example, the dictionary `default-sentiment-dict.txt` has the form of
#
# ...
# cool,positive
# superb,positive
# bad,negative
# uninspired,negative
# ...
#
# where each sentiment keyword is delimited by `","`.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("lemmas_small.txt", "->", "\t")

sentimentDetector = SentimentDetector() \
    .setInputCols(["lemma", "document"]) \
    .setOutputCol("sentimentScore") \
    .setDictionary("default-sentiment-dict.txt", ",", ReadAs.TEXT)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    lemmatizer,
    sentimentDetector,
])

data = spark.createDataFrame([
    ["The staff of the restaurant is nice"],
    ["I recommend others to avoid because it is too expensive"]
]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("sentimentScore.result").show(truncate=False)
+----------+  #  +------+ for enableScore set to True
|result    |  #  |result|
+----------+  #  +------+
|[positive]|  #  |[1.0] |
|[negative]|  #  |[-2.0]|
+----------+  #  +------+
// In this example, the dictionary `default-sentiment-dict.txt` has the form of
//
// ...
// cool,positive
// superb,positive
// bad,negative
// uninspired,negative
// ...
//
// where each sentiment keyword is delimited by `","`.

import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotators.Lemmatizer
import com.johnsnowlabs.nlp.annotators.sda.pragmatic.SentimentDetector
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val lemmatizer = new Lemmatizer()
  .setInputCols("token")
  .setOutputCol("lemma")
  .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")

val sentimentDetector = new SentimentDetector()
  .setInputCols("lemma", "document")
  .setOutputCol("sentimentScore")
  .setDictionary("src/test/resources/sentiment-corpus/default-sentiment-dict.txt", ",", ReadAs.TEXT)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  lemmatizer,
  sentimentDetector,
))

val data = Seq(
  "The staff of the restaurant is nice",
  "I recommend others to avoid because it is too expensive"
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("sentimentScore.result").show(false)
+----------+  //  +------+ for enableScore set to true
|result    |  //  |result|
+----------+  //  +------+
|[positive]|  //  |[1.0] |
|[negative]|  //  |[-2.0]|
+----------+  //  +------+

Stemmer

Returns hard-stems out of words with the objective of retrieving the meaningful part of the word. For extended examples of usage, see the Examples.

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

Python API: Stemmer Scala API: Stemmer Source: Stemmer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    stemmer
])

data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("stem.result").show(truncate = False)
+-------------------------------------------------------------+
|result                                                       |
+-------------------------------------------------------------+
|[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
+-------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Stemmer, Tokenizer}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val stemmer = new Stemmer()
  .setInputCols("token")
  .setOutputCol("stem")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  stemmer
))

val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
  .toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("stem.result").show(truncate = false)
+-------------------------------------------------------------+
|result                                                       |
+-------------------------------------------------------------+
|[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
+-------------------------------------------------------------+

StopWordsCleaner

This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences.

By default, it uses stop words from MLlib's StopWordsRemover. Stop words can also be defined by explicitly setting them with setStopWords(value: Array[String]) or loaded from pretrained models using pretrained() of its companion object.

val stopWords = StopWordsCleaner.pretrained()
  .setInputCols("token")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)
// will load the default pretrained model `"stopwords_en"`.

For available pretrained models please see the Models Hub.
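
A minimal sketch of loading a non-default model by name and language is shown below; the model name "stopwords_iso" and language "fr" are illustrative placeholders, so check the Models Hub for the names that are actually published.

# Illustrative sketch (placeholder model name and language), assuming the
# sparknlp imports shown in the example below.
stopWordsCleanerFr = StopWordsCleaner.pretrained("stopwords_iso", "fr") \
      .setInputCols("token") \
      .setOutputCol("cleanTokens") \
      .setCaseSensitive(False)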

For extended examples of usage, see the Examples and StopWordsCleanerTestSpec.

NOTE: If you need to setStopWords from a text file, you can first read and convert it into an array of strings as follows.

# your stop words text file, each line is one stop word
stopwords = sc.textFile("/tmp/stopwords/english.txt").collect()

# simply use it in StopWordsCleaner
stopWordsCleaner = StopWordsCleaner()\
      .setInputCols("token")\
      .setOutputCol("cleanTokens")\
      .setStopWords(stopwords)\
      .setCaseSensitive(False)

# or you can use pretrained models for StopWordsCleaner
stopWordsCleaner = StopWordsCleaner.pretrained() \
      .setInputCols("token")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

// your stop words text file, each line is one stop word
val stopwords = sc.textFile("/tmp/stopwords/english.txt").collect()

// simply use it in StopWordsCleaner
val stopWordsCleaner = new StopWordsCleaner()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setStopWords(stopwords)
      .setCaseSensitive(false)

// or you can use pretrained models for StopWordsCleaner
val stopWordsCleaner = StopWordsCleaner.pretrained()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

Python API: StopWordsCleaner Scala API: StopWordsCleaner Source: StopWordsCleaner
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

stopWords = StopWordsCleaner() \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
      documentAssembler,
      sentenceDetector,
      tokenizer,
      stopWords
    ])

data = spark.createDataFrame([
    ["This is my first sentence. This is my second."],
    ["This is my third sentence. This is my forth."]
]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("cleanTokens.result").show(truncate=False)
+-------------------------------+
|result                         |
+-------------------------------+
|[first, sentence, ., second, .]|
|[third, sentence, ., forth, .] |
+-------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.StopWordsCleaner
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val stopWords = new StopWordsCleaner()
  .setInputCols("token")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    stopWords
  ))

val data = Seq(
  "This is my first sentence. This is my second.",
  "This is my third sentence. This is my forth."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("cleanTokens.result").show(false)
+-------------------------------+
|result                         |
+-------------------------------+
|[first, sentence, ., second, .]|
|[third, sentence, ., forth, .] |
+-------------------------------+

SymmetricDelete Spellchecker

Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.

Inspired by SymSpell.

For instantiated/pretrained models, see SymmetricDeleteModel.

See SymmetricDeleteModelTestSpec for further reference.
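
A pretrained SymmetricDeleteModel can also be used directly for inference, roughly as sketched below; pretrained() with no arguments is assumed to load the default English model (check the Models Hub for available models).

# Minimal sketch, assuming the sparknlp imports shown in the example below.
spellModel = SymmetricDeleteModel.pretrained() \
      .setInputCols("token") \
      .setOutputCol("spell")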

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

Python API: SymmetricDeleteApproach Scala API: SymmetricDeleteApproach Source: SymmetricDeleteApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, the dictionary `"words.txt"` has the form of
#
# ...
# gummy
# gummic
# gummier
# gummiest
# gummiferous
# ...
#
# This dictionary is then set to be the basis of the spell checker.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell") \
    .setDictionary("src/test/resources/spell/words.txt")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])

# `trainingData` is assumed to be a DataFrame with a "text" column.
pipelineModel = pipeline.fit(trainingData)
// In this example, the dictionary `"words.txt"` has the form of
//
// ...
// gummy
// gummic
// gummier
// gummiest
// gummiferous
// ...
//
// This dictionary is then set to be the basis of the spell checker.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.symmetric.SymmetricDeleteApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val spellChecker = new SymmetricDeleteApproach()
  .setInputCols("token")
  .setOutputCol("spell")
  .setDictionary("src/test/resources/spell/words.txt")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  spellChecker
))

// `trainingData` is assumed to be a DataFrame with a "text" column.
val pipelineModel = pipeline.fit(trainingData)

TextMatcher

Annotator to match exact phrases (by token) provided in a file against a Document.

A text file of predefined phrases must be provided with setEntities.

For extended examples of usage, see the Examples and the TextMatcherTestSpec.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Python API: TextMatcher Scala API: TextMatcher Source: TextMatcher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import ReadAs
from pyspark.ml import Pipeline
# In this example, the entities file is of the form
#
# ...
# dolore magna aliqua
# lorem ipsum dolor. sit
# laborum
# ...
#
# where each line represents an entity phrase to be extracted.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
entityExtractor = TextMatcher() \
    .setInputCols(["document", "token"]) \
    .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
    .setOutputCol("entity") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
results = pipeline.fit(data).transform(data)

results.selectExpr("explode(entity) as result").show(truncate=False)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
+------------------------------------------------------------------------------------------+
// In this example, the entities file is of the form
//
// ...
// dolore magna aliqua
// lorem ipsum dolor. sit
// laborum
// ...
//
// where each line represents an entity phrase to be extracted.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.TextMatcher
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
val entityExtractor = new TextMatcher()
  .setInputCols("document", "token")
  .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
  .setOutputCol("entity")
  .setCaseSensitive(false)
  .setTokenizer(tokenizer.fit(data))

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
val results = pipeline.fit(data).transform(data)

results.selectExpr("explode(entity) as result").show(false)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
+------------------------------------------------------------------------------------------+

Token2Chunk

Converts TOKEN type Annotations to CHUNK type.

This can be useful if entities have already been extracted as TOKEN and following annotators require CHUNK types.

Input Annotator Types: TOKEN

Output Annotator Type: CHUNK

Python API: Token2Chunk Scala API: Token2Chunk Source: Token2Chunk
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline


documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

token2chunk = Token2Chunk() \
    .setInputCols(["token"]) \
    .setOutputCol("chunk")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    token2chunk
])

data = spark.createDataFrame([["One Two Three Four"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk) as result").show(truncate=False)
+------------------------------------------+
|result                                    |
+------------------------------------------+
|[chunk, 0, 2, One, [sentence -> 0], []]   |
|[chunk, 4, 6, Two, [sentence -> 0], []]   |
|[chunk, 8, 12, Three, [sentence -> 0], []]|
|[chunk, 14, 17, Four, [sentence -> 0], []]|
+------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Token2Chunk, Tokenizer}

import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val token2chunk = new Token2Chunk()
  .setInputCols("token")
  .setOutputCol("chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  token2chunk
))

val data = Seq("One Two Three Four").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk) as result").show(false)
+------------------------------------------+
|result                                    |
+------------------------------------------+
|[chunk, 0, 2, One, [sentence -> 0], []]   |
|[chunk, 4, 6, Two, [sentence -> 0], []]   |
|[chunk, 8, 12, Three, [sentence -> 0], []]|
|[chunk, 14, 17, Four, [sentence -> 0], []]|
+------------------------------------------+

TokenAssembler

This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Requires DOCUMENT and TOKEN type annotations as input.

For more extended examples on document pre-processing see the Examples.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: DOCUMENT

Python API: TokenAssembler Scala API: TokenAssembler Source: TokenAssembler
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# First, the text is tokenized and cleaned
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(False)

stopwordsCleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
tokenAssembler = TokenAssembler() \
    .setInputCols(["sentences", "cleanTokens"]) \
    .setOutputCol("cleanText")

data = spark.createDataFrame([["Spark NLP is an open-source text processing library for advanced natural language processing."]]) \
    .toDF("text")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    tokenizer,
    normalizer,
    stopwordsCleaner,
    tokenAssembler
]).fit(data)

result = pipeline.transform(data)
result.select("cleanText").show(truncate=False)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner}
import com.johnsnowlabs.nlp.TokenAssembler
import org.apache.spark.ml.Pipeline

// First, the text is tokenized and cleaned
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols("sentences")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(false)

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

// Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
val tokenAssembler = new TokenAssembler()
  .setInputCols("sentences", "cleanTokens")
  .setOutputCol("cleanText")

val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.")
  .toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  tokenAssembler
)).fit(data)

val result = pipeline.transform(data)
result.select("cleanText").show(false)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+

Tokenizer

Tokenizes raw text in document type columns into TokenizedSentence.

This class represents a non-fitted tokenizer. Fitting it will cause the internal RuleFactory to construct the rules for tokenizing from the input configuration.

Identifies tokens using open tokenization standards. A few rules can be customized if the defaults do not fit user needs.

For extended examples of usage, see the Examples and the Tokenizer test class.

Input Annotator Types: DOCUMENT

Output Annotator Type: TOKEN

Note: All these APIs receive regular expressions so please make sure that you escape special characters according to Java conventions.
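
A minimal sketch of customizing these rules follows; the parameter values are illustrative only.

# Illustrative sketch: regex parameters follow Java escaping conventions,
# e.g. "\\S+" matches runs of non-whitespace characters.
customTokenizer = Tokenizer() \
      .setInputCols("document") \
      .setOutputCol("token") \
      .setTargetPattern("\\S+") \
      .setContextChars(["(", ")", "?", "!", ".", ","])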

Python API: Tokenizer Scala API: Tokenizer Source: Tokenizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

data = spark.createDataFrame([["I'd like to say we didn't expect that. Jane's boyfriend."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)

pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
result = pipeline.transform(data)

result.selectExpr("token.result").show(truncate=False)
+-----------------------------------------------------------------------+
|result                                                                 |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline

val data = Seq("I'd like to say we didn't expect that. Jane's boyfriend.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val tokenizer = new Tokenizer().setInputCols("document").setOutputCol("token").fit(data)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer)).fit(data)
val result = pipeline.transform(data)

result.selectExpr("token.result").show(false)
+-----------------------------------------------------------------------+
|result                                                                 |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+

TypedDependencyParser

Labeled parser that finds a grammatical relation between two words in a sentence. Its input is either a CoNLL 2009 or a CoNLL-U dataset.

For instantiated/pretrained models, see TypedDependencyParserModel.
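
A rough sketch of using such a pretrained model for inference instead of training the approach is shown below; pretrained() with no arguments is assumed to load the default English model.

# Minimal sketch: the pretrained model still needs POS tags and unlabeled
# dependencies as input, as produced by the pipeline in the example below.
typedDependencyParser = TypedDependencyParserModel.pretrained() \
      .setInputCols(["dependency", "pos", "token"]) \
      .setOutputCol("dependency_type")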

Dependency parsers provide information about word relationships. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions.

The parser requires the dependency-parsed tokens beforehand, e.g. from DependencyParser. The required training data can be set in two different ways (only one can be chosen for a particular model):

  • a dataset in the CoNLL 2009 format, set with setConll2009
  • a dataset in the CoNLL-U format, set with setConllU (as in the example below)

Apart from that, no additional training data is needed.

See TypedDependencyParserApproachTestSpec for further reference on this API.

Input Annotator Types: TOKEN, POS, DEPENDENCY

Output Annotator Type: LABELED_DEPENDENCY

Python API: TypedDependencyParserApproach Scala API: TypedDependencyParserApproach Source: TypedDependencyParserApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

posTagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

dependencyParser = DependencyParserModel.pretrained() \
    .setInputCols(["sentence", "pos", "token"]) \
    .setOutputCol("dependency")

typedDependencyParser = TypedDependencyParserApproach() \
    .setInputCols(["dependency", "pos", "token"]) \
    .setOutputCol("dependency_type") \
    .setConllU("src/test/resources/parser/labeled/train_small.conllu.txt") \
    .setNumberOfIterations(1)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    posTagger,
    dependencyParser,
    typedDependencyParser
])

# Additional training data is not needed, the dependency parser relies on CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

val dependencyParser = DependencyParserModel.pretrained()
  .setInputCols("sentence", "pos", "token")
  .setOutputCol("dependency")

val typedDependencyParser = new TypedDependencyParserApproach()
  .setInputCols("dependency", "pos", "token")
  .setOutputCol("dependency_type")
  .setConllU("src/test/resources/parser/labeled/train_small.conllu.txt")
  .setNumberOfIterations(1)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  posTagger,
  dependencyParser,
  typedDependencyParser
))

// Additional training data is not needed, the dependency parser relies on CoNLL-U only.
val emptyDataSet = Seq.empty[String].toDF("text")
val pipelineModel = pipeline.fit(emptyDataSet)

ViveknSentiment

Trains a sentiment analyser inspired by the algorithm by Vivek Narayanan https://github.com/vivekn/sentiment/.

The algorithm is based on the paper “Fast and accurate sentiment classification using an enhanced Naive Bayes model”.

The analyzer requires sentence boundaries to give a score in context. Tokenization is needed to make sure tokens are within bounds. Transitivity requirements also apply.

The training data needs to consist of a column for normalized text and a label column (either "positive" or "negative").

For extended examples of usage, see the Examples and the ViveknSentimentTestSpec.

Input Annotator Types: TOKEN, DOCUMENT

Output Annotator Type: SENTIMENT

Python API: ViveknSentimentApproach Scala API: ViveknSentimentApproach Source: ViveknSentimentApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

token = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normal")

vivekn = ViveknSentimentApproach() \
    .setInputCols(["document", "normal"]) \
    .setSentimentCol("train_sentiment") \
    .setOutputCol("result_sentiment")

finisher = Finisher() \
    .setInputCols(["result_sentiment"]) \
    .setOutputCols("final_sentiment")

pipeline = Pipeline().setStages([document, token, normalizer, vivekn, finisher])

training = spark.createDataFrame([
    ("I really liked this movie!", "positive"),
    ("The cast was horrible", "negative"),
    ("Never going to watch this again or recommend it to anyone", "negative"),
    ("It's a waste of time", "negative"),
    ("I loved the protagonist", "positive"),
    ("The music was really really good", "positive")
]).toDF("text", "train_sentiment")
pipelineModel = pipeline.fit(training)

data = spark.createDataFrame([
    ["I recommend this movie"],
    ["Dont waste your time!!!"]
]).toDF("text")
result = pipelineModel.transform(data)

result.select("final_sentiment").show(truncate=False)
+---------------+
|final_sentiment|
+---------------+
|[positive]     |
|[negative]     |
+---------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.Normalizer
import com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach
import com.johnsnowlabs.nlp.Finisher
import org.apache.spark.ml.Pipeline

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normal")

val vivekn = new ViveknSentimentApproach()
  .setInputCols("document", "normal")
  .setSentimentCol("train_sentiment")
  .setOutputCol("result_sentiment")

val finisher = new Finisher()
  .setInputCols("result_sentiment")
  .setOutputCols("final_sentiment")

val pipeline = new Pipeline().setStages(Array(document, token, normalizer, vivekn, finisher))

val training = Seq(
  ("I really liked this movie!", "positive"),
  ("The cast was horrible", "negative"),
  ("Never going to watch this again or recommend it to anyone", "negative"),
  ("It's a waste of time", "negative"),
  ("I loved the protagonist", "positive"),
  ("The music was really really good", "positive")
).toDF("text", "train_sentiment")
val pipelineModel = pipeline.fit(training)

val data = Seq(
  "I recommend this movie",
  "Dont waste your time!!!"
).toDF("text")
val result = pipelineModel.transform(data)

result.select("final_sentiment").show(false)
+---------------+
|final_sentiment|
+---------------+
|[positive]     |
|[negative]     |
+---------------+

Word2Vec

Trains a Word2Vec model that creates vector representations of words in a text corpus.

The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.

We use the Word2Vec implementation from Spark ML. It uses the skip-gram model and a hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.

For instantiated/pretrained models, see Word2VecModel.
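
As a minimal sketch, a pretrained Word2VecModel can be loaded instead of training an approach; pretrained() with no arguments is assumed to load the default English model.

# Minimal sketch, assuming the sparknlp imports shown in the example below.
embeddings = Word2VecModel.pretrained() \
      .setInputCols("token") \
      .setOutputCol("embeddings")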

Sources :

For the original C implementation, see https://code.google.com/p/word2vec/

For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.

Input Annotator Types: TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: Word2VecApproach Scala API: Word2VecApproach Source: Word2VecApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = Word2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      embeddings
    ])

path = "sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")
pipelineModel = pipeline.fit(dataset)
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Word2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = new Word2VecApproach()
  .setInputCols("token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings
  ))

val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
  .toDF("text")
val pipelineModel = pipeline.fit(dataset)

WordEmbeddings

Word Embeddings lookup annotator that maps tokens to vectors.

For instantiated/pretrained models, see WordEmbeddingsModel.
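
In practice, pretrained embeddings are often loaded instead of a custom lookup file; a minimal sketch, assuming pretrained() with no arguments loads the default English "glove_100d" model:

# Minimal sketch, assuming the sparknlp imports shown in the example below.
embeddings = WordEmbeddingsModel.pretrained() \
      .setInputCols(["document", "token"]) \
      .setOutputCol("embeddings")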

A custom token lookup dictionary for embeddings can be set with setStoragePath. Each line of the provided file needs to have a token, followed by its vector representation, delimited by spaces.

...
are 0.39658191506190343 0.630968081620067 0.5393722253731201 0.8428180123359783
were 0.7535235923631415 0.9699218875629833 0.10397182122983872 0.11833962569383116
stress 0.0492683418305907 0.9415954572751959 0.47624463167525755 0.16790967216778263
induced 0.1535748762292387 0.33498936903209897 0.9235178224122094 0.1158772920395934
...

If a token is not found in the dictionary, then the result will be a zero vector of the same dimension. Statistics about the rate of converted tokens can be retrieved with WordEmbeddingsModel.withCoverageColumn and WordEmbeddingsModel.overallCoverage.
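
A rough sketch of retrieving these statistics from a transformed DataFrame (variable and column names are illustrative):

# Illustrative sketch: `result` is assumed to be a DataFrame that already
# contains the "embeddings" output column of a WordEmbeddingsModel.
withCoverage = WordEmbeddingsModel.withCoverageColumn(result, "embeddings", "coverage")
overallCoverage = WordEmbeddingsModel.overallCoverage(result, "embeddings").percentage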

For extended examples of usage, see the Examples and the WordEmbeddingsTestSpec.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: WordEmbeddings Scala API: WordEmbeddings Source: WordEmbeddings
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import ReadAs
from pyspark.ml import Pipeline
# In this example, the file `random_embeddings_dim4.txt` has the form of the content above.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddings() \
    .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT) \
    .setStorageRef("glove_4d") \
    .setDimension(4) \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      embeddings,
      embeddingsFinisher
    ])

data = spark.createDataFrame([["The patient was diagnosed with diabetes."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(truncate=False)
+----------------------------------------------------------------------------------+
|result                                                                            |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316]     |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307]    |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048]    |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149]    |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938]    |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863]     |
+----------------------------------------------------------------------------------+
// In this example, the file `random_embeddings_dim4.txt` has the form of the content above.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddings
import com.johnsnowlabs.nlp.util.io.ReadAs
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = new WordEmbeddings()
  .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT)
  .setStorageRef("glove_4d")
  .setDimension(4)
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
  ))

val data = Seq("The patient was diagnosed with diabetes.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(false)
+----------------------------------------------------------------------------------+
|result                                                                            |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316]     |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307]    |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048]    |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149]    |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938]    |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863]     |
+----------------------------------------------------------------------------------+

WordSegmenter

Trains a WordSegmenter which tokenizes non-English or non-whitespace-separated texts.

Many languages, such as Korean, Japanese or Chinese, are not whitespace-separated: their sentences are a concatenation of symbols without explicit word boundaries. Without understanding the language, splitting the words into their corresponding tokens is impossible. The WordSegmenter is trained to understand these languages and split them into semantically correct parts.

This annotator is based on the paper Chinese Word Segmentation as Character Tagging [1]. Word segmentation is treated as a tagging problem. Each character is tagged as one of four labels: LL (left boundary), RR (right boundary), MM (middle) and LR (word by itself). The label depends on the position of the character within a word: characters tagged LL combine with the character to their right, characters tagged RR combine with the character to their left, characters tagged MM lie in the middle of a word and attach to both sides, and characters tagged LR form words by themselves.

Example (from [1], Example 3(a) (raw), 3(b) (tagged), 3(c) (translation)):

  • 上海 计划 到 本 世纪 末 实现 人均 国内 生产 总值 五千 美元
  • 上/LL 海/RR 计/LL 划/RR 到/LR 本/LR 世/LL 纪/RR 末/LR 实/LL 现/RR 人/LL 均/RR 国/LL 内/RR 生/LL 产/RR 总/LL 值/RR 五/LL 千/RR 美/LL 元/RR
  • Shanghai plans to reach the goal of 5,000 dollars in per capita GDP by the end of the century.

For instantiated/pretrained models, see WordSegmenterModel.

To train your own model, a training dataset consisting of Part-Of-Speech tags is required. The data has to be loaded into a DataFrame, where the tag column contains Annotations of type "POS". The name of this column can be set with setPosColumn.

Tip: The helper class POS might be useful to read training data into data frames.

For extended examples of usage, see the Examples and the WordSegmenterTest.

References:

  • [1] Xue, Nianwen. “Chinese Word Segmentation as Character Tagging.” International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing, 2003, pp. 29-48. ACLWeb, https://aclanthology.org/O03-4002.

Input Annotator Types: DOCUMENT

Output Annotator Type: TOKEN

Python API: WordSegmenterApproach Scala API: WordSegmenterApproach Source: WordSegmenterApproach
Show Example
# In this example, `"chinese_train.utf8"` is in the form of
#
# 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
#
# and is loaded with the `POS` class to create a dataframe of `"POS"` type Annotations.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

wordSegmenter = WordSegmenterApproach() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPosColumn("tags") \
    .setNIterations(5)

pipeline = Pipeline().setStages([
    documentAssembler,
    wordSegmenter
])

trainingDataSet = POS().readDataset(
    spark,
    "src/test/resources/word-segmenter/chinese_train.utf8"
)

pipelineModel = pipeline.fit(trainingDataSet)
// In this example, `"chinese_train.utf8"` is in the form of
//
// 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
//
// and is loaded with the `POS` class to create a dataframe of `"POS"` type Annotations.

import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ws.WordSegmenterApproach
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.util.io.ResourceHelper
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val wordSegmenter = new WordSegmenterApproach()
  .setInputCols("document")
  .setOutputCol("token")
  .setPosColumn("tags")
  .setNIterations(5)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  wordSegmenter
))

val trainingDataSet = POS().readDataset(
  ResourceHelper.spark,
  "src/test/resources/word-segmenter/chinese_train.utf8"
)

val pipelineModel = pipeline.fit(trainingDataSet)

YakeKeywordExtraction

Yake is an Unsupervised, Corpus-Independent, Domain and Language-Independent and Single-Document keyword extraction algorithm.

Extracting keywords from texts has become a challenge for individuals and organizations as the information grows in complexity and size. The need to automate this task so that text can be processed in a timely and adequate manner has led to the emergence of automatic keyword extraction tools. Yake is a novel feature-based system for multi-lingual keyword extraction, which supports texts of different sizes, domain or languages. Unlike other approaches, Yake does not rely on dictionaries nor thesauri, neither is trained against any corpora. Instead, it follows an unsupervised approach which builds upon features extracted from the text, making it thus applicable to documents written in different languages without the need for further knowledge. This can be beneficial for a large number of tasks and a plethora of situations where access to training corpora is either limited or restricted. The algorithm makes use of the position of a sentence and token. Therefore, to use the annotator, the text should be first sent through a Sentence Boundary Detector and then a tokenizer.

Note that each keyword will be given a keyword score greater than 0 (the lower the score, the better the keyword). Therefore, to filter the keywords, an upper bound for the score can be set with setThreshold.

For extended examples of usage, see the Examples and the YakeTestSpec.

Sources :

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. In Information Sciences Journal. Elsevier, Vol 509, pp 257-289.

Paper abstract:

As the amount of generated information grows, reading and summarizing texts of large collections turns into a challenging task. Many documents do not come with descriptive terms, thus requiring humans to generate keywords on-the-fly. The need to automate this kind of task demands the development of keyword extraction systems with the ability to automatically identify keywords within the text. One approach is to resort to machine-learning algorithms. These, however, depend on large annotated text corpora, which are not always available. An alternative solution is to consider an unsupervised approach. In this article, we describe YAKE!, a light-weight unsupervised automatic keyword extraction method which rests on statistical text features extracted from single documents to select the most relevant keywords of a text. Our system does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, text size, language, or domain. To demonstrate the merits and significance of YAKE!, we compare it against ten state-of-the-art unsupervised approaches and one supervised method. Experimental results carried out on top of twenty datasets show that YAKE! significantly outperforms other unsupervised methods on texts of different sizes, languages, and domains.

Input Annotator Types: TOKEN

Output Annotator Type: CHUNK

Python API: YakeKeywordExtraction Scala API: YakeKeywordExtraction Source: YakeKeywordExtraction
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

token = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(["(", "]", "?", "!", ".", ","])

keywords = YakeKeywordExtraction() \
    .setInputCols(["token"]) \
    .setOutputCol("keywords") \
    .setThreshold(0.6) \
    .setMinNGrams(2) \
    .setNKeywords(10)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    token,
    keywords
])

data = spark.createDataFrame([[
    "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom  and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) since its   launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, NaRavikant, Google chie economist Hal Varian, Khosla Ventures and Yuri Milner"
]]).toDF("text")
result = pipeline.fit(data).transform(data)

# combine the result and score (contained in keywords.metadata)
scores = result \
    .selectExpr("explode(arrays_zip(keywords.result, keywords.metadata)) as resultTuples") \
    .selectExpr("resultTuples['0'] as keyword", "resultTuples['1'].score as score")

# Order ascending, as lower scores means higher importance
scores.orderBy("score").show(5, truncate = False)
+---------------------+-------------------+
|keyword              |score              |
+---------------------+-------------------+
|google cloud         |0.32051516486864573|
|google cloud platform|0.37786450577630676|
|ceo anthony goldbloom|0.39922830978423146|
|san francisco        |0.40224744669493756|
|anthony goldbloom    |0.41584827825302534|
+---------------------+-------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Tokenizer}
import com.johnsnowlabs.nlp.annotators.keyword.yake.YakeKeywordExtraction
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val token = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
  .setContextChars(Array("(", ")", "?", "!", ".", ","))

val keywords = new YakeKeywordExtraction()
  .setInputCols("token")
  .setOutputCol("keywords")
  .setThreshold(0.6f)
  .setMinNGrams(2)
  .setNKeywords(10)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  token,
  keywords
))

val data = Seq(
  "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom  and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) since its   launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, Google chief economist Hal Varian, Khosla Ventures and Yuri Milner"
).toDF("text")
val result = pipeline.fit(data).transform(data)

// combine the result and score (contained in keywords.metadata)
val scores = result
  .selectExpr("explode(arrays_zip(keywords.result, keywords.metadata)) as resultTuples")
  .select($"resultTuples.0" as "keyword", $"resultTuples.1.score")

// Order ascending, as lower scores means higher importance
scores.orderBy("score").show(5, truncate = false)
+---------------------+-------------------+
|keyword              |score              |
+---------------------+-------------------+
|google cloud         |0.32051516486864573|
|google cloud platform|0.37786450577630676|
|ceo anthony goldbloom|0.39922830978423146|
|san francisco        |0.40224744669493756|
|anthony goldbloom    |0.41584827825302534|
+---------------------+-------------------+