How to read this section
All annotators in Spark NLP share a common interface:
- Annotation: Annotation(annotatorType, begin, end, result, metadata, embeddings)
- AnnotatorType: Some annotators share a type. This is not merely descriptive; it also determines the structure of the metadata map in the Annotation. This type is what the inputs and outputs of annotators refer to.
- Inputs: Represents how many and which annotator types are expected in setInputCols(). These are the column names of the outputs of other annotators in the DataFrame.
- Output: Represents the type of the output in the column set with setOutputCol().
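For example, the following minimal Python sketch (column names are only illustrative) wires the DOCUMENT output of a DocumentAssembler into a Tokenizer and inspects the resulting Annotation structure:
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

# DocumentAssembler outputs DOCUMENT-type annotations into the "document" column
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Tokenizer expects a DOCUMENT-type input column and outputs TOKEN-type annotations
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

data = spark.createDataFrame([["Spark NLP annotators share a common interface."]]).toDF("text")
result = Pipeline(stages=[documentAssembler, tokenizer]).fit(data).transform(data)

# Each element is an Annotation(annotatorType, begin, end, result, metadata, embeddings)
result.selectExpr("explode(token) as annotation").show(truncate=False)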
There are two types of annotators:
- Approach: AnnotatorApproach classes extend Estimator and are meant to be trained through fit().
- Model: AnnotatorModel classes extend Transformer and are meant to transform DataFrames through transform().
The Model suffix is stated explicitly when the annotator is the result of a training process. Some annotators, such as Tokenizer, are transformers but do not contain the word Model, since they are not trained annotators.
Model annotators have a pretrained() method on their static object to retrieve the public pre-trained version of a model:
pretrained(name, language, extra_location) -> by default, pretrained() downloads a default model. Some annotators offer more than one model; in that case, use the name, language, or extra location to download the one you need.
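For instance, mirroring the BGEEmbeddings example later in this section, the default model and a specific pretrained model can be requested as follows (a minimal Python sketch; the column names are illustrative):
from sparknlp.annotator import BGEEmbeddings

# Default model, resolved automatically when no name is given
defaultEmbeddings = BGEEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")

# Requesting a specific model by name and language, as listed on the Models Hub
namedEmbeddings = BGEEmbeddings.pretrained("bge_base", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")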
Available Annotators
Annotator | Description | Version |
---|---|---|
AutoGGUFModel | Annotator that uses the llama.cpp library to generate text completions with large language models. | Opensource |
BGEEmbeddings | Sentence embeddings using BGE. | Opensource |
BigTextMatcher | Annotator to match exact phrases (by token) provided in a file against a Document. | Opensource |
Chunk2Doc | Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result. | Opensource |
ChunkEmbeddings | This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs. | Opensource |
ChunkTokenizer | Tokenizes and flattens extracted NER chunks. | Opensource |
Chunker | This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from document. | Opensource |
ClassifierDL | ClassifierDL for generic Multi-class Text Classification. | Opensource |
ContextSpellChecker | Implements a deep-learning based Noisy Channel Model Spell Algorithm. | Opensource |
Date2Chunk | Converts DATE type Annotations to CHUNK type. | Opensource |
DateMatcher | Matches standard date formats into a provided format. | Opensource |
DependencyParser | Unlabeled parser that finds a grammatical relation between two words in a sentence. | Opensource |
Doc2Chunk | Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. | Opensource |
Doc2Vec | Word2Vec model that creates vector representations of words in a text corpus. | Opensource |
DocumentAssembler | Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. | Opensource |
DocumentCharacterTextSplitter | Annotator which splits large documents into chunks of roughly given size. | Opensource |
DocumentNormalizer | Annotator which normalizes raw text from tagged text, e.g. scraped web pages or xml documents, from document type columns into Sentence. | Opensource |
DocumentSimilarityRanker | Annotator that uses LSH techniques present in Spark ML lib to execute approximate nearest neighbors search on top of sentence embeddings. | Opensource |
DocumentTokenSplitter | Annotator that splits large documents into smaller documents based on the number of tokens in the text. | Opensource |
EntityRuler | Fits an Annotator to match exact strings or regex patterns provided in a file against a Document and assigns them a named entity. | Opensource |
EmbeddingsFinisher | Extracts embeddings from Annotations into a more easily usable form. | Opensource |
Finisher | Converts annotation results into a format that is easier to use. It is useful for extracting the results from Spark NLP Pipelines. | Opensource |
GraphExtraction | Extracts a dependency graph between entities. | Opensource |
GraphFinisher | Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF. | Opensource |
ImageAssembler | Prepares images read by Spark into a format that is processable by Spark NLP. | Opensource |
LanguageDetectorDL | Language Identification and Detection by using CNN and RNN architectures in TensorFlow. | Opensource |
Lemmatizer | Finds lemmas out of words with the objective of returning a base dictionary word. | Opensource |
MultiClassifierDL | Multi-label Text Classification. | Opensource |
MultiDateMatcher | Matches standard date formats into a provided format. | Opensource |
MultiDocumentAssembler | Prepares data into a format that is processable by Spark NLP. | Opensource |
NGramGenerator | A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). | Opensource |
NerConverter | Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label. | Opensource |
NerCrf | Extracts Named Entities based on a CRF Model. | Opensource |
NerDL | This Named Entity recognition annotator is a generic NER model based on Neural Networks. | Opensource |
NerOverwriter | Overwrites entities of specified strings. | Opensource |
Normalizer | Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary. | Opensource |
NorvigSweeting Spellchecker | Retrieves tokens and makes corrections automatically if not found in an English dictionary. | Opensource |
POSTagger (Part of speech tagger) | Averaged Perceptron model to tag words part-of-speech. | Opensource |
PromptAssembler | Assembles a sequence of messages into a single string using a template. | Opensource |
RecursiveTokenizer | Tokenizes raw text recursively based on a handful of definable rules. | Opensource |
RegexMatcher | Uses rules to match a set of regular expressions and associate them with a provided identifier. | Opensource |
RegexTokenizer | A tokenizer that splits text by a regex pattern. | Opensource |
SentenceDetector | Annotator that detects sentence boundaries using regular expressions. | Opensource |
SentenceDetectorDL | Detects sentence boundaries using a deep learning approach. | Opensource |
SentenceEmbeddings | Converts the results from WordEmbeddings, BertEmbeddings, or ElmoEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols). | Opensource |
SentimentDL | Annotator for multi-class sentiment analysis. | Opensource |
SentimentDetector | Rule based sentiment detector, which calculates a score based on predefined keywords. | Opensource |
Stemmer | Returns hard-stems out of words with the objective of retrieving the meaningful part of the word. | Opensource |
StopWordsCleaner | This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences. | Opensource |
SymmetricDelete Spellchecker | Symmetric Delete spelling correction algorithm. | Opensource |
TextMatcher | Matches exact phrases (by token) provided in a file against a Document. | Opensource |
Token2Chunk | Converts TOKEN type Annotations to CHUNK type. | Opensource |
TokenAssembler | This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. | Opensource |
Tokenizer | Tokenizes raw text into word pieces (tokens). Identifies tokens with tokenization open standards. A few rules will help customize it if the defaults do not fit user needs. | Opensource |
TypedDependencyParser | Labeled parser that finds a grammatical relation between two words in a sentence. | Opensource |
ViveknSentiment | Sentiment analyser inspired by the algorithm by Vivek Narayanan. | Opensource |
WordEmbeddings | Word Embeddings lookup annotator that maps tokens to vectors. | Opensource |
Word2Vec | Word2Vec model that creates vector representations of words in a text corpus. | Opensource |
WordSegmenter | Tokenizes non-English or non-whitespace-separated texts. | Opensource |
YakeKeywordExtraction | Unsupervised, Corpus-Independent, Domain and Language-Independent and Single-Document keyword extraction. | Opensource |
Available Transformers
Additionally, these transformers are available.
Transformer | Description | Version |
---|---|---|
AlbertEmbeddings | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | Opensource |
AlbertForQuestionAnswering | AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
AlbertForTokenClassification | AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
AlbertForSequenceClassification | AlbertForSequenceClassification can load ALBERT Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
BartForZeroShotClassification | BartForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. | Opensource |
BartTransformer | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Transformer | Opensource |
BertForQuestionAnswering | BertForQuestionAnswering can load Bert Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
BertForSequenceClassification | Bert Models with sequence classification/regression head on top. | Opensource |
BertForTokenClassification | BertForTokenClassification can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
BertForZeroShotClassification | BertForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. | Opensource |
BertSentenceEmbeddings | Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. | Opensource |
CamemBertEmbeddings | CamemBert is based on Facebook’s RoBERTa model released in 2019. | Opensource |
CamemBertForQuestionAnswering | CamemBertForQuestionAnswering can load CamemBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD | Opensource |
CamemBertForSequenceClassification | CamemBertForSequenceClassification can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. | Opensource |
CamemBertForTokenClassification | CamemBertForTokenClassification can load CamemBERT Models with a token classification head on top | Opensource |
CLIPForZeroShotClassification | Zero Shot Image Classifier based on CLIP | Opensource |
ConvNextForImageClassification | ConvNextForImageClassification is an image classifier based on ConvNet models | Opensource |
DeBertaEmbeddings | DeBERTa builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in RoBERTa. | Opensource |
DeBertaForQuestionAnswering | DeBertaForQuestionAnswering can load DeBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
DeBertaForSequenceClassification | DeBertaForSequenceClassification can load DeBerta v2 & v3 Models with sequence classification/regression head on top. | Opensource |
DeBertaForTokenClassification | DeBertaForTokenClassification can load DeBERTA Models v2 and v3 with a token classification head on top. | Opensource |
DistilBertEmbeddings | DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. | Opensource |
DistilBertForQuestionAnswering | DistilBertForQuestionAnswering can load DistilBert Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
DistilBertForSequenceClassification | DistilBertForSequenceClassification can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. | Opensource |
DistilBertForTokenClassification | DistilBertForTokenClassification can load DistilBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
DistilBertForZeroShotClassification | DistilBertForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. | Opensource |
E5Embeddings | Sentence embeddings using E5, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task. | Opensource |
ElmoEmbeddings | Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark. | Opensource |
GPT2Transformer | GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. | Opensource |
HubertForCTC | Hubert Model with a language modeling head on top for Connectionist Temporal Classification (CTC). | Opensource |
InstructorEmbeddings | Sentence embeddings using INSTRUCTOR. | Opensource |
LongformerEmbeddings | Longformer is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. | Opensource |
LongformerForQuestionAnswering | LongformerForQuestionAnswering can load Longformer Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
LongformerForSequenceClassification | LongformerForSequenceClassification can load Longformer Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
LongformerForTokenClassification | LongformerForTokenClassification can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
MarianTransformer | Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. | Opensource |
MPNetEmbeddings | Sentence embeddings using MPNet. | Opensource |
MPNetForQuestionAnswering | MPNet Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
MPNetForSequenceClassification | MPNet Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
OpenAICompletion | Transformer that makes a request for OpenAI Completion API for each executor. | Opensource |
RoBertaEmbeddings | RoBERTa: A Robustly Optimized BERT Pretraining Approach | Opensource |
RoBertaForQuestionAnswering | RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
RoBertaForSequenceClassification | RoBertaForSequenceClassification can load RoBERTa Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
RoBertaForTokenClassification | RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
RoBertaForZeroShotClassification | RoBertaForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. | Opensource |
RoBertaSentenceEmbeddings | Sentence-level embeddings using RoBERTa. | Opensource |
SpanBertCoref | A coreference resolution model based on SpanBert. | Opensource |
SwinForImageClassification | SwinImageClassification is an image classifier based on Swin. | Opensource |
T5Transformer | T5 reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. | Opensource |
TapasForQuestionAnswering | TapasForQuestionAnswering is an implementation of TaPas - a BERT-based model specifically designed for answering questions about tabular data. | Opensource |
UAEEmbeddings | Sentence embeddings using Universal AnglE Embedding (UAE). | Opensource |
UniversalSentenceEncoder | The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. | Opensource |
VisionEncoderDecoderForImageCaptioning | VisionEncoderDecoder model that converts images into text captions. | Opensource |
ViTForImageClassification | Vision Transformer (ViT) for image classification. | Opensource |
Wav2Vec2ForCTC | Wav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC). | Opensource |
WhisperForCTC | Whisper Model with a language modeling head on top for Connectionist Temporal Classification (CTC). | Opensource |
XlmRoBertaEmbeddings | XlmRoBerta is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl | Opensource |
XlmRoBertaForQuestionAnswering | XlmRoBertaForQuestionAnswering can load XLM-RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD. | Opensource |
XlmRoBertaForSequenceClassification | XlmRoBertaForSequenceClassification can load XLM-RoBERTa Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
XlmRoBertaForTokenClassification | XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
XlmRoBertaForZeroShotClassification | XlmRoBertaForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks. | Opensource |
XlmRoBertaSentenceEmbeddings | Sentence-level embeddings using XLM-RoBERTa. | Opensource |
XlnetEmbeddings | XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. | Opensource |
XlnetForTokenClassification | XlnetForTokenClassification can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. | Opensource |
XlnetForSequenceClassification | XlnetForSequenceClassification can load XLNet Models with sequence classification/regression head on top e.g. for multi-class document classification tasks. | Opensource |
ZeroShotNer | ZeroShotNerModel implements zero shot named entity recognition by utilizing RoBERTa transformer models fine tuned on a question answering task. | Opensource |
AutoGGUFModel
Annotator that uses the llama.cpp library to generate text completions with large language models.
For settable parameters and their explanations, see HasLlamaCppProperties and refer to the llama.cpp documentation of server.cpp for more information. If the parameters are not set, the annotator defaults to the parameters provided by the model.
Pretrained models can be loaded with the pretrained method of the companion object:
val autoGGUFModel = AutoGGUFModel.pretrained()
.setInputCols("document")
.setOutputCol("completions")
The default model is "phi3.5_mini_4k_instruct_q4_gguf"
, if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the AutoGGUFModelTest and the example notebook.
Note: To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method. When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: AutoGGUFModel | Scala API: AutoGGUFModel | Source: AutoGGUFModel |
Show Example
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
... .setInputCol("text") \
... .setOutputCol("document")
>>> autoGGUFModel = AutoGGUFModel.pretrained() \
... .setInputCols(["document"]) \
... .setOutputCol("completions") \
... .setBatchSize(4) \
... .setNPredict(20) \
... .setNGpuLayers(99) \
... .setTemperature(0.4) \
... .setTopK(40) \
... .setTopP(0.9) \
... .setPenalizeNl(True)
>>> pipeline = Pipeline().setStages([document, autoGGUFModel])
>>> data = spark.createDataFrame([["Hello, I am a"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("completions").show(truncate = False)
+-----------------------------------------------------------------------------------------------------------------------------------+
|completions |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 78, new user. I am currently working on a project and I need to create a list of , {prompt -> Hello, I am a}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------+
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val autoGGUFModel = AutoGGUFModel
.pretrained()
.setInputCols("document")
.setOutputCol("completions")
.setBatchSize(4)
.setNPredict(20)
.setNGpuLayers(99)
.setTemperature(0.4f)
.setTopK(40)
.setTopP(0.9f)
.setPenalizeNl(true)
val pipeline = new Pipeline().setStages(Array(document, autoGGUFModel))
val data = Seq("Hello, I am a").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("completions").show(truncate = false)
+-----------------------------------------------------------------------------------------------------------------------------------+
|completions |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 78, new user. I am currently working on a project and I need to create a list of , {prompt -> Hello, I am a}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------+
BGEEmbeddings
Sentence embeddings using BGE.
BGE, or BAAI General Embeddings, a model that can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
Note that this annotator is only supported for Spark Versions 3.4 and up.
Pretrained models can be loaded with the pretrained method of the companion object:
val embeddings = BGEEmbeddings.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
The default model is "bge_base"
, if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see BGEEmbeddingsTestSpec.
Sources:
C-Pack: Packaged Resources To Advance General Chinese Embedding
Paper abstract
We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.
Input Annotator Types: DOCUMENT
Output Annotator Type: SENTENCE_EMBEDDINGS
Python API: BGEEmbeddings | Scala API: BGEEmbeddings | Source: BGEEmbeddings |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = BGEEmbeddings.pretrained() \
.setInputCols(["document"]) \
.setOutputCol("bge_embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["bge_embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True)
pipeline = Pipeline().setStages([
documentAssembler,
embeddings,
embeddingsFinisher
])
data = spark.createDataFrame([["query: how much protein should a female eat",
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day." + \
"But, as you can see from this chart, you'll need to increase that if you're expecting or training for a" + \
"marathon. Check out the chart below to see how much protein you should be eating each day.",
]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BGEEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = BGEEmbeddings.pretrained("bge_base", "en")
.setInputCols("document")
.setOutputCol("bge_embeddings")
val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("bge_embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
embeddings,
embeddingsFinisher
))
val data = Seq("query: how much protein should a female eat",
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day." +
But, as you can see from this chart, you'll need to increase that if you're expecting or training for a" +
marathon. Check out the chart below to see how much protein you should be eating each day."
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
BigTextMatcher
Annotator to match exact phrases (by token) provided in a file against a Document.
A text file of predefined phrases must be provided with setStoragePath.
In contrast to the normal TextMatcher, the BigTextMatcher is designed for large corpora.
For extended examples of usage, see the BigTextMatcherTestSpec.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: BigTextMatcher | Scala API: BigTextMatcher | Source: BigTextMatcher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import ReadAs
from pyspark.ml import Pipeline
# In this example, the entities file is of the form
#
# ...
# dolore magna aliqua
# lorem ipsum dolor. sit
# laborum
# ...
#
# where each line represents an entity phrase to be extracted.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
entityExtractor = BigTextMatcher() \
.setInputCols("document", "token") \
.setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
.setOutputCol("entity") \
.setCaseSensitive(False)
pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity)").show(truncate=False)
+--------------------------------------------------------------------+
|col |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []] |
+--------------------------------------------------------------------+
// In this example, the entities file is of the form
//
// ...
// dolore magna aliqua
// lorem ipsum dolor. sit
// laborum
// ...
//
// where each line represents an entity phrase to be extracted.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.BigTextMatcher
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
val entityExtractor = new BigTextMatcher()
.setInputCols("document", "token")
.setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
.setOutputCol("entity")
.setCaseSensitive(false)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity)").show(false)
+--------------------------------------------------------------------+
|col |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []] |
+--------------------------------------------------------------------+
Chunk2Doc
Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.
Input Annotator Types: CHUNK
Output Annotator Type: DOCUMENT
Python API: Chunk2Doc | Scala API: Chunk2Doc | Source: Chunk2Doc |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
# Location entities are extracted and converted back into `DOCUMENT` type for further processing
data = spark.createDataFrame([[1, "New York and New Jersey aren't that far apart actually."]]).toDF("id", "text")
# Extracts Named Entities amongst other things
pipeline = PretrainedPipeline("explain_document_dl")
chunkToDoc = Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
explainResult = pipeline.transform(data)
result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(truncate=False)
+------------------------------------------------------------------------------+
|col |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []] |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
// Location entities are extracted and converted back into `DOCUMENT` type for further processing
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Chunk2Doc
val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")
// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")
val chunkToDoc = new Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
val explainResult = pipeline.transform(data)
val result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(false)
+------------------------------------------------------------------------------+
|col |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []] |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
ChunkEmbeddings
This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs.
For extended examples of usage, see the Examples and the ChunkEmbeddingsTestSpec.
Input Annotator Types: CHUNK, WORD_EMBEDDINGS
Output Annotator Type: WORD_EMBEDDINGS
Python API: ChunkEmbeddings | Scala API: ChunkEmbeddings | Source: ChunkEmbeddings |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# Extract the Embeddings from the NGrams
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
nGrams = NGramGenerator() \
.setInputCols(["token"]) \
.setOutputCol("chunk") \
.setN(2)
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
# Convert the NGram chunks into Word Embeddings
chunkEmbeddings = ChunkEmbeddings() \
.setInputCols(["chunk", "embeddings"]) \
.setOutputCol("chunk_embeddings") \
.setPoolingStrategy("AVERAGE")
pipeline = Pipeline() \
.setStages([
documentAssembler,
sentence,
tokenizer,
nGrams,
embeddings,
chunkEmbeddings
])
data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk_embeddings) as result") \
.select("result.annotatorType", "result.result", "result.embeddings") \
.show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
| annotatorType| result| embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings| This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings| is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.{NGramGenerator, Tokenizer}
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.embeddings.ChunkEmbeddings
import org.apache.spark.ml.Pipeline
// Extract the Embeddings from the NGrams
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val nGrams = new NGramGenerator()
.setInputCols("token")
.setOutputCol("chunk")
.setN(2)
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
.setCaseSensitive(false)
// Convert the NGram chunks into Word Embeddings
val chunkEmbeddings = new ChunkEmbeddings()
.setInputCols("chunk", "embeddings")
.setOutputCol("chunk_embeddings")
.setPoolingStrategy("AVERAGE")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentence,
tokenizer,
nGrams,
embeddings,
chunkEmbeddings
))
val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk_embeddings) as result")
.select("result.annotatorType", "result.result", "result.embeddings")
.show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
| annotatorType| result| embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings| This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings| is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+
ChunkTokenizer
Tokenizes and flattens extracted NER chunks.
The ChunkTokenizer will split the extracted NER CHUNK type Annotations and will create TOKEN type Annotations. The result is then flattened, resulting in a single array.
For extended examples of usage, see the ChunkTokenizerTestSpec.
Input Annotator Types: CHUNK
Output Annotator Type: TOKEN
Python API: ChunkTokenizer | Scala API: ChunkTokenizer | Source: ChunkTokenizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import ReadAs
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
entityExtractor = TextMatcher() \
.setInputCols(["sentence", "token"]) \
.setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT) \
.setOutputCol("entity")
chunkTokenizer = ChunkTokenizer() \
.setInputCols(["entity"]) \
.setOutputCol("chunk_token")
pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
tokenizer,
entityExtractor,
chunkTokenizer
])
data = spark.createDataFrame([
    ["Hello world, my name is Michael, I am an artist and I work at Benezar"],
    ["Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."]
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("entity.result as entity" , "chunk_token.result as chunk_token").show(truncate=False)
+-----------------------------------------------+---------------------------------------------------+
|entity |chunk_token |
+-----------------------------------------------+---------------------------------------------------+
|[world, Michael, work at Benezar] |[world, Michael, work, at, Benezar] |
|[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
+-----------------------------------------------+---------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{ChunkTokenizer, TextMatcher, Tokenizer}
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val entityExtractor = new TextMatcher()
.setInputCols("sentence", "token")
.setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT)
.setOutputCol("entity")
val chunkTokenizer = new ChunkTokenizer()
.setInputCols("entity")
.setOutputCol("chunk_token")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
entityExtractor,
chunkTokenizer
))
val data = Seq(
"Hello world, my name is Michael, I am an artist and I work at Benezar",
"Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("entity.result as entity" , "chunk_token.result as chunk_token").show(false)
+-----------------------------------------------+---------------------------------------------------+
|entity |chunk_token |
+-----------------------------------------------+---------------------------------------------------+
|[world, Michael, work at Benezar] |[world, Michael, work, at, Benezar] |
|[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
+-----------------------------------------------+---------------------------------------------------+
Chunker
This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document.
Extracted part-of-speech tags are mapped onto the sentence, which can then be parsed by regular expressions. The part-of-speech tags are wrapped by angle brackets <> so they are easily distinguishable in the text itself.
This example sentence will result in the form:
"Peter Pipers employees are picking pecks of pickled peppers."
"<NNP><NNP><NNS><VBP><VBG><NNS><IN><JJ><NNS><.>"
To then extract these tags, the regexParsers need to be set, e.g.:
val chunker = new Chunker()
.setInputCols("sentence", "pos")
.setOutputCol("chunk")
.setRegexParsers(Array("<NNP>+", "<NNS>+"))
When defining the regular expressions, tags enclosed in angle brackets are treated as groups, so here specifically "<NNP>+" means one or more nouns in succession. Additional patterns can also be set with addRegexParsers.
For more extended examples, see the Examples and the ChunkerTestSpec.
Input Annotator Types: DOCUMENT, POS
Output Annotator Type: CHUNK
Python API: Chunker | Scala API: Chunker | Source: Chunker |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols("document") \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
POSTag = PerceptronModel.pretrained() \
.setInputCols("document", "token") \
.setOutputCol("pos")
chunker = Chunker() \
.setInputCols("sentence", "pos") \
.setOutputCol("chunk") \
.setRegexParsers(["<NNP>+", "<NNS>+"])
pipeline = Pipeline() \
.setStages([
documentAssembler,
sentence,
tokenizer,
POSTag,
chunker
])
data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk) as result").show(truncate=False)
+-------------------------------------------------------------+
|result |
+-------------------------------------------------------------+
|[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
|[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []] |
|[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []] |
|[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []] |
+-------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Chunker, Tokenizer}
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val POSTag = PerceptronModel.pretrained()
.setInputCols("document", "token")
.setOutputCol("pos")
val chunker = new Chunker()
.setInputCols("sentence", "pos")
.setOutputCol("chunk")
.setRegexParsers(Array("<NNP>+", "<NNS>+"))
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentence,
tokenizer,
POSTag,
chunker
))
val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk) as result").show(false)
+-------------------------------------------------------------+
|result |
+-------------------------------------------------------------+
|[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
|[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []] |
|[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []] |
|[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []] |
+-------------------------------------------------------------+
ClassifierDL
Trains a ClassifierDL for generic Multi-class Text Classification.
ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 100 classes.
For instantiated/pretrained models, see ClassifierDLModel.
Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example shows how to create the test dataset:
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
.fit(test)
.transform(test)
.write
.mode("overwrite")
.parquet("test_data")
val classifier = new ClassifierDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("category")
.setLabelColumn("label")
.setTestDataset("test_data")
For extended examples of usage, see the Examples [1] [2] and the ClassifierDLTestSpec.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: CATEGORY
Note: This annotator accepts a label column of a single item of type String, Int, Float, or Double. UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol.
Python API: ClassifierDLApproach | Scala API: ClassifierDLApproach | Source: ClassifierDLApproach |
Show Example
# In this example, the training data `"sentiment.csv"` has the form of
#
# text,label
# This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
# This was a terrible movie! The acting was bad really bad!,1
# ...
#
# Then training can be done like so:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
smallCorpus = spark.read.option("header","True").csv("src/test/resources/classifier/sentiment.csv")
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
useEmbeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = ClassifierDLApproach() \
.setInputCols("sentence_embeddings") \
.setOutputCol("category") \
.setLabelColumn("label") \
.setBatchSize(64) \
.setMaxEpochs(20) \
.setLr(5e-3) \
.setDropout(0.5)
pipeline = Pipeline() \
.setStages(
[
documentAssembler,
useEmbeddings,
docClassifier
]
)
pipelineModel = pipeline.fit(smallCorpus)
// In this example, the training data `"sentiment.csv"` has the form of
//
// text,label
// This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
// This was a terrible movie! The acting was bad really bad!,1
// ...
//
// Then training can be done like so:
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach
import org.apache.spark.ml.Pipeline
val smallCorpus = spark.read.option("header","true").csv("src/test/resources/classifier/sentiment.csv")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val useEmbeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val docClassifier = new ClassifierDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("category")
.setLabelColumn("label")
.setBatchSize(64)
.setMaxEpochs(20)
.setLr(5e-3f)
.setDropout(0.5f)
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
useEmbeddings,
docClassifier
)
)
val pipelineModel = pipeline.fit(smallCorpus)
ContextSpellChecker
Trains a deep-learning based Noisy Channel Model Spell Algorithm. Correction candidates are extracted combining context information and word information.
For instantiated/pretrained models, see ContextSpellCheckerModel.
Spell checking is a sequence-to-sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:
- Different correction candidates for each word — word level.
- The surrounding text of each word, i.e. its context — sentence level.
- The relative cost of different correction candidates according to the edit operations at the character level it requires — subword level.
For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.
For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language, the Examples and the ContextSpellCheckerTestSpec.
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: ContextSpellCheckerApproach | Scala API: ContextSpellCheckerApproach | Source: ContextSpellCheckerApproach |
Show Example
# For this example, we use the first Sherlock Holmes book as the training dataset.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
spellChecker = ContextSpellCheckerApproach() \
.setInputCols("token") \
.setOutputCol("corrected") \
.setWordMaxDistance(3) \
.setBatchSize(24) \
.setEpochs(8) \
.setLanguageModelClasses(1650) # dependent on vocabulary size
# .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
path = "sherlockholmes.txt"
dataset = spark.read.text(path) \
.toDF("text")
pipelineModel = pipeline.fit(dataset)
// For this example, we use the first Sherlock Holmes book as the training dataset.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val spellChecker = new ContextSpellCheckerApproach()
.setInputCols("token")
.setOutputCol("corrected")
.setWordMaxDistance(3)
.setBatchSize(24)
.setEpochs(8)
.setLanguageModelClasses(1650) // dependent on vocabulary size
// .addVocabClass("_NAME_", names) // Extra classes for correction could be added like this
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
spellChecker
))
val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
.toDF("text")
val pipelineModel = pipeline.fit(dataset)
Date2Chunk
Converts DATE type Annotations to CHUNK type.
This can be useful if annotators following DateMatcher and MultiDateMatcher require CHUNK types. The entity name in the metadata can be changed with setEntityName.
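For example, a minimal Python sketch of overriding the metadata entity name (the name "my_date" is only illustrative):
from sparknlp.annotator import Date2Chunk

date2Chunk = Date2Chunk() \
    .setInputCols(["date"]) \
    .setOutputCol("date_chunk") \
    .setEntityName("my_date")  # written to the chunk metadata instead of the default entity name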
Input Annotator Types: DATE
Output Annotator Type: CHUNK
Python API: Date2Chunk | Scala API: Date2Chunk | Source: Date2Chunk |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
date = DateMatcher() \
.setInputCols(["document"]) \
.setOutputCol("date")
date2Chunk = Date2Chunk() \
.setInputCols(["date"]) \
.setOutputCol("date_chunk")
pipeline = Pipeline().setStages([
documentAssembler,
date,
date2Chunk
])
data = spark.createDataFrame([["Omicron is a new variant of COVID-19, which the World Health Organization designated a variant of concern on Nov. 26, 2021/26/11."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("date_chunk").show(1, truncate=False)
+----------------------------------------------------+
|date_chunk |
+----------------------------------------------------+
|[{chunk, 118, 121, 2021/01/01, {sentence -> 0}, []}]|
+----------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val inputFormats = Array("yyyy", "yyyy/dd/MM", "MM/yyyy", "yyyy")
val outputFormat = "yyyy/MM/dd"
val date = new DateMatcher()
.setInputCols("document")
.setOutputCol("date")
val date2Chunk = new Date2Chunk()
.setInputCols("date")
.setOutputCol("date_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
date,
date2Chunk
))
val data = Seq(
"""Omicron is a new variant of COVID-19, which the World Health Organization designated a variant of concern on Nov. 26, 2021/26/11.""",
"""Neighbouring Austria has already locked down its population this week for at until 2021/10/12, becoming the first to reimpose such restrictions."""
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("date_chunk").show(false)
+----------------------------------------------------+
|date_chunk                                          |
+----------------------------------------------------+
|[{chunk, 118, 121, 2021/01/01, {sentence -> 0}, []}]|
|[{chunk, 83, 86, 2021/01/01, {sentence -> 0}, []}]  |
+----------------------------------------------------+
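As noted above, the entity name stored in the chunk metadata can be customized with setEntityName. A minimal Python sketch; the value "DATE_ENTITY" is illustrative:
# Hedged sketch: override the entity name written into the CHUNK metadata.
date2Chunk = Date2Chunk() \
    .setInputCols(["date"]) \
    .setOutputCol("date_chunk") \
    .setEntityName("DATE_ENTITY")  # illustrative value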
DateMatcher
Matches standard date formats into a provided format.
Reads from different forms of date and time expressions and converts them to a provided date format.
Extracts only one date per document. Use with sentence detector to find matches in each sentence. To extract multiple dates from a document, please use the MultiDateMatcher.
Reads the following kind of dates:
"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
"Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
"last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
"next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
"at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"
For example "The 31st of April in the year 2008"
will be converted into 2008/04/31
.
Pretrained pipelines are available for this module, see Pipelines.
For extended examples of usage, see the Examples and the DateMatcherTestSpec.
Input Annotator Types: DOCUMENT
Output Annotator Type: DATE
Python API: DateMatcher | Scala API: DateMatcher | Source: DateMatcher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
date = DateMatcher() \
.setInputCols("document") \
.setOutputCol("date") \
.setAnchorDateYear(2020) \
.setAnchorDateMonth(1) \
.setAnchorDateDay(11) \
.setDateFormat("yyyy/MM/dd")
pipeline = Pipeline().setStages([
documentAssembler,
date
])
data = spark.createDataFrame([["Fri, 21 Nov 1997"], ["next week at 7.30"], ["see you a day after"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("date").show(truncate=False)
+-------------------------------------------------+
|date |
+-------------------------------------------------+
|[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
|[[date, 0, 8, 2020/01/18, [sentence -> 0], []]] |
|[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
+-------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.DateMatcher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val date = new DateMatcher()
.setInputCols("document")
.setOutputCol("date")
.setAnchorDateYear(2020)
.setAnchorDateMonth(1)
.setAnchorDateDay(11)
.setDateFormat("yyyy/MM/dd")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
date
))
val data = Seq("Fri, 21 Nov 1997", "next week at 7.30", "see you a day after").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("date").show(false)
+-------------------------------------------------+
|date |
+-------------------------------------------------+
|[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
|[[date, 0, 8, 2020/01/18, [sentence -> 0], []]] |
|[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
+-------------------------------------------------+
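Since DateMatcher returns at most one date per document, MultiDateMatcher is the drop-in choice when several dates per document are expected. A minimal Python sketch of the swap, reusing the pipeline stages from the Python example above:
# Hedged sketch: MultiDateMatcher consumes DOCUMENT and produces DATE annotations
# just like DateMatcher, but keeps every date found in the document.
multiDate = MultiDateMatcher() \
    .setInputCols(["document"]) \
    .setOutputCol("date")
pipeline = Pipeline().setStages([documentAssembler, multiDate])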
DependencyParser
Trains an unlabeled parser that finds grammatical relations between two words in a sentence.
For instantiated/pretrained models, see DependencyParserModel.
Dependency parser provides information about word relationships. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions.
The required training data can be set in two different ways (only one can be chosen for a particular model):
- Dependency treebank in the Penn Treebank format set with setDependencyTreeBank
- Dataset in the CoNLL-U format set with setConllU (see the sketch after the example below)
Apart from that, no additional training data is needed.
See DependencyParserApproachTestSpec for further reference on how to use this API.
Input Annotator Types: DOCUMENT, POS, TOKEN
Output Annotator Type: DEPENDENCY
Python API: DependencyParserApproach | Scala API: DependencyParserApproach | Source: DependencyParserApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols("document") \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
posTagger = PerceptronModel.pretrained() \
.setInputCols("sentence", "token") \
.setOutputCol("pos")
dependencyParserApproach = DependencyParserApproach() \
.setInputCols("sentence", "pos", "token") \
.setOutputCol("dependency") \
.setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParserApproach
])
# Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("pos")
val dependencyParserApproach = new DependencyParserApproach()
.setInputCols("sentence", "pos", "token")
.setOutputCol("dependency")
.setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParserApproach
))
// Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
val emptyDataSet = Seq.empty[String].toDF("text")
val pipelineModel = pipeline.fit(emptyDataSet)
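As listed above, the training data can also be supplied in the CoNLL-U format through setConllU instead of setDependencyTreeBank (only one of the two may be set). A minimal Python sketch; the file path is illustrative:
# Hedged sketch: train the unlabeled parser from a CoNLL-U file.
dependencyParserApproach = DependencyParserApproach() \
    .setInputCols(["sentence", "pos", "token"]) \
    .setOutputCol("dependency") \
    .setConllU("path/to/train.conllu")  # illustrative path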
Doc2Chunk
Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.
Chunk text must be contained within input DOCUMENT. May be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.
Input Annotator Types: DOCUMENT
Output Annotator Type: CHUNK
Python API: Doc2Chunk | Scala API: Doc2Chunk | Source: Doc2Chunk |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
chunkAssembler = Doc2Chunk() \
.setInputCols("document") \
.setChunkCol("target") \
.setOutputCol("chunk") \
.setIsArray(True)
data = spark.createDataFrame([[
"Spark NLP is an open-source text processing library for advanced natural language processing.",
["Spark NLP", "text processing library", "natural language processing"]
]]).toDF("text", "target")
pipeline = Pipeline().setStages([documentAssembler, chunkAssembler]).fit(data)
result = pipeline.transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val chunkAssembler = new Doc2Chunk()
.setInputCols("document")
.setChunkCol("target")
.setOutputCol("chunk")
.setIsArray(true)
val data = Seq(
("Spark NLP is an open-source text processing library for advanced natural language processing.",
Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")
val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
val result = pipeline.transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
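When the chunk column holds a single string rather than an array, setIsArray is simply left disabled. A minimal Python sketch with a StringType target column:
# Hedged sketch: chunkCol given as a plain StringType column.
data = spark.createDataFrame([[
    "Spark NLP is an open-source text processing library.",
    "Spark NLP"
]]).toDF("text", "target")
chunkAssembler = Doc2Chunk() \
    .setInputCols("document") \
    .setChunkCol("target") \
    .setOutputCol("chunk") \
    .setIsArray(False)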
Doc2Vec
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
We use the Word2Vec implementation from Spark ML. It uses the skip-gram model and hierarchical softmax to train the model. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Doc2VecModel.
Sources :
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Input Annotator Types: TOKEN
Output Annotator Type: SENTENCE_EMBEDDINGS
Python API: Doc2VecApproach | Scala API: Doc2VecApproach | Source: Doc2VecApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = Doc2VecApproach() \
.setInputCols(["token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings
])
path = "sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")
pipelineModel = pipeline.fit(dataset)
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Doc2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = new Doc2VecApproach()
.setInputCols("token")
.setOutputCol("embeddings")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings
))
val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
.toDF("text")
val pipelineModel = pipeline.fit(dataset)
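Training can be tuned with the Word2Vec-style parameters exposed by Doc2VecApproach. A minimal Python sketch; the values are illustrative, not recommendations:
# Hedged sketch: commonly adjusted Doc2VecApproach parameters (illustrative values).
embeddings = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings") \
    .setVectorSize(100) \
    .setWindowSize(5) \
    .setMinCount(1) \
    .setMaxIter(1)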
DocumentAssembler
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline.
The DocumentAssembler reads String columns. Additionally, setCleanupMode can be used to pre-process the text (default: disabled). For possible options please refer to the Parameters section.
For more extended examples on document pre-processing see the Examples.
Input Annotator Types: NONE
Output Annotator Type: DOCUMENT
Python API: DocumentAssembler | Scala API: DocumentAssembler | Source: DocumentAssembler |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
result = documentAssembler.transform(data)
result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
result.select("document").printSchema
root
|-- document: array (nullable = True)
| |-- element: struct (containsNull = True)
| | |-- annotatorType: string (nullable = True)
| | |-- begin: integer (nullable = False)
| | |-- end: integer (nullable = False)
| | |-- result: string (nullable = True)
| | |-- metadata: map (nullable = True)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = True)
| | |-- embeddings: array (nullable = True)
| | | |-- element: float (containsNull = False)
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val result = documentAssembler.transform(data)
result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
result.select("document").printSchema
root
|-- document: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
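The setCleanupMode option mentioned above can pre-process the text while assembling; for instance, the "shrink" mode collapses extra whitespace and newlines. A minimal Python sketch, assuming the "shrink" mode is available in your version:
# Hedged sketch: enable cleanup while assembling documents ("shrink" collapses extra whitespace).
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")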
DocumentCharacterTextSplitter
Annotator which splits large documents into chunks of roughly given size.
DocumentCharacterTextSplitter takes a list of separators. It takes the separators in order and splits subtexts if they are over the chunk length, considering optional overlap of the chunks.
For example, given chunk size 20 and overlap 5:
"He was, I take it, the most perfect reasoning and observing machine that the world has seen."
["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]
Additionally, you can set
- custom patterns with setSplitPatterns
- whether patterns should be interpreted as regex with setPatternsAreRegex
- whether to keep the separators with setKeepSeparators
- whether to trim whitespaces with setTrimWhitespace
- whether to explode the splits to individual rows with setExplodeSplits
For extended examples of usage, see the DocumentCharacterTextSplitterTest.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: DocumentCharacterTextSplitter | Scala API: DocumentCharacterTextSplitter | Source: DocumentCharacterTextSplitter |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
textDF = spark.read.text(
"sherlockholmes.txt",
wholetext=True
).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text")
textSplitter = DocumentCharacterTextSplitter() \
.setInputCols(["document"]) \
.setOutputCol("splits") \
.setChunkSize(20000) \
.setChunkOverlap(200) \
.setExplodeSplits(True)
pipeline = Pipeline().setStages([documentAssembler, textSplitter])
result = pipeline.fit(textDF).transform(textDF)
result.selectExpr(
"splits.result",
"splits[0].begin",
"splits[0].end",
"splits[0].end - splits[0].begin as length") \
.show(8, truncate = 80)
+--------------------------------------------------------------------------------+---------------+-------------+------+
| result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...| 0| 19994| 19994|
|["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...| 19798| 39395| 19597|
|["How did that help you?"\n\n"It was all-important. When a woman thinks that ...| 39371| 59242| 19871|
|["'But,' said I, 'there would be millions of red-headed men who\nwould apply....| 59166| 77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a\nvery capab...| 77835| 97769| 19934|
|["And yet I am not convinced of it," I answered. "The cases which\ncome to li...| 97771| 117248| 19477|
|["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...| 117250| 137242| 19992|
|["That sounds a little paradoxical."\n\n"But it is profoundly True. Singulari...| 137244| 157171| 19927|
+--------------------------------------------------------------------------------+---------------+-------------+------+
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.DocumentAssembler
import org.apache.spark.ml.Pipeline
val textDF =
spark.read
.option("wholetext", "true")
.text("src/test/resources/spell/sherlockholmes.txt")
.toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text")
val textSplitter = new DocumentCharacterTextSplitter()
.setInputCols("document")
.setOutputCol("splits")
.setChunkSize(20000)
.setChunkOverlap(200)
.setExplodeSplits(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
val result = pipeline.fit(textDF).transform(textDF)
result
.selectExpr(
"splits.result",
"splits[0].begin",
"splits[0].end",
"splits[0].end - splits[0].begin as length")
.show(8, truncate = 80)
+--------------------------------------------------------------------------------+---------------+-------------+------+
| result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...| 0| 19994| 19994|
|["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...| 19798| 39395| 19597|
|["How did that help you?"\n\n"It was all-important. When a woman thinks that ...| 39371| 59242| 19871|
|["'But,' said I, 'there would be millions of red-headed men who\nwould apply....| 59166| 77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a\nvery capab...| 77835| 97769| 19934|
|["And yet I am not convinced of it," I answered. "The cases which\ncome to li...| 97771| 117248| 19477|
|["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...| 117250| 137242| 19992|
|["That sounds a little paradoxical."\n\n"But it is profoundly true. Singulari...| 137244| 157171| 19927|
+--------------------------------------------------------------------------------+---------------+-------------+------+
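The splitting behaviour can be adjusted with the setters listed in the description above. A minimal Python sketch using literal separator patterns and keeping the separators in the output; all values are illustrative:
# Hedged sketch: literal (non-regex) separators, separators kept, whitespace trimmed.
textSplitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(1000) \
    .setChunkOverlap(100) \
    .setSplitPatterns(["\n\n", "\n", " "]) \
    .setPatternsAreRegex(False) \
    .setKeepSeparators(True) \
    .setTrimWhitespace(True)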
DocumentNormalizer
Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Unwanted character removal can be applied with a specific policy, and lowercase normalization can optionally be applied.
For extended examples of usage, see the Examples.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: DocumentNormalizer | Scala API: DocumentNormalizer | Source: DocumentNormalizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
cleanUpPatterns = ["<[^>]*>"]
documentNormalizer = DocumentNormalizer() \
.setInputCols("document") \
.setOutputCol("normalizedDocument") \
.setAction("clean") \
.setPatterns(cleanUpPatterns) \
.setReplacement(" ") \
.setPolicy("pretty_all") \
.setLowercase(True)
pipeline = Pipeline().setStages([
documentAssembler,
documentNormalizer
])
text = """
<div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
THE WORLD'S LARGEST WEB DEVELOPER SITE
<h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
<p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
</div>
</div>"""
data = spark.createDataFrame([[text]]).toDF("text")
pipelineModel = pipeline.fit(data)
result = pipelineModel.transform(data)
result.selectExpr("normalizedDocument.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.DocumentNormalizer
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val cleanUpPatterns = Array("<[^>]*>")
val documentNormalizer = new DocumentNormalizer()
.setInputCols("document")
.setOutputCol("normalizedDocument")
.setAction("clean")
.setPatterns(cleanUpPatterns)
.setReplacement(" ")
.setPolicy("pretty_all")
.setLowercase(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
documentNormalizer
))
val text =
"""
<div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
THE WORLD'S LARGEST WEB DEVELOPER SITE
<h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
<p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
</div>
</div>"""
val data = Seq(text).toDF("text")
val pipelineModel = pipeline.fit(data)
val result = pipelineModel.transform(data)
result.selectExpr("normalizedDocument.result").show(truncate=false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
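The same pattern-and-policy mechanism works for other kinds of unwanted content, e.g. stripping e-mail addresses instead of markup. A minimal Python sketch; the regex is illustrative:
# Hedged sketch: remove e-mail addresses with the "clean" action, keeping the rest of the text.
emailPatterns = [r"[\w.+-]+@[\w-]+\.[\w.-]+"]
documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(emailPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(False)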
DocumentSimilarityRanker
Annotator that uses LSH techniques present in Spark ML lib to execute approximate nearest neighbors search on top of sentence embeddings.
It aims to capture the semantic meaning of a document in a dense, continuous vector space and return it to the ranker search.
For instantiated/pretrained models, see DocumentSimilarityRankerModel.
For extended examples of usage, see the jupyter notebook Document Similarity Ranker for Spark NLP.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: DOC_SIMILARITY_RANKINGS
Python API: DocumentSimilarityRankerApproach | Scala API: DocumentSimilarityRankerApproach | Source: DocumentSimilarityRankerApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.annotator.similarity.document_similarity_ranker import *
document_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_embeddings = E5Embeddings.pretrained() \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
document_similarity_ranker = DocumentSimilarityRankerApproach() \
.setInputCols("sentence_embeddings") \
.setOutputCol("doc_similarity_rankings") \
.setSimilarityMethod("brp") \
.setNumberOfNeighbours(1) \
.setBucketLength(2.0) \
.setNumHashTables(3) \
.setVisibleDistances(True) \
.setIdentityRanking(False)
document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
.setInputCols("doc_similarity_rankings") \
.setOutputCols(
"finished_doc_similarity_rankings_id",
"finished_doc_similarity_rankings_neighbors") \
.setExtractNearestNeighbor(True)
pipeline = Pipeline(stages=[
document_assembler,
sentence_embeddings,
document_similarity_ranker,
document_similarity_ranker_finisher
])
# Documents are coupled (1-2, 3-4, 5-6, 7-8) and were created to be similar on purpose.
data = spark.createDataFrame([
    ["First document, this is my first sentence. This is my second sentence."],
    ["Second document, this is my second sentence. This is my second sentence."],
    ["Third document, climate change is arguably one of the most pressing problems of our time."],
    ["Fourth document, climate change is definitely one of the most pressing problems of our time."],
    ["Fifth document, Florence in Italy, is among the most beautiful cities in Europe."],
    ["Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France."],
    ["Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France."],
    ["Eighth document, the warmest place in France is the French Riviera coast in Southern France."]
]).toDF("text")
docSimRankerPipeline = pipeline.fit(data).transform(data)
(
docSimRankerPipeline
.select(
"finished_doc_similarity_rankings_id",
"finished_doc_similarity_rankings_neighbors"
).show(10, False)
)
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612 |[(1634839239,0.12448559591306324)] |
|1634839239 |[(1510101612,0.12448559591306324)] |
|-612640902 |[(1274183715,0.1220122862046063)] |
|1274183715 |[(-612640902,0.1220122862046063)] |
|-1320876223 |[(1293373212,0.17848855164122393)] |
|1293373212 |[(-1320876223,0.17848855164122393)] |
|-1548374770 |[(-1719102856,0.23297156732534166)] |
|-1719102856 |[(-1548374770,0.23297156732534166)] |
+-----------------------------------+------------------------------------------+
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.similarity.DocumentSimilarityRankerApproach
import com.johnsnowlabs.nlp.finisher.DocumentSimilarityRankerFinisher
import org.apache.spark.ml.Pipeline
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceEmbeddings = RoBertaSentenceEmbeddings
.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val documentSimilarityRanker = new DocumentSimilarityRankerApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("doc_similarity_rankings")
.setSimilarityMethod("brp")
.setNumberOfNeighbours(1)
.setBucketLength(2.0)
.setNumHashTables(3)
.setVisibleDistances(true)
.setIdentityRanking(false)
val documentSimilarityRankerFinisher = new DocumentSimilarityRankerFinisher()
.setInputCols("doc_similarity_rankings")
.setOutputCols(
"finished_doc_similarity_rankings_id",
"finished_doc_similarity_rankings_neighbors")
.setExtractNearestNeighbor(true)
// Let's use a dataset where we can visually control similarity
// Documents are coupled, as 1-2, 3-4, 5-6, 7-8, and they were created to be similar on purpose
val data = Seq(
"First document, this is my first sentence. This is my second sentence.",
"Second document, this is my second sentence. This is my second sentence.",
"Third document, climate change is arguably one of the most pressing problems of our time.",
"Fourth document, climate change is definitely one of the most pressing problems of our time.",
"Fifth document, Florence in Italy, is among the most beautiful cities in Europe.",
"Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France.",
"Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France.",
"Eighth document, the warmest place in France is the French Riviera coast in Southern France.")
.toDF("text")
val pipeline = new Pipeline().setStages(
Array(
documentAssembler,
sentenceEmbeddings,
documentSimilarityRanker,
documentSimilarityRankerFinisher))
val result = pipeline.fit(data).transform(data)
result
.select("finished_doc_similarity_rankings_id", "finished_doc_similarity_rankings_neighbors")
.show(10, truncate = false)
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612 |[(1634839239,0.12448559591306324)] |
|1634839239 |[(1510101612,0.12448559591306324)] |
|-612640902 |[(1274183715,0.1220122862046063)] |
|1274183715 |[(-612640902,0.1220122862046063)] |
|-1320876223 |[(1293373212,0.17848855164122393)] |
|1293373212 |[(-1320876223,0.17848855164122393)] |
|-1548374770 |[(-1719102856,0.23297156732534166)] |
|-1719102856 |[(-1548374770,0.23297156732534166)] |
+-----------------------------------+------------------------------------------+
DocumentTokenSplitter
Annotator that splits large documents into smaller documents based on the number of tokens in the text.
Currently, DocumentTokenSplitter splits the text by whitespaces to create the tokens. The number of these tokens will then be used as a measure of the text length. In the future, other tokenization techniques will be supported.
For example, given 3 tokens and overlap 1:
"He was, I take it, the most perfect reasoning and observing machine that the world has seen."
["He was, I", "I take it,", "it, the most", "most perfect reasoning", "reasoning and observing", "observing machine that", "that the world", "world has seen."]
Additionally, you can set
- whether to trim whitespaces with setTrimWhitespace
- whether to explode the splits to individual rows with setExplodeSplits
For extended examples of usage, see the DocumentTokenSplitterTest.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: DocumentTokenSplitter | Scala API: DocumentTokenSplitter | Source: DocumentTokenSplitter |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
textDF = spark.read.text(
"sherlockholmes.txt",
wholetext=True
).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text")
textSplitter = DocumentTokenSplitter() \
.setInputCols(["document"]) \
.setOutputCol("splits") \
.setNumTokens(512) \
.setTokenOverlap(10) \
.setExplodeSplits(True)
pipeline = Pipeline().setStages([documentAssembler, textSplitter])
result = pipeline.fit(textDF).transform(textDF)
result.selectExpr(
"splits.result as result",
"splits[0].begin as begin",
"splits[0].end as end",
"splits[0].end - splits[0].begin as length",
"splits[0].metadata.numTokens as tokens") \
.show(8, truncate = 80)
+--------------------------------------------------------------------------------+-----+-----+------+------+
| result|begin| end|length|tokens|
+--------------------------------------------------------------------------------+-----+-----+------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...| 0| 3018| 3018| 512|
|[study of crime, and occupied his\nimmense faculties and extraordinary powers...| 2950| 5707| 2757| 512|
|[but as I have changed my clothes I can't imagine how you\ndeduce it. As to M...| 5659| 8483| 2824| 512|
|[quarters received. Be in your chamber then at that hour, and do\nnot take it...| 8427|11241| 2814| 512|
|[a pity\nto miss it."\n\n"But your client--"\n\n"Never mind him. I may want y...|11188|13970| 2782| 512|
|[person who employs me wishes his agent to be unknown to\nyou, and I may conf...|13918|16898| 2980| 512|
|[letters back."\n\n"Precisely so. But how--"\n\n"Was there a secret marriage?...|16836|19744| 2908| 512|
|[seven hundred in\nnotes," he said.\n\nHolmes scribbled a receipt upon a shee...|19683|22551| 2868| 512|
+--------------------------------------------------------------------------------+-----+-----+------+------+
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.DocumentAssembler
import org.apache.spark.ml.Pipeline
val textDF =
spark.read
.option("wholetext", "true")
.text("src/test/resources/spell/sherlockholmes.txt")
.toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text")
val textSplitter = new DocumentTokenSplitter()
.setInputCols("document")
.setOutputCol("splits")
.setNumTokens(512)
.setTokenOverlap(10)
.setExplodeSplits(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
val result = pipeline.fit(textDF).transform(textDF)
result
.selectExpr(
"splits.result as result",
"splits[0].begin as begin",
"splits[0].end as end",
"splits[0].end - splits[0].begin as length",
"splits[0].metadata.numTokens as tokens")
.show(8, truncate = 80)
+--------------------------------------------------------------------------------+-----+-----+------+------+
| result|begin| end|length|tokens|
+--------------------------------------------------------------------------------+-----+-----+------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...| 0| 3018| 3018| 512|
|[study of crime, and occupied his\nimmense faculties and extraordinary powers...| 2950| 5707| 2757| 512|
|[but as I have changed my clothes I can't imagine how you\ndeduce it. As to M...| 5659| 8483| 2824| 512|
|[quarters received. Be in your chamber then at that hour, and do\nnot take it...| 8427|11241| 2814| 512|
|[a pity\nto miss it."\n\n"But your client--"\n\n"Never mind him. I may want y...|11188|13970| 2782| 512|
|[person who employs me wishes his agent to be unknown to\nyou, and I may conf...|13918|16898| 2980| 512|
|[letters back."\n\n"Precisely so. But how--"\n\n"Was there a secret marriage?...|16836|19744| 2908| 512|
|[seven hundred in\nnotes," he said.\n\nHolmes scribbled a receipt upon a shee...|19683|22551| 2868| 512|
+--------------------------------------------------------------------------------+-----+-----+------+------+
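As with the character-based splitter, whitespace trimming and row explosion can be toggled. A minimal Python sketch that keeps all splits in a single array column; the sizes are illustrative:
# Hedged sketch: keep splits in one row instead of exploding them.
textSplitter = DocumentTokenSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setNumTokens(256) \
    .setTokenOverlap(16) \
    .setTrimWhitespace(True) \
    .setExplodeSplits(False)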
EmbeddingsFinisher
Extracts embeddings from Annotations into a more easily usable form.
This is useful, for example, for WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.
By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or vectors which are compatible with Spark ML functions such as LDA, K-means, Random Forest classifier or any other function that requires a featuresCol.
For more extended examples see the Examples.
Input Annotator Types: EMBEDDINGS
Output Annotator Type: NONE
Python API: EmbeddingsFinisher | Scala API: EmbeddingsFinisher | Source: EmbeddingsFinisher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
normalizer = Normalizer() \
.setInputCols("token") \
.setOutputCol("normalized")
stopwordsCleaner = StopWordsCleaner() \
.setInputCols("normalized") \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
gloveEmbeddings = WordEmbeddingsModel.pretrained() \
.setInputCols("document", "cleanTokens") \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols("embeddings") \
.setOutputCols("finished_sentence_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)
data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]) \
.toDF("text")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
normalizer,
stopwordsCleaner,
gloveEmbeddings,
embeddingsFinisher
]).fit(data)
result = pipeline.transform(data)
resultWithSize = result.selectExpr("explode(finished_sentence_embeddings) as embeddings")
resultWithSize.show(5, 80)
+--------------------------------------------------------------------------------+
| embeddings|
+--------------------------------------------------------------------------------+
|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel}
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
val stopwordsCleaner = new StopWordsCleaner()
.setInputCols("normalized")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val gloveEmbeddings = WordEmbeddingsModel.pretrained()
.setInputCols("document", "cleanTokens")
.setOutputCol("embeddings")
.setCaseSensitive(false)
val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_sentence_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)
val data = Seq("Spark NLP is an open-source text processing library.")
.toDF("text")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
normalizer,
stopwordsCleaner,
gloveEmbeddings,
embeddingsFinisher
)).fit(data)
val result = pipeline.transform(data)
val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)")
.map { row =>
val vector = row.getAs[org.apache.spark.ml.linalg.DenseVector](0)
(vector.size, vector)
}.toDF("size", "vector")
resultWithSize.show(5, 80)
+----+--------------------------------------------------------------------------------+
|size| vector|
+----+--------------------------------------------------------------------------------+
| 100|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
| 100|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
| 100|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
| 100|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
| 100|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+----+--------------------------------------------------------------------------------+
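Because setOutputAsVector(true) yields Spark ML vectors, the finished column can be passed directly to Spark ML estimators such as K-means. A minimal Python sketch continuing the Python example above; clustering a handful of token vectors is purely illustrative:
# Hedged sketch: feed the finished vectors into a Spark ML estimator.
from pyspark.ml.clustering import KMeans

features = result.selectExpr("explode(finished_sentence_embeddings) as features")
kmeans = KMeans(k=2, featuresCol="features", predictionCol="cluster")
kmeansModel = kmeans.fit(features)
kmeansModel.transform(features).select("cluster").show()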
EntityRuler
Fits an Annotator to match exact strings or regex patterns provided in a file against a Document and assigns them a named entity. The definitions can contain any number of named entities.
There are multiple ways and formats to set the extraction resource. It is possible to set it either as a “JSON”, “JSONL” or “CSV” file. A path to the file needs to be provided to setPatternsResource. The file format needs to be set as the “format” field in the option parameter map and, depending on the file type, additional parameters might need to be set.
If the file is in a JSON format, then the rule definitions need to be given in a list with the fields “id”, “label” and “patterns”:
[
{
"id": "person-regex",
"label": "PERSON",
"patterns": ["\\w+\\s\\w+", "\\w+-\\w+"]
},
{
"id": "locations-words",
"label": "LOCATION",
"patterns": ["Winterfell"]
}
]
The same fields also apply to a file in the JSONL format:
{"id": "names-with-j", "label": "PERSON", "patterns": ["Jon", "John", "John Snow"]}
{"id": "names-with-s", "label": "PERSON", "patterns": ["Stark", "Snow"]}
{"id": "names-with-e", "label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]}
In order to use a CSV file, an additional parameter “delimiter” needs to be set. In this case, the delimiter might be
set by using .setPatternsResource("patterns.csv", ReadAs.TEXT, Map("format"->"csv", "delimiter" -> "\\|"))
PERSON|Jon
PERSON|John
PERSON|John Snow
LOCATION|Winterfell
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: EntityRulerApproach | Scala API: EntityRulerApproach | Source: EntityRulerApproach |
Show Example
# In this example, the entities file has the form of
#
# PERSON|Jon
# PERSON|John
# PERSON|John Snow
# LOCATION|Winterfell
#
# where each line represents an entity and the associated string delimited by "|".
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
entityRuler = EntityRulerApproach() \
.setInputCols(["document", "token"]) \
.setOutputCol("entities") \
.setPatternsResource(
"patterns.csv",
ReadAs.TEXT,
{"format": "csv", "delimiter": "\\|"}
)
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
entityRuler
])
data = spark.createDataFrame([["Jon Snow wants to be lord of Winterfell."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(entities)").show(truncate=False)
+--------------------------------------------------------------------+
|col |
+--------------------------------------------------------------------+
|[chunk, 0, 2, Jon, [entity -> PERSON, sentence -> 0], []] |
|[chunk, 29, 38, Winterfell, [entity -> LOCATION, sentence -> 0], []]|
+--------------------------------------------------------------------+
// In this example, the entities file has the form of
//
// PERSON|Jon
// PERSON|John
// PERSON|John Snow
// LOCATION|Winterfell
//
// where each line represents an entity and the associated string delimited by "|".
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.er.EntityRulerApproach
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val entityRuler = new EntityRulerApproach()
.setInputCols("document", "token")
.setOutputCol("entities")
.setPatternsResource(
"src/test/resources/entity-ruler/patterns.csv",
ReadAs.TEXT,
{"format": "csv", "delimiter": "|")}
)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
entityRuler
))
val data = Seq("Jon Snow wants to be lord of Winterfell.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(entities)").show(false)
+--------------------------------------------------------------------+
|col |
+--------------------------------------------------------------------+
|[chunk, 0, 2, Jon, [entity -> PERSON, sentence -> 0], []] |
|[chunk, 29, 38, Winterfell, [entity -> LOCATION, sentence -> 0], []]|
+--------------------------------------------------------------------+
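The same pipeline works with the JSON and JSONL formats described above; only the setPatternsResource call changes. A minimal Python sketch, assuming a patterns.json file with the structure shown earlier (the path is illustrative):
# Hedged sketch: read the rules from a JSON file instead of a CSV file.
entityRuler = EntityRulerApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entities") \
    .setPatternsResource(
        "patterns.json",  # illustrative path
        ReadAs.TEXT,
        {"format": "json"}
    )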
Finisher
Converts annotation results into a format that is easier to use. It is useful for extracting the results from Spark NLP Pipelines. The Finisher outputs annotation values into Strings.
For more extended examples on document pre-processing see the Examples.
Input Annotator Types: ANY
Output Annotator Type: NONE
Python API: Finisher | Scala API: Finisher | Source: Finisher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
data = spark.createDataFrame([[1, "New York and New Jersey aren't that far apart actually."]]).toDF("id", "text")
# Extracts Named Entities amongst other things
pipeline = PretrainedPipeline("explain_document_dl")
finisher = Finisher().setInputCols("entities").setOutputCols("output")
explainResult = pipeline.transform(data)
explainResult.selectExpr("explode(entities)").show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------+
result = finisher.transform(explainResult)
result.select("output").show(truncate=False)
+----------------------+
|output |
+----------------------+
|[New York, New Jersey]|
+----------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Finisher
val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")
// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")
val finisher = new Finisher().setInputCols("entities").setOutputCols("output")
val explainResult = pipeline.transform(data)
explainResult.selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------+
val result = finisher.transform(explainResult)
result.select("output").show(false)
+----------------------+
|output |
+----------------------+
|[New York, New Jersey]|
+----------------------+
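The shape of the finished output can be adjusted with the Finisher parameters, for example keeping array output and attaching the annotation metadata. A minimal Python sketch; treat the combination of options as illustrative:
# Hedged sketch: array output with annotation metadata, original annotation columns kept.
finisher = Finisher() \
    .setInputCols(["entities"]) \
    .setOutputCols(["output"]) \
    .setOutputAsArray(True) \
    .setIncludeMetadata(True) \
    .setCleanAnnotations(False)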
GraphExtraction
Extracts a dependency graph between entities.
The GraphExtraction class takes e.g. extracted entities from a NerDLModel and creates a dependency tree which describes how the entities relate to each other. For that a triple store format is used. Nodes represent the entities and the edges represent the relations between those entities. The graph can then be used to find relevant relationships between words.
Both the DependencyParserModel and TypedDependencyParserModel need to be present in the pipeline. There are two ways to set them:
- Both Annotators are present in the pipeline already. The dependencies are taken implicitly from these two Annotators.
- Setting setMergeEntities to true will download the default pretrained models for those two Annotators automatically. The specific models can also be set with setDependencyParserModel and setTypedDependencyParserModel:
val graph_extraction = new GraphExtraction()
  .setInputCols("document", "token", "ner")
  .setOutputCol("graph")
  .setRelationshipTypes(Array("prefer-LOC"))
  .setMergeEntities(true)
  //.setDependencyParserModel(Array("dependency_conllu", "en", "public/models"))
  //.setTypedDependencyParserModel(Array("dependency_typed_conllu", "en", "public/models"))
To transform the resulting graph into a more generic form such as RDF, see the GraphFinisher.
Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY
Output Annotator Type: NODE
Python API: GraphExtraction | Scala API: GraphExtraction | Source: GraphExtraction |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
nerTagger = NerDLModel.pretrained() \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
posTagger = PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
dependencyParser = DependencyParserModel.pretrained() \
.setInputCols(["sentence", "pos", "token"]) \
.setOutputCol("dependency")
typedDependencyParser = TypedDependencyParserModel.pretrained() \
.setInputCols(["dependency", "pos", "token"]) \
.setOutputCol("dependency_type")
graph_extraction = GraphExtraction() \
.setInputCols(["document", "token", "ner"]) \
.setOutputCol("graph") \
.setRelationshipTypes(["prefer-LOC"])
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger,
posTagger,
dependencyParser,
typedDependencyParser,
graph_extraction
])
data = spark.createDataFrame([["You and John prefer the morning flight through Denver"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("graph").show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------+
|graph |
+-----------------------------------------------------------------------------------------------------------------+
|[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]|
+-----------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserModel
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.annotators.GraphExtraction
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val nerTagger = NerDLModel.pretrained()
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
val posTagger = PerceptronModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("pos")
val dependencyParser = DependencyParserModel.pretrained()
.setInputCols("sentence", "pos", "token")
.setOutputCol("dependency")
val typedDependencyParser = TypedDependencyParserModel.pretrained()
.setInputCols("dependency", "pos", "token")
.setOutputCol("dependency_type")
val graph_extraction = new GraphExtraction()
.setInputCols("document", "token", "ner")
.setOutputCol("graph")
.setRelationshipTypes(Array("prefer-LOC"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger,
posTagger,
dependencyParser,
typedDependencyParser,
graph_extraction
))
val data = Seq("You and John prefer the morning flight through Denver").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("graph").show(false)
+-----------------------------------------------------------------------------------------------------------------+
|graph |
+-----------------------------------------------------------------------------------------------------------------+
|[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]|
+-----------------------------------------------------------------------------------------------------------------+
GraphFinisher
Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF.
Input Annotator Types: NONE
Output Annotator Type: NONE
Python API: GraphFinisher | Scala API: GraphFinisher | Source: GraphFinisher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# This is a continuation of the example of
# GraphExtraction. To see how the graph is extracted, see the
# documentation of that class.
graphFinisher = GraphFinisher() \
.setInputCol("graph") \
.setOutputCol("graph_finished")
.setOutputAs[False]
finishedResult = graphFinisher.transform(result)
finishedResult.select("text", "graph_finished").show(truncate=False)
+-----------------------------------------------------+-----------------------------------------------------------------------+
|text |graph_finished |
+-----------------------------------------------------+-----------------------------------------------------------------------+
|You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
+-----------------------------------------------------+-----------------------------------------------------------------------+
// This is a continuation of the example of
// [[com.johnsnowlabs.nlp.annotators.GraphExtraction GraphExtraction]]. To see how the graph is extracted, see the
// documentation of that class.
import com.johnsnowlabs.nlp.GraphFinisher
val graphFinisher = new GraphFinisher()
.setInputCol("graph")
.setOutputCol("graph_finished")
.setOutputAsArray(false)
val finishedResult = graphFinisher.transform(result)
finishedResult.select("text", "graph_finished").show(false)
+-----------------------------------------------------+-----------------------------------------------------------------------+
|text |graph_finished |
+-----------------------------------------------------+-----------------------------------------------------------------------+
|You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
+-----------------------------------------------------+-----------------------------------------------------------------------+
ImageAssembler
Prepares images read by Spark into a format that is processable by Spark NLP. This component is needed to process images.
Input Annotator Types: NONE
Output Annotator Type: IMAGE
Python API: ImageAssembler | Scala API: ImageAssembler | Source: ImageAssembler |
Show Example
import sparknlp
from sparknlp.base import *
from pyspark.ml import Pipeline
data = spark.read.format("image").load("./tmp/images/").toDF("image")
imageAssembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")
result = imageAssembler.transform(data)
result.select("image_assembler").show()
result.select("image_assembler").printSchema()
root
|-- image_assembler: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- origin: string (nullable = true)
| | |-- height: integer (nullable = true)
| | |-- width: integer (nullable = true)
| | |-- nChannels: integer (nullable = true)
| | |-- mode: integer (nullable = true)
| | |-- result: binary (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
import com.johnsnowlabs.nlp.ImageAssembler
import org.apache.spark.ml.Pipeline
val imageDF: DataFrame = spark.read
.format("image")
.option("dropInvalid", value = true)
.load("src/test/resources/image/")
val imageAssembler = new ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
val pipeline = new Pipeline().setStages(Array(imageAssembler))
val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
pipelineDF.printSchema()
root
|-- image_assembler: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- origin: string (nullable = true)
| | |-- height: integer (nullable = false)
| | |-- width: integer (nullable = false)
| | |-- nChannels: integer (nullable = false)
| | |-- mode: integer (nullable = false)
| | |-- result: binary (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
LanguageDetectorDL
Language Identification and Detection by using CNN and RNN architectures in TensorFlow.
LanguageDetectorDL
is an annotator that detects the language of documents or sentences depending on the inputCols.
The models are trained on large datasets such as Wikipedia and Tatoeba.
Depending on the language (how similar the characters are), the LanguageDetectorDL works
best with text longer than 140 characters.
The output is a language code in Wiki Code style.
Pretrained models can be loaded with pretrained
of the companion object:
val languageDetector = LanguageDetectorDL.pretrained()
.setInputCols("sentence")
.setOutputCol("language")
The default model is "ld_wiki_tatoeba_cnn_21"
, default language is "xx"
(meaning multi-lingual),
if no values are provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and the LanguageDetectorDLTestSpec.
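As a rough sketch, predictions can also be tuned with a confidence threshold and by coalescing sentence-level results (the exact parameter set may differ between versions, so treat the setters below as an assumption and check the API reference):
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
// Sketch only: setThreshold and setCoalesceSentences are assumed to be available here.
val tunedLanguageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx")
  .setInputCols("document")
  .setOutputCol("language")
  .setThreshold(0.3f)          // drop predictions below this confidence
  .setCoalesceSentences(true)  // aggregate sentence-level predictions into one result per document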
Input Annotator Types: DOCUMENT
Output Annotator Type: LANGUAGE
Python API: LanguageDetectorDL | Scala API: LanguageDetectorDL | Source: LanguageDetectorDL |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
languageDetector = LanguageDetectorDL.pretrained() \
.setInputCols("document") \
.setOutputCol("language")
pipeline = Pipeline() \
.setStages([
documentAssembler,
languageDetector
])
data = spark.createDataFrame([
["Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages."],
["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."],
["Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."]
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("language.result").show(truncate=False)
+------+
|result|
+------+
|[en] |
|[fr] |
|[de] |
+------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val languageDetector = LanguageDetectorDL.pretrained()
.setInputCols("document")
.setOutputCol("language")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
languageDetector
))
val data = Seq(
"Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.",
"Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.",
"Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("language.result").show(false)
+------+
|result|
+------+
|[en] |
|[fr] |
|[de] |
+------+
Lemmatizer
Class to find lemmas out of words with the objective of returning a base dictionary word.
Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary
.
The dictionary can be set as a delimited text file.
Pretrained models can be loaded with LemmatizerModel.pretrained
.
For available pretrained models please see the Models Hub. For extended examples of usage, see the Examples.
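For reference, loading a pretrained model instead of training with a dictionary might look like the sketch below (the model name "lemma_antbnc" is given only as an example; pick a model from the Models Hub):
import com.johnsnowlabs.nlp.annotator.LemmatizerModel
// Sketch: a pretrained LemmatizerModel replaces the dictionary-trained Lemmatizer.
// "lemma_antbnc" is an assumed example model name.
val pretrainedLemmatizer = LemmatizerModel.pretrained("lemma_antbnc")
  .setInputCols("token")
  .setOutputCol("lemma")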
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: Lemmatizer | Scala API: Lemmatizer | Source: Lemmatizer |
Show Example
# In this example, the lemma dictionary `lemmas_small.txt` has the form of
#
# ...
# pick -> pick picks picking picked
# peck -> peck pecking pecked pecks
# pickle -> pickle pickles pickled pickling
# pepper -> pepper peppers peppered peppering
# ...
#
# where each key is delimited by `->` and values are delimited by `\t`
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
lemmatizer = Lemmatizer() \
.setInputCols(["token"]) \
.setOutputCol("lemma") \
.setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
pipeline = Pipeline() \
.setStages([
documentAssembler,
sentenceDetector,
tokenizer,
lemmatizer
])
data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
.toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(truncate=False)
+------------------------------------------------------------------+
|result |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
// In this example, the lemma dictionary `lemmas_small.txt` has the form of
//
// ...
// pick -> pick picks picking picked
// peck -> peck pecking pecked pecks
// pickle -> pickle pickles pickled pickling
// pepper -> pepper peppers peppered peppering
// ...
//
// where each key is delimited by `->` and values are delimited by `\t`
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Lemmatizer
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val lemmatizer = new Lemmatizer()
.setInputCols(Array("token"))
.setOutputCol("lemma")
.setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
lemmatizer
))
val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(false)
+------------------------------------------------------------------+
|result |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
MultiClassifierDL
Trains a MultiClassifierDL for Multi-label Text Classification.
MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes.
For instantiated/pretrained models, see MultiClassifierDLModel.
The input to MultiClassifierDL
are Sentence Embeddings such as the state-of-the-art
UniversalSentenceEncoder,
BertSentenceEmbeddings or
SentenceEmbeddings.
In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
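As a small illustration of the binary-vector formulation (plain Scala, not part of the Spark NLP API): given a fixed label vocabulary, the label set of one instance becomes a 0/1 vector.
// Illustrative sketch with hypothetical labels.
val labelVocabulary = Seq("toxic", "obscene", "insult")
val instanceLabels = Set("obscene", "insult")
val y = labelVocabulary.map(label => if (instanceLabels.contains(label)) 1 else 0)
// y == Seq(0, 1, 1)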
Setting a test dataset to monitor model metrics can be done with .setTestDataset
. The method
expects a path to a parquet file containing a dataframe that has the same required columns as
the training dataframe. The pre-processing steps for the training dataframe should also be
applied to the test dataframe. The following example will show how to create the test dataset:
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
.fit(test)
.transform(test)
.write
.mode("overwrite")
.parquet("test_data")
val multiClassifier = new MultiClassifierDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("category")
.setLabelColumn("label")
.setTestDataset("test_data")
For extended examples of usage, see the Examples and the MultiClassifierDLTestSpec.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: CATEGORY
Note: This annotator expects the label column to contain an array of String labels for each row (see the examples below). UniversalSentenceEncoder, BertSentenceEmbeddings, SentenceEmbeddings or other sentence based embeddings can be used for the inputCol
Python API: MultiClassifierDLApproach | Scala API: MultiClassifierDLApproach | Source: MultiClassifierDLApproach |
Show Example
# In this example, the training data has the form
#
# +----------------+--------------------+--------------------+
# | id| text| labels|
# +----------------+--------------------+--------------------+
# |ed58abb40640f983|PN NewsYou mean ... | [toxic]|
# |a1237f726b5f5d89|Dude. Place the ...| [obscene, insult]|
# |24b0d6c8733c2abe|Thanks - thanks ...| [insult]|
# |8c4478fb239bcfc0|" Gee, 5 minutes ...|[toxic, obscene, ...|
# +----------------+--------------------+--------------------+
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# Process training data to create text with associated array of labels
trainDataset.printSchema()
# root
# |-- id: string (nullable = true)
# |-- text: string (nullable = true)
# |-- labels: array (nullable = true)
# | |-- element: string (containsNull = true)
# Then create pipeline for training
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document") \
.setCleanupMode("shrink")
embeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols("document") \
.setOutputCol("embeddings")
docClassifier = MultiClassifierDLApproach() \
.setInputCols("embeddings") \
.setOutputCol("category") \
.setLabelColumn("labels") \
.setBatchSize(128) \
.setMaxEpochs(10) \
.setLr(1e-3) \
.setThreshold(0.5) \
.setValidationSplit(0.1)
pipeline = Pipeline() \
.setStages(
[
documentAssembler,
embeddings,
docClassifier
]
)
pipelineModel = pipeline.fit(trainDataset)
// In this example, the training data has the form (Note: labels can be arbitrary)
//
// mr,ref
// "name[Alimentum], area[city centre], familyFriendly[no], near[Burger King]",Alimentum is an adult establish found in the city centre area near Burger King.
// "name[Alimentum], area[city centre], familyFriendly[yes]",Alimentum is a family-friendly place in the city centre.
// ...
//
// It needs some pre-processing first, so the labels are of type `Array[String]`. This can be done like so:
import spark.implicits._
import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLApproach
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.{col, udf}
// Process training data to create text with associated array of labels
def splitAndTrim = udf { labels: String =>
labels.split(", ").map(x=>x.trim)
}
val smallCorpus = spark.read
.option("header", true)
.option("inferSchema", true)
.option("mode", "DROPMALFORMED")
.csv("src/test/resources/classifier/e2e.csv")
.withColumn("labels", splitAndTrim(col("mr")))
.withColumn("text", col("ref"))
.drop("mr")
smallCorpus.printSchema()
// root
// |-- ref: string (nullable = true)
// |-- labels: array (nullable = true)
// | |-- element: string (containsNull = true)
// Then create pipeline for training
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
.setCleanupMode("shrink")
val embeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
val docClassifier = new MultiClassifierDLApproach()
.setInputCols("embeddings")
.setOutputCol("category")
.setLabelColumn("labels")
.setBatchSize(128)
.setMaxEpochs(10)
.setLr(1e-3f)
.setThreshold(0.5f)
.setValidationSplit(0.1f)
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
embeddings,
docClassifier
)
)
val pipelineModel = pipeline.fit(smallCorpus)
MultiDateMatcher
Matches standard date formats into a provided format.
Reads the following kind of dates:
"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
"Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
"last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
"next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
"at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"
For example "The 31st of April in the year 2008"
will be converted into 2008/04/31
.
For extended examples of usage, see the Examples and the MultiDateMatcherTestSpec.
Input Annotator Types: DOCUMENT
Output Annotator Type: DATE
Python API: MultiDateMatcher | Scala API: MultiDateMatcher | Source: MultiDateMatcher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
date = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("date") \
.setAnchorDateYear(2020) \
.setAnchorDateMonth(1) \
.setAnchorDateDay(11) \
.setDateFormat("yyyy/MM/dd")
pipeline = Pipeline().setStages([
documentAssembler,
date
])
data = spark.createDataFrame([["I saw him yesterday and he told me that he will visit us next week"]]) \
.toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(date) as dates").show(truncate=False)
+-----------------------------------------------+
|dates |
+-----------------------------------------------+
|[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
|[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
+-----------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.MultiDateMatcher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val date = new MultiDateMatcher()
.setInputCols("document")
.setOutputCol("date")
.setAnchorDateYear(2020)
.setAnchorDateMonth(1)
.setAnchorDateDay(11)
.setDateFormat("yyyy/MM/dd")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
date
))
val data = Seq("I saw him yesterday and he told me that he will visit us next week")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(date) as dates").show(false)
+-----------------------------------------------+
|dates |
+-----------------------------------------------+
|[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
|[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
+-----------------------------------------------+
MultiDocumentAssembler
Prepares data into a format that is processable by Spark NLP. This is the entry point for
every Spark NLP pipeline. The MultiDocumentAssembler
can read either a String
column or an
Array[String]
. Additionally, MultiDocumentAssembler.setCleanupMode can be used to
pre-process the text (Default: disabled
). For possible options please refer to the parameters
section.
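For instance, a cleanup mode could be applied to both columns roughly like this (the "shrink" mode is used as an example; see the parameters section for the full list):
import com.johnsnowlabs.nlp.MultiDocumentAssembler
// Sketch: assemble two text columns and pre-process them with a cleanup mode.
val assemblerWithCleanup = new MultiDocumentAssembler()
  .setInputCols("text", "text2")
  .setOutputCols("document1", "document2")
  .setCleanupMode("shrink")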
For more extended examples on document pre-processing see the Examples.
Input Annotator Types: NONE
Output Annotator Type: DOCUMENT
Python API: MultiDocumentAssembler | Scala API: MultiDocumentAssembler | Source: MultiDocumentAssembler |
Show Example
import sparknlp
from sparknlp.base import *
from pyspark.ml import Pipeline
data = spark.createDataFrame([["Spark NLP is an open-source text processing library."], ["Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark"]]).toDF("text", "text2")
documentAssembler = MultiDocumentAssembler().setInputCols(["text", "text2"]).setOutputCols(["document1", "document2"])
result = documentAssembler.transform(data)
result.select("document1").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document1 |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
result.select("document1").printSchema()
root
|-- document1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
import spark.implicits._
import com.johnsnowlabs.nlp.MultiDocumentAssembler
val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val multiDocumentAssembler = new MultiDocumentAssembler().setInputCols("text").setOutputCols("document")
val result = multiDocumentAssembler.transform(data)
result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
result.select("document").printSchema
root
|-- document: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
NGramGenerator
A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
For more extended examples see the Examples and the NGramGeneratorTestSpec.
Input Annotator Types: TOKEN
Output Annotator Type: CHUNK
Python API: NGramGenerator | Scala API: NGramGenerator | Source: NGramGenerator |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
nGrams = NGramGenerator() \
.setInputCols(["token"]) \
.setOutputCol("ngrams") \
.setN(2)
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
nGrams
])
data = spark.createDataFrame([["This is my sentence."]]).toDF("text")
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(ngrams) as result").show(truncate=False)
+------------------------------------------------------------+
|result |
+------------------------------------------------------------+
|[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []] |
|[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []] |
|[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
|[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
+------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.NGramGenerator
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val nGrams = new NGramGenerator()
.setInputCols("token")
.setOutputCol("ngrams")
.setN(2)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
nGrams
))
val data = Seq("This is my sentence.").toDF("text")
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(ngrams) as result").show(false)
+------------------------------------------------------------+
|result |
+------------------------------------------------------------+
|[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []] |
|[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []] |
|[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
|[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
+------------------------------------------------------------+
NerConverter
Converts an IOB or IOB2 representation of NER to a user-friendly one,
by associating the tokens of recognized entities and their label. Results in CHUNK
Annotation type.
NER chunks can then be filtered by setting a whitelist with setWhiteList
.
Chunks with no associated entity (tagged “O”) are filtered.
See also Inside–outside–beginning (tagging) for more information.
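For example, keeping only LOC chunks with the whitelist mentioned above could look like this sketch:
import com.johnsnowlabs.nlp.annotators.ner.NerConverter
// Sketch: only chunks whose entity label is in the whitelist are kept ("LOC" is an example).
val locOnlyConverter = new NerConverter()
  .setInputCols("sentence", "token", "ner")
  .setOutputCol("entities")
  .setWhiteList("LOC")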
Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY
Output Annotator Type: CHUNK
Python API: NerConverter | Scala API: NerConverter | Source: NerConverter |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# This is a continuation of the example of the NerDLModel. See that class
# on how to extract the entities.
# The output of the NerDLModel follows the Annotator schema and can be converted like so:
#
# result.selectExpr("explode(ner)").show(truncate=False)
# +----------------------------------------------------+
# |col |
# +----------------------------------------------------+
# |[named_entity, 0, 2, B-ORG, [word -> U.N], []] |
# |[named_entity, 3, 3, O, [word -> .], []] |
# |[named_entity, 5, 12, O, [word -> official], []] |
# |[named_entity, 14, 18, B-PER, [word -> Ekeus], []] |
# |[named_entity, 20, 24, O, [word -> heads], []] |
# |[named_entity, 26, 28, O, [word -> for], []] |
# |[named_entity, 30, 36, B-LOC, [word -> Baghdad], []]|
# |[named_entity, 37, 37, O, [word -> .], []] |
# +----------------------------------------------------+
#
# After the converter is used:
converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("entities")
converter.transform(result).selectExpr("explode(entities)").show(truncate=False)
+------------------------------------------------------------------------+
|col |
+------------------------------------------------------------------------+
|[chunk, 0, 2, U.N, [entity -> ORG, sentence -> 0, chunk -> 0], []] |
|[chunk, 14, 18, Ekeus, [entity -> PER, sentence -> 0, chunk -> 1], []] |
|[chunk, 30, 36, Baghdad, [entity -> LOC, sentence -> 0, chunk -> 2], []]|
+------------------------------------------------------------------------+
// This is a continuation of the example of the [[com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel NerDLModel]]. See that class
// on how to extract the entities.
// The output of the NerDLModel follows the Annotator schema and can be converted like so:
//
// result.selectExpr("explode(ner)").show(false)
// +----------------------------------------------------+
// |col |
// +----------------------------------------------------+
// |[named_entity, 0, 2, B-ORG, [word -> U.N], []] |
// |[named_entity, 3, 3, O, [word -> .], []] |
// |[named_entity, 5, 12, O, [word -> official], []] |
// |[named_entity, 14, 18, B-PER, [word -> Ekeus], []] |
// |[named_entity, 20, 24, O, [word -> heads], []] |
// |[named_entity, 26, 28, O, [word -> for], []] |
// |[named_entity, 30, 36, B-LOC, [word -> Baghdad], []]|
// |[named_entity, 37, 37, O, [word -> .], []] |
// +----------------------------------------------------+
//
// After the converter is used:
val converter = new NerConverter()
.setInputCols("sentence", "token", "ner")
.setOutputCol("entities")
.setPreservePosition(false)
converter.transform(result).selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------+
|col |
+------------------------------------------------------------------------+
|[chunk, 0, 2, U.N, [entity -> ORG, sentence -> 0, chunk -> 0], []] |
|[chunk, 14, 18, Ekeus, [entity -> PER, sentence -> 0, chunk -> 1], []] |
|[chunk, 30, 36, Baghdad, [entity -> LOC, sentence -> 0, chunk -> 2], []]|
+------------------------------------------------------------------------+
NerCrf
Algorithm for training a Named Entity Recognition Model
For instantiated/pretrained models, see NerCrfModel.
This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning
algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB with
Annotation
type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS
and an
additional label column of annotator type NAMED_ENTITY
.
Excluding the label, these columns can be produced with, for example,
- a SentenceDetector,
- a Tokenizer,
- a PerceptronModel and
- a WordEmbeddingsModel (any word embeddings can be chosen, e.g. BertEmbeddings for BERT-based embeddings).
Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.
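A minimal sketch of adding such a dictionary (the file path and the "," delimiter are assumptions for illustration):
import com.johnsnowlabs.nlp.annotator.NerCrfApproach
// Sketch: provide an external features dictionary to the approach for better accuracy.
val nerTaggerWithFeatures = new NerCrfApproach()
  .setInputCols("sentence", "token", "pos", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setExternalFeatures("src/test/resources/ner-corpus/dict.txt", ",")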
For extended examples of usage, see the Examples and the NerCrfApproachTestSpec.
Input Annotator Types: DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS
Output Annotator Type: NAMED_ENTITY
Python API: NerCrfApproach | Scala API: NerCrfApproach | Source: NerCrfApproach |
Show Example
# This CoNLL dataset already includes the sentence, token, pos and label column with their respective annotator types.
# If a custom dataset is used, these need to be defined.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
nerTagger = NerCrfApproach() \
.setInputCols(["sentence", "token", "pos", "embeddings"]) \
.setLabelColumn("label") \
.setMinEpochs(1) \
.setMaxEpochs(3) \
.setC0(34) \
.setL2(3.0) \
.setOutputCol("ner")
pipeline = Pipeline().setStages([
documentAssembler,
embeddings,
nerTagger
])
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
pipelineModel = pipeline.fit(trainingData)
// This CoNLL dataset already includes the sentence, token, pos and label column with their respective annotator types.
// If a custom dataset is used, these need to be defined.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotator.NerCrfApproach
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
.setCaseSensitive(false)
val nerTagger = new NerCrfApproach()
.setInputCols("sentence", "token", "pos", "embeddings")
.setLabelColumn("label")
.setMinEpochs(1)
.setMaxEpochs(3)
.setC0(34)
.setL2(3.0)
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
embeddings,
nerTagger
))
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
val pipelineModel = pipeline.fit(trainingData)
NerDL
This Named Entity Recognition annotator allows training a generic NER model based on neural networks.
The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art in most datasets.
For instantiated/pretrained models, see NerDLModel.
The training data should be a labeled Spark Dataset, in the format of CoNLL
2003 IOB with Annotation
type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS
and an
additional label column of annotator type NAMED_ENTITY
.
Excluding the label, these columns can be produced with, for example,
- a SentenceDetector,
- a Tokenizer,
- a PerceptronModel and
- a WordEmbeddingsModel (any word embeddings can be chosen, e.g. BertEmbeddings for BERT-based embeddings).
Setting a test dataset to monitor model metrics can be done with .setTestDataset
. The method
expects a path to a parquet file containing a dataframe that has the same required columns as
the training dataframe. The pre-processing steps for the training dataframe should also be
applied to the test dataframe. The following example will show how to create the test dataset
with a CoNLL dataset:
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = WordEmbeddingsModel
.pretrained()
.setInputCols("document", "token")
.setOutputCol("embeddings")
val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
val conll = CoNLL()
val Array(train, test) = conll
.readDataset(spark, "src/test/resources/conll2003/eng.train")
.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
.fit(test)
.transform(test)
.write
.mode("overwrite")
.parquet("test_data")
val nerTagger = new NerDLApproach()
.setInputCols("document", "token", "embeddings")
.setLabelColumn("label")
.setOutputCol("ner")
.setTestDataset("test_data")
For extended examples of usage, see the Examples and the NerDLSpec.
Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS
Output Annotator Type: NAMED_ENTITY
Python API: NerDLApproach | Scala API: NerDLApproach | Source: NerDLApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# First extract the prerequisites for the NerDLApproach
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = BertEmbeddings.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
# Then the training can start
nerTagger = NerDLApproach() \
.setInputCols(["sentence", "token", "embeddings"]) \
.setLabelColumn("label") \
.setOutputCol("ner") \
.setMaxEpochs(1) \
.setRandomSeed(0) \
.setVerbose(0)
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
])
# We use the text and labels from the CoNLL dataset
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
pipelineModel = pipeline.fit(trainingData)
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline
// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
// Then the training can start
val nerTagger = new NerDLApproach()
.setInputCols("sentence", "token", "embeddings")
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(1)
.setRandomSeed(0)
.setVerbose(0)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
))
// We use the text and labels from the CoNLL dataset
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
val pipelineModel = pipeline.fit(trainingData)
NerOverwriter
Overwrites entities of specified strings.
The input for this Annotator has to be entities that are already extracted, of Annotator type NAMED_ENTITY
.
The strings specified with setStopWords
will have new entities assigned to them, as specified with setNewResult
.
Input Annotator Types: NAMED_ENTITY
Output Annotator Type: NAMED_ENTITY
Python API: NerOverwriter | Scala API: NerOverwriter | Source: NerOverwriter |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# First extract the prerequisite Entities
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("bert")
nerTagger = NerDLModel.pretrained() \
.setInputCols(["sentence", "token", "bert"]) \
.setOutputCol("ner")
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
])
data = spark.createDataFrame([["Spark NLP Crosses Five Million Downloads, John Snow Labs Announces."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(ner)").show(truncate=False)
# +------------------------------------------------------+
# |col |
# +------------------------------------------------------+
# |[named_entity, 0, 4, B-ORG, [word -> Spark], []] |
# |[named_entity, 6, 8, I-ORG, [word -> NLP], []] |
# |[named_entity, 10, 16, O, [word -> Crosses], []] |
# |[named_entity, 18, 21, O, [word -> Five], []] |
# |[named_entity, 23, 29, O, [word -> Million], []] |
# |[named_entity, 31, 39, O, [word -> Downloads], []] |
# |[named_entity, 40, 40, O, [word -> ,], []] |
# |[named_entity, 42, 45, B-ORG, [word -> John], []] |
# |[named_entity, 47, 50, I-ORG, [word -> Snow], []] |
# |[named_entity, 52, 55, I-ORG, [word -> Labs], []] |
# |[named_entity, 57, 65, I-ORG, [word -> Announces], []]|
# |[named_entity, 66, 66, O, [word -> .], []] |
# +------------------------------------------------------+
# The recognized entities can then be overwritten
nerOverwriter = NerOverwriter() \
.setInputCols(["ner"]) \
.setOutputCol("ner_overwritten") \
.setStopWords(["Million"]) \
.setNewResult("B-CARDINAL")
nerOverwriter.transform(result).selectExpr("explode(ner_overwritten)").show(truncate=False)
+---------------------------------------------------------+
|col |
+---------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []] |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []] |
|[named_entity, 10, 16, O, [word -> Crosses], []] |
|[named_entity, 18, 21, O, [word -> Five], []] |
|[named_entity, 23, 29, B-CARDINAL, [word -> Million], []]|
|[named_entity, 31, 39, O, [word -> Downloads], []] |
|[named_entity, 40, 40, O, [word -> ,], []] |
|[named_entity, 42, 45, B-ORG, [word -> John], []] |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []] |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []] |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []] |
|[named_entity, 66, 66, O, [word -> .], []] |
+---------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import com.johnsnowlabs.nlp.annotators.ner.NerOverwriter
import org.apache.spark.ml.Pipeline
// First extract the prerequisite Entities
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("bert")
val nerTagger = NerDLModel.pretrained()
.setInputCols("sentence", "token", "bert")
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
))
val data = Seq("Spark NLP Crosses Five Million Downloads, John Snow Labs Announces.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(ner)").show(false)
/*
+------------------------------------------------------+
|col |
+------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []] |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []] |
|[named_entity, 10, 16, O, [word -> Crosses], []] |
|[named_entity, 18, 21, O, [word -> Five], []] |
|[named_entity, 23, 29, O, [word -> Million], []] |
|[named_entity, 31, 39, O, [word -> Downloads], []] |
|[named_entity, 40, 40, O, [word -> ,], []] |
|[named_entity, 42, 45, B-ORG, [word -> John], []] |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []] |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []] |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []]|
|[named_entity, 66, 66, O, [word -> .], []] |
+------------------------------------------------------+
*/
// The recognized entities can then be overwritten
val nerOverwriter = new NerOverwriter()
.setInputCols("ner")
.setOutputCol("ner_overwritten")
.setStopWords(Array("Million"))
.setNewResult("B-CARDINAL")
nerOverwriter.transform(result).selectExpr("explode(ner_overwritten)").show(false)
+---------------------------------------------------------+
|col |
+---------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []] |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []] |
|[named_entity, 10, 16, O, [word -> Crosses], []] |
|[named_entity, 18, 21, O, [word -> Five], []] |
|[named_entity, 23, 29, B-CARDINAL, [word -> Million], []]|
|[named_entity, 31, 39, O, [word -> Downloads], []] |
|[named_entity, 40, 40, O, [word -> ,], []] |
|[named_entity, 42, 45, B-ORG, [word -> John], []] |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []] |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []] |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []] |
|[named_entity, 66, 66, O, [word -> .], []] |
+---------------------------------------------------------+
Normalizer
Annotator that cleans out tokens. Requires stems, hence tokens. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary
For extended examples of usage, see the Examples.
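A rough sketch of the dictionary-based transformation, assuming a slang/correction dictionary in a comma-delimited file and a setSlangDictionary(path, delimiter) setter (both the file and the setter signature are assumptions here, so verify against the API reference):
import com.johnsnowlabs.nlp.annotator.Normalizer
// Sketch only: "slangs.txt" is a hypothetical file where each line maps a raw form
// to its replacement, e.g. "gr8,great".
val normalizerWithDict = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(true)
  .setSlangDictionary("src/test/resources/spell/slangs.txt", ",")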
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: Normalizer | Scala API: Normalizer | Source: Normalizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized") \
.setLowercase(True) \
.setCleanupPatterns(["""[^\w\d\s]"""]) # remove punctuations (keep alphanumeric chars)
# if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
normalizer
])
data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much."]]) \
.toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = False)
+----------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Normalizer, Tokenizer}
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
.setLowercase(true)
.setCleanupPatterns(Array("""[^\w\d\s]""")) // remove punctuations (keep alphanumeric chars)
// if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
normalizer
))
val data = Seq("John and Peter are brothers. However they don't support each other that much.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("normalized.result").show(truncate = false)
+----------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
NorvigSweeting Spellchecker
Trains an annotator that retrieves tokens and makes corrections automatically if they are not found in an English dictionary.
The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and
dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster
(than the standard approach with deletes + transposes + replaces + inserts) and language independent.
A dictionary of correct spellings must be provided with setDictionary
as a text file, where each word is parsed by a regex pattern.
Inspired by the Norvig model and SymSpell.
For instantiated/pretrained models, see NorvigSweetingModel.
For extended examples of usage, see the NorvigSweetingTestSpec.
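For comparison, the pretrained model can be used directly without providing a dictionary (pretrained() with no arguments downloads the default English model):
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
// Sketch: a pretrained spell checker instead of a dictionary-trained approach.
val pretrainedSpellChecker = NorvigSweetingModel.pretrained()
  .setInputCols("token")
  .setOutputCol("spell")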
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: NorvigSweetingApproach | Scala API: NorvigSweetingApproach | Source: NorvigSweetingApproach |
Show Example
# In this example, the dictionary `"words.txt"` has the form of
#
# ...
# gummy
# gummic
# gummier
# gummiest
# gummiferous
# ...
#
# This dictionary is then set to be the basis of the spell checker.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
spellChecker = NorvigSweetingApproach() \
.setInputCols(["token"]) \
.setOutputCol("spell") \
.setDictionary("src/test/resources/spell/words.txt")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
pipelineModel = pipeline.fit(trainingData)
// In this example, the dictionary `"words.txt"` has the form of
//
// ...
// gummy
// gummic
// gummier
// gummiest
// gummiferous
// ...
//
// This dictionary is then set to be the basis of the spell checker.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val spellChecker = new NorvigSweetingApproach()
.setInputCols("token")
.setOutputCol("spell")
.setDictionary("src/test/resources/spell/words.txt")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
spellChecker
))
val pipelineModel = pipeline.fit(trainingData)
POSTagger (Part of speech tagger)
Trains an averaged Perceptron model to tag words with their part of speech. Sets a POS tag to each word within a sentence.
For pretrained models please see the PerceptronModel.
The training data needs to be in a Spark DataFrame, where the column needs to consist of
Annotations of type POS
. The Annotation
needs to have its member result
set to the POS tag and a "word" key
mapping to the corresponding word inside its member metadata
.
This DataFrame for training can easily be created by the helper class POS.
POS().readDataset(spark, datasetPath).selectExpr("explode(tags) as tags").show(false)
+---------------------------------------------+
|tags |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []] |
|[pos, 7, 12, NNP, [word -> Vinken], []] |
|[pos, 14, 14, ,, [word -> ,], []] |
|[pos, 31, 34, MD, [word -> will], []] |
|[pos, 36, 39, VB, [word -> join], []] |
|[pos, 41, 43, DT, [word -> the], []] |
|[pos, 45, 49, NN, [word -> board], []] |
...
For extended examples of usage, see the Examples and PerceptronApproach tests.
Input Annotator Types: TOKEN, DOCUMENT
Output Annotator Type: POS
Python API: PerceptronApproach | Scala API: PerceptronApproach | Source: PerceptronApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
trainingPerceptronDF = POS().readDataset(spark, datasetPath)
trainedPos = PerceptronApproach() \
.setInputCols(["document", "token"]) \
.setOutputCol("pos") \
.setPosColumn("tags") \
.fit(trainingPerceptronDF)
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
trainedPos
])
data = spark.createDataFrame([["To be or not to be, is this the question?"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("pos.result").show(truncate=False)
+--------------------------------------------------+
|result |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
val trainingPerceptronDF = POS().readDataset(spark, datasetPath)
val trainedPos = new PerceptronApproach()
.setInputCols("document", "token")
.setOutputCol("pos")
.setPosColumn("tags")
.fit(trainingPerceptronDF)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
trainedPos
))
val data = Seq("To be or not to be, is this the question?").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("pos.result").show(false)
+--------------------------------------------------+
|result |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
PromptAssembler
Assembles a sequence of messages into a single string using a template. These strings can then be used as prompts for large language models.
This annotator expects an array of two-tuples as the type of the input column (one array of tuples per row). The first element of the tuples should be the role and the second element is the text of the message. Possible roles are “system”, “user” and “assistant”.
An assistant header can be added to the end of the generated string by using
setAddAssistant(true)
.
At the moment, this annotator uses llama.cpp as a backend to parse and apply the templates. llama.cpp uses basic pattern matching to determine the type of the template, then applies a basic version of the template to the messages. This means that more advanced templates are not supported.
For an extended example see the example notebook.
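A brief sketch of the Scala side (the chat template string is elided here; see the llama3.1 template in the Python example below):
import com.johnsnowlabs.nlp.PromptAssembler
// Sketch: assemble chat messages into a single prompt and append an assistant header.
val template: String = ??? // placeholder for a chat template string
val promptAssembler = new PromptAssembler()
  .setInputCol("messages")
  .setOutputCol("prompt")
  .setChatTemplate(template)
  .setAddAssistant(true)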
Input Annotator Types: NONE
Output Annotator Type: DOCUMENT
Python API: PromptAssembler | Scala API: PromptAssembler | Source: PromptAssembler |
Show Example
from sparknlp.base import *
messages = [
[
("system", "You are a helpful assistant."),
("assistant", "Hello there, how can I help you?"),
("user", "I need help with organizing my room."),
]
]
df = spark.createDataFrame([messages]).toDF("messages")
# llama3.1
template = (
"{{- bos_token }} {%- if custom_tools is defined %} {%- set tools = custom_tools %} {%- "
"endif %} {%- if not tools_in_user_message is defined %} {%- set tools_in_user_message = true %} {%- "
'endif %} {%- if not date_string is defined %} {%- set date_string = "26 Jul 2024" %} {%- endif %} '
"{%- if not tools is defined %} {%- set tools = none %} {%- endif %} {#- This block extracts the "
"system message, so we can slot it into the right place. #} {%- if messages[0]['role'] == 'system' %}"
" {%- set system_message = messages[0]['content']|trim %} {%- set messages = messages[1:] %} {%- else"
' %} {%- set system_message = "" %} {%- endif %} {#- System message + builtin tools #} {{- '
'"<|start_header_id|>system<|end_header_id|>\\n\n" }} {%- if builtin_tools is defined or tools is '
'not none %} {{- "Environment: ipython\\n" }} {%- endif %} {%- if builtin_tools is defined %} {{- '
'"Tools: " + builtin_tools | reject(\'equalto\', \'code_interpreter\') | join(", ") + "\\n\n"}} '
'{%- endif %} {{- "Cutting Knowledge Date: December 2023\\n" }} {{- "Today Date: " + date_string '
'+ "\\n\n" }} {%- if tools is not none and not tools_in_user_message %} {{- "You have access to '
'the following functions. To call a function, please respond with JSON for a function call." }} {{- '
'\'Respond in the format {"name": function name, "parameters": dictionary of argument name and its'
' value}.\' }} {{- "Do not use variables.\\n\n" }} {%- for t in tools %} {{- t | tojson(indent=4) '
'}} {{- "\\n\n" }} {%- endfor %} {%- endif %} {{- system_message }} {{- "<|eot_id|>" }} {#- '
"Custom tools are passed in a user message with some extra guidance #} {%- if tools_in_user_message "
"and not tools is none %} {#- Extract the first user message so we can plug it in here #} {%- if "
"messages | length != 0 %} {%- set first_user_message = messages[0]['content']|trim %} {%- set "
'messages = messages[1:] %} {%- else %} {{- raise_exception("Cannot put tools in the first user '
"message when there's no first user message!\") }} {%- endif %} {{- "
"'<|start_header_id|>user<|end_header_id|>\\n\n' -}} {{- \"Given the following functions, please "
'respond with a JSON for a function call " }} {{- "with its proper arguments that best answers the '
'given prompt.\\n\n" }} {{- \'Respond in the format {"name": function name, "parameters": '
'dictionary of argument name and its value}.\' }} {{- "Do not use variables.\\n\n" }} {%- for t in '
'tools %} {{- t | tojson(indent=4) }} {{- "\\n\n" }} {%- endfor %} {{- first_user_message + '
"\"<|eot_id|>\"}} {%- endif %} {%- for message in messages %} {%- if not (message.role == 'ipython' "
"or message.role == 'tool' or 'tool_calls' in message) %} {{- '<|start_header_id|>' + message['role']"
" + '<|end_header_id|>\\n\n'+ message['content'] | trim + '<|eot_id|>' }} {%- elif 'tool_calls' in "
'message %} {%- if not message.tool_calls|length == 1 %} {{- raise_exception("This model only '
'supports single tool-calls at once!") }} {%- endif %} {%- set tool_call = message.tool_calls[0]'
".function %} {%- if builtin_tools is defined and tool_call.name in builtin_tools %} {{- "
"'<|start_header_id|>assistant<|end_header_id|>\\n\n' -}} {{- \"<|python_tag|>\" + tool_call.name + "
'".call(" }} {%- for arg_name, arg_val in tool_call.arguments | items %} {{- arg_name + \'="\' + '
'arg_val + \'"\' }} {%- if not loop.last %} {{- ", " }} {%- endif %} {%- endfor %} {{- ")" }} {%- '
"else %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\n' -}} {{- '{\"name\": \"' + "
'tool_call.name + \'", \' }} {{- \'"parameters": \' }} {{- tool_call.arguments | tojson }} {{- "}" '
"}} {%- endif %} {%- if builtin_tools is defined %} {#- This means we're in ipython mode #} {{- "
'"<|eom_id|>" }} {%- else %} {{- "<|eot_id|>" }} {%- endif %} {%- elif message.role == "tool" '
'or message.role == "ipython" %} {{- "<|start_header_id|>ipython<|end_header_id|>\\n\n" }} {%- '
"if message.content is mapping or message.content is iterable %} {{- message.content | tojson }} {%- "
'else %} {{- message.content }} {%- endif %} {{- "<|eot_id|>" }} {%- endif %} {%- endfor %} {%- if '
"add_generation_prompt %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\n' }} {%- endif %} "
)
prompt_assembler = (
PromptAssembler()
.setInputCol("messages")
.setOutputCol("prompt")
.setChatTemplate(template)
)
prompt_assembler.transform(df).select("prompt.result").show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello there, how can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI need help with organizing my room.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
// Batches (whole conversations) of arrays of messages
val data: Seq[Seq[(String, String)]] = Seq(
Seq(
("system", "You are a helpful assistant."),
("assistant", "Hello there, how can I help you?"),
("user", "I need help with organizing my room.")))
val dataDF = data.toDF("messages")
// llama3.1
val template =
"{{- bos_token }} {%- if custom_tools is defined %} {%- set tools = custom_tools %} {%- " +
"endif %} {%- if not tools_in_user_message is defined %} {%- set tools_in_user_message = true %} {%- " +
"endif %} {%- if not date_string is defined %} {%- set date_string = \"26 Jul 2024\" %} {%- endif %} " +
"{%- if not tools is defined %} {%- set tools = none %} {%- endif %} {#- This block extracts the " +
"system message, so we can slot it into the right place. #} {%- if messages[0]['role'] == 'system' %}" +
" {%- set system_message = messages[0]['content']|trim %} {%- set messages = messages[1:] %} {%- else" +
" %} {%- set system_message = \"\" %} {%- endif %} {#- System message + builtin tools #} {{- " +
"\"<|start_header_id|>system<|end_header_id|>\\n\\n\" }} {%- if builtin_tools is defined or tools is " +
"not none %} {{- \"Environment: ipython\\n\" }} {%- endif %} {%- if builtin_tools is defined %} {{- " +
"\"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}} " +
"{%- endif %} {{- \"Cutting Knowledge Date: December 2023\\n\" }} {{- \"Today Date: \" + date_string " +
"+ \"\\n\\n\" }} {%- if tools is not none and not tools_in_user_message %} {{- \"You have access to " +
"the following functions. To call a function, please respond with JSON for a function call.\" }} {{- " +
"'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its" +
" value}.' }} {{- \"Do not use variables.\\n\\n\" }} {%- for t in tools %} {{- t | tojson(indent=4) " +
"}} {{- \"\\n\\n\" }} {%- endfor %} {%- endif %} {{- system_message }} {{- \"<|eot_id|>\" }} {#- " +
"Custom tools are passed in a user message with some extra guidance #} {%- if tools_in_user_message " +
"and not tools is none %} {#- Extract the first user message so we can plug it in here #} {%- if " +
"messages | length != 0 %} {%- set first_user_message = messages[0]['content']|trim %} {%- set " +
"messages = messages[1:] %} {%- else %} {{- raise_exception(\"Cannot put tools in the first user " +
"message when there's no first user message!\") }} {%- endif %} {{- " +
"'<|start_header_id|>user<|end_header_id|>\\n\\n' -}} {{- \"Given the following functions, please " +
"respond with a JSON for a function call \" }} {{- \"with its proper arguments that best answers the " +
"given prompt.\\n\\n\" }} {{- 'Respond in the format {\"name\": function name, \"parameters\": " +
"dictionary of argument name and its value}.' }} {{- \"Do not use variables.\\n\\n\" }} {%- for t in " +
"tools %} {{- t | tojson(indent=4) }} {{- \"\\n\\n\" }} {%- endfor %} {{- first_user_message + " +
"\"<|eot_id|>\"}} {%- endif %} {%- for message in messages %} {%- if not (message.role == 'ipython' " +
"or message.role == 'tool' or 'tool_calls' in message) %} {{- '<|start_header_id|>' + message['role']" +
" + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }} {%- elif 'tool_calls' in " +
"message %} {%- if not message.tool_calls|length == 1 %} {{- raise_exception(\"This model only " +
"supports single tool-calls at once!\") }} {%- endif %} {%- set tool_call = message.tool_calls[0]" +
".function %} {%- if builtin_tools is defined and tool_call.name in builtin_tools %} {{- " +
"'<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}} {{- \"<|python_tag|>\" + tool_call.name + " +
"\".call(\" }} {%- for arg_name, arg_val in tool_call.arguments | items %} {{- arg_name + '=\"' + " +
"arg_val + '\"' }} {%- if not loop.last %} {{- \", \" }} {%- endif %} {%- endfor %} {{- \")\" }} {%- " +
"else %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}} {{- '{\"name\": \"' + " +
"tool_call.name + '\", ' }} {{- '\"parameters\": ' }} {{- tool_call.arguments | tojson }} {{- \"}\" " +
"}} {%- endif %} {%- if builtin_tools is defined %} {#- This means we're in ipython mode #} {{- " +
"\"<|eom_id|>\" }} {%- else %} {{- \"<|eot_id|>\" }} {%- endif %} {%- elif message.role == \"tool\" " +
"or message.role == \"ipython\" %} {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }} {%- " +
"if message.content is mapping or message.content is iterable %} {{- message.content | tojson }} {%- " +
"else %} {{- message.content }} {%- endif %} {{- \"<|eot_id|>\" }} {%- endif %} {%- endfor %} {%- if " +
"add_generation_prompt %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }} {%- endif %} "
val promptAssembler = new PromptAssembler()
.setInputCol("messages")
.setOutputCol("prompt")
.setChatTemplate(template)
promptAssembler.transform(dataDF).select("prompt.result").show(truncate = false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello there, how can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI need help with organizing my room.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
RecursiveTokenizer
Tokenizes raw text recursively based on a handful of definable rules.
Unlike the Tokenizer, the RecursiveTokenizer operates based on these array string parameters only (see the sketch below):
- prefixes: Strings that will be split when found at the beginning of a token.
- suffixes: Strings that will be split when found at the end of a token.
- infixes: Strings that will be split when found in the middle of a token.
- whitelist: Whitelist of strings not to split.
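For instance, a minimal Python sketch of setting these parameters (the rule values are hypothetical and should be adapted to your data):
from sparknlp.annotator import RecursiveTokenizer
# Hypothetical rule sets, for illustration only
tokenizer = RecursiveTokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token") \
.setPrefixes(["\"", "(", "[", "\n"]) \
.setSuffixes([".", ",", "?", ")", "]", "\n"]) \
.setInfixes(["\n", "(", ")"]) \
.setWhitelist(["it's", "don't", "won't"])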
For extended examples of usage, see the Examples and the TokenizerTestSpec.
Input Annotator Types: DOCUMENT
Output Annotator Type: TOKEN
Python API: RecursiveTokenizer | Scala API: RecursiveTokenizer | Source: RecursiveTokenizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = RecursiveTokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer
])
data = spark.createDataFrame([["One, after the Other, (and) again. PO, QAM,"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("token.result").show(truncate=False)
+------------------------------------------------------------------+
|result |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.RecursiveTokenizer
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new RecursiveTokenizer()
.setInputCols("document")
.setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer
))
val data = Seq("One, after the Other, (and) again. PO, QAM,").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("token.result").show(false)
+------------------------------------------------------------------+
|result |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+
RegexMatcher
Uses rules to match a set of regular expressions and associate them with a provided identifier.
A rule consists of a regex pattern and an identifier, delimited by a character of choice. For example, the rule "\d{4}\/\d\d\/\d\d,date" (delimited by ",") will match strings like "1970/01/01" and associate them with the identifier "date".
Rules must be provided either with setRules (followed by setDelimiter), as shown below, or from an external file. To use an external file, a dictionary of predefined regular expressions must be provided with setExternalRules. The dictionary can be set as a delimited text file.
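A minimal Python sketch of the in-code variant (the rule strings are illustrative only):
# Each rule is "pattern<delimiter>identifier"
rules = ["\\d{4}/\\d\\d/\\d\\d,date", "\\d{2}/\\d\\d/\\d\\d,short_date"]
regexMatcher = RegexMatcher() \
.setRules(rules) \
.setDelimiter(",") \
.setInputCols(["sentence"]) \
.setOutputCol("regex") \
.setStrategy("MATCH_ALL")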
Pretrained pipelines are available for this module, see Pipelines.
For extended examples of usage, see the Examples and the RegexMatcherTestSpec.
Input Annotator Types: DOCUMENT
Output Annotator Type: CHUNK
Python API: RegexMatcher | Scala API: RegexMatcher | Source: RegexMatcher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, the `rules.txt` has the form of
#
# the\s\w+, followed by 'the'
# ceremonies, ceremony
#
# where each regex is separated by the identifier by `","`
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
regexMatcher = RegexMatcher() \
.setExternalRules("src/test/resources/regex-matcher/rules.txt", ",") \
.setInputCols(["sentence"]) \
.setOutputCol("regex") \
.setStrategy("MATCH_ALL")
pipeline = Pipeline().setStages([documentAssembler, sentence, regexMatcher])
data = spark.createDataFrame([[
"My first sentence with the first rule. This is my second sentence with ceremonies rule."
]]).toDF("text")
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(regex) as result").show(truncate=False)
+--------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []] |
+--------------------------------------------------------------------------------------------+
// In this example, the `rules.txt` has the form of
//
// the\s\w+, followed by 'the'
// ceremonies, ceremony
//
// where each regex is separated by the identifier by `","`
import ResourceHelper.spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.RegexMatcher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val regexMatcher = new RegexMatcher()
.setExternalRules("src/test/resources/regex-matcher/rules.txt", ",")
.setInputCols(Array("sentence"))
.setOutputCol("regex")
.setStrategy("MATCH_ALL")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher))
val data = Seq(
"My first sentence with the first rule. This is my second sentence with ceremonies rule."
).toDF("text")
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(regex) as result").show(false)
+--------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []] |
+--------------------------------------------------------------------------------------------+
RegexTokenizer
A tokenizer that splits text by a regex pattern.
The delimiting pattern, i.e. how the tokens should be split, needs to be set with setPattern. By default this pattern is \s+, which means that tokens are split by one or more whitespace characters.
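Any other delimiting regex can be supplied instead; for example, a sketch that splits on commas as well as whitespace (the pattern is chosen purely for illustration):
regexTokenizer = RegexTokenizer() \
.setInputCols(["document"]) \
.setOutputCol("regexToken") \
.setPattern("[,\\s]+")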
Input Annotator Types: DOCUMENT
Output Annotator Type: TOKEN
Python API: RegexTokenizer | Scala API: RegexTokenizer | Source: RegexTokenizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
regexTokenizer = RegexTokenizer() \
.setInputCols(["document"]) \
.setOutputCol("regexToken") \
.setToLowercase(True) \
.setPattern("\\s+")
pipeline = Pipeline().setStages([
documentAssembler,
regexTokenizer
])
data = spark.createDataFrame([["This is my first sentence.\nThis is my second."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("regexToken.result").show(truncate=False)
+-------------------------------------------------------+
|result |
+-------------------------------------------------------+
|[this, is, my, first, sentence., this, is, my, second.]|
+-------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.RegexTokenizer
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val regexTokenizer = new RegexTokenizer()
.setInputCols("document")
.setOutputCol("regexToken")
.setToLowercase(true)
.setPattern("\\s+")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
regexTokenizer
))
val data = Seq("This is my first sentence.\nThis is my second.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("regexToken.result").show(false)
+-------------------------------------------------------+
|result |
+-------------------------------------------------------+
|[this, is, my, first, sentence., this, is, my, second.]|
+-------------------------------------------------------+
SentenceDetector
Annotator that detects sentence boundaries using regular expressions.
The following characters are checked as sentence boundaries:
- Lists (“(i), (ii)”, “(a), (b)”, “1., 2.”)
- Numbers
- Abbreviations
- Punctuations
- Multiple Periods
- Geo-Locations/Coordinates (“N°. 1026.253.553.”)
- Ellipsis (“…”)
- In-between punctuations
- Quotation marks
- Exclamation Points
- Basic Breakers (“.”, “;”)
For the explicit regular expressions used for detection, refer to source of PragmaticContentFormatter.
To add additional custom bounds, the parameter customBounds
can be set with an array:
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setCustomBounds(Array("\n\n"))
If only the custom bounds should be used, then the parameter useCustomBoundsOnly should be set to true.
Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to true.
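A minimal Python sketch combining these options (column names follow the example below):
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setCustomBounds(["\n\n"]) \
.setUseCustomBoundsOnly(True) \
.setExplodeSentences(True)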
For extended examples of usage, see the Examples.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: SentenceDetector | Scala API: SentenceDetector | Source: SentenceDetector |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setCustomBounds(["\n\n"])
pipeline = Pipeline().setStages([
documentAssembler,
sentence
])
data = spark.createDataFrame([["This is my first sentence. This my second. How about a third?"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(sentence) as sentences").show(truncate=False)
+------------------------------------------------------------------+
|sentences |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []] |
|[document, 43, 60, How about a third?, [sentence -> 2], []] |
+------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setCustomBounds(Array("\n\n"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence
))
val data = Seq("This is my first sentence. This my second. How about a third?").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(sentence) as sentences").show(false)
+------------------------------------------------------------------+
|sentences |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []] |
|[document, 43, 60, How about a third?, [sentence -> 2], []] |
+------------------------------------------------------------------+
SentenceDetectorDL
Trains an annotator that detects sentence boundaries using a deep learning approach.
For pretrained models see SentenceDetectorDLModel.
Currently, only the CNN model is supported for training, but in the future the architecture of the model can be set with setModelArchitecture.
The default model "cnn" is based on the paper Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection (2020, Stefan Schweter, Sajawel Ahmed), using a CNN architecture. We also modified the original implementation slightly to cover broken sentences and some impossible end-of-line characters.
Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to true.
For extended examples of usage, see the Examples and the SentenceDetectorDLSpec.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: SentenceDetectorDLApproach | Scala API: SentenceDetectorDLApproach | Source: SentenceDetectorDLApproach |
Show Example
# The training process needs data, where each data point is a sentence.
# In this example the `train.txt` file has the form of
#
# ...
# Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
# His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
# ...
#
# where each line is one sentence.
# Training can then be started like so:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
trainingData = spark.read.text("train.txt").toDF("text")
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLApproach() \
.setInputCols(["document"]) \
.setOutputCol("sentences") \
.setEpochsNumber(100)
pipeline = Pipeline().setStages([documentAssembler, sentenceDetector])
model = pipeline.fit(trainingData)
// The training process needs data, where each data point is a sentence.
// In this example the `train.txt` file has the form of
//
// ...
// Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
// His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
// ...
//
// where each line is one sentence.
// Training can then be started like so:
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLApproach
import org.apache.spark.ml.Pipeline
val trainingData = spark.read.text("train.txt").toDF("text")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetectorDLApproach()
.setInputCols(Array("document"))
.setOutputCol("sentences")
.setEpochsNumber(100)
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector))
val model = pipeline.fit(trainingData)
SentenceEmbeddings
Converts the results from WordEmbeddings, BertEmbeddings, or ElmoEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).
This can be configured with setPoolingStrategy, which can be either "AVERAGE" or "SUM".
For more extended examples see the Examples and the SentenceEmbeddingsTestSpec.
TIP: Here is how you can explode and convert these embeddings into Vectors, or what’s known as a Feature column, so they can be used in Spark ML regression or clustering functions:
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import explode, udf
# Let's create a UDF to take an array of embeddings and output Vectors
@udf(returnType=VectorUDT())
def convertToVectorUDF(matrix):
    return Vectors.dense([float(x) for x in matrix])
# Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode("sentence_embeddings.embeddings").alias("sentence_embedding")) \
.withColumn("features", convertToVectorUDF("sentence_embedding"))
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{explode, udf}
// Let's create a UDF to take array of embeddings and output Vectors
val convertToVectorUDF = udf((matrix : Seq[Float]) => {
Vectors.dense(matrix.toArray.map(_.toDouble))
})
// Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode($"sentence_embeddings.embeddings").as("sentence_embedding"))
.withColumn("features", convertToVectorUDF($"sentence_embedding"))
Input Annotator Types: DOCUMENT, WORD_EMBEDDINGS
Output Annotator Type: SENTENCE_EMBEDDINGS
Note: If you choose document as your input for Tokenizer, WordEmbeddings/BertEmbeddings, and SentenceEmbeddings, then it averages/sums all the embeddings into one array of embeddings. However, if you choose sentence as inputCols, then SentenceEmbeddings generates one array of embeddings for each sentence.
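For the per-sentence case, a minimal sketch of the relevant pipeline stages (a variation on the example below, not part of the original example):
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
embeddingsSentence = SentenceEmbeddings() \
.setInputCols(["sentence", "embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")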
Python API: SentenceEmbeddings | Scala API: SentenceEmbeddings | Source: SentenceEmbeddings |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
embeddingsSentence = SentenceEmbeddings() \
.setInputCols(["document", "embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["sentence_embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsSentence,
embeddingsFinisher
])
data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.embeddings.SentenceEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("document", "token")
.setOutputCol("embeddings")
val embeddingsSentence = new SentenceEmbeddings()
.setInputCols(Array("document", "embeddings"))
.setOutputCol("sentence_embeddings")
.setPoolingStrategy("AVERAGE")
val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("sentence_embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsSentence,
embeddingsFinisher
))
val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
| result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+
SentimentDL
Trains a SentimentDL, an annotator for multi-class sentiment analysis.
In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is classifying whether a product review or a tweet is positive or negative.
For the instantiated/pretrained models, see SentimentDLModel.
Notes:
- This annotator accepts a label column of a single item in either type of String, Int, Float, or Double. So positive sentiment can be expressed as either "positive" or 0, negative sentiment as "negative" or 1.
- UniversalSentenceEncoder, BertSentenceEmbeddings, SentenceEmbeddings or other sentence-based embeddings can be used.
Setting a test dataset to monitor model metrics can be done with setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example shows how to create the test dataset:
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val embeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
preProcessingPipeline
.fit(test)
.transform(test)
.write
.mode("overwrite")
.parquet("test_data")
val classifier = new SentimentDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("sentiment")
.setLabelColumn("label")
.setTestDataset("test_data")
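A roughly equivalent sketch in Python, assuming the same column names and an input DataFrame named data (not part of the original Scala example):
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
embeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
preProcessingPipeline = Pipeline().setStages([documentAssembler, embeddings])
train, test = data.randomSplit([0.8, 0.2])
# Write the pre-processed test split to parquet so the approach can read it during training
preProcessingPipeline.fit(test).transform(test) \
.write.mode("overwrite").parquet("test_data")
classifier = SentimentDLApproach() \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("sentiment") \
.setLabelColumn("label") \
.setTestDataset("test_data")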
For extended examples of usage, see the Examples and the SentimentDLTestSpec.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: CATEGORY
Python API: SentimentDLApproach | Scala API: SentimentDLApproach | Source: SentimentDLApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, `sentiment.csv` is in the form
#
# text,label
# This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
# This was a terrible movie! The acting was bad really bad!,1
#
# The model can then be trained with
smallCorpus = spark.read.option("header", "True").csv("src/test/resources/classifier/sentiment.csv")
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
useEmbeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols(["document"]) \
.setOutputCol("sentence_embeddings")
docClassifier = SentimentDLApproach() \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("sentiment") \
.setLabelColumn("label") \
.setBatchSize(32) \
.setMaxEpochs(1) \
.setLr(5e-3) \
.setDropout(0.5)
pipeline = Pipeline() \
.setStages(
[
documentAssembler,
useEmbeddings,
docClassifier
]
)
pipelineModel = pipeline.fit(smallCorpus)
// In this example, `sentiment.csv` is in the form
//
// text,label
// This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
// This was a terrible movie! The acting was bad really bad!,1
//
// The model can then be trained with
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.{SentimentDLApproach, SentimentDLModel}
import org.apache.spark.ml.Pipeline
val smallCorpus = spark.read.option("header", "true").csv("src/test/resources/classifier/sentiment.csv")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val useEmbeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val docClassifier = new SentimentDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("sentiment")
.setLabelColumn("label")
.setBatchSize(32)
.setMaxEpochs(1)
.setLr(5e-3f)
.setDropout(0.5f)
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
useEmbeddings,
docClassifier
)
)
val pipelineModel = pipeline.fit(smallCorpus)
SentimentDetector
Trains a rule-based sentiment detector, which calculates a score based on predefined keywords.
A dictionary of predefined sentiment keywords must be provided with setDictionary, where each line is a word delimited to its class (either positive or negative). The dictionary can be set as a delimited text file.
By default, the sentiment score will be assigned the label "positive" if the score is >= 0, else "negative".
To retrieve the raw sentiment scores, enableScore needs to be set to true (see the sketch below).
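For instance, a sketch of the raw-score variant in Python (dictionary path and column names as in the example below):
sentimentDetector = SentimentDetector() \
.setInputCols(["lemma", "document"]) \
.setOutputCol("sentimentScore") \
.setDictionary("default-sentiment-dict.txt", ",", ReadAs.TEXT) \
.setEnableScore(True)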
For extended examples of usage, see the Examples and the SentimentTestSpec.
Input Annotator Types: TOKEN, DOCUMENT
Output Annotator Type: SENTIMENT
Python API: SentimentDetector | Scala API: SentimentDetector | Source: SentimentDetector |
Show Example
# In this example, the dictionary `default-sentiment-dict.txt` has the form of
#
# ...
# cool,positive
# superb,positive
# bad,negative
# uninspired,negative
# ...
#
# where each sentiment keyword is delimited by `","`.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
lemmatizer = Lemmatizer() \
.setInputCols(["token"]) \
.setOutputCol("lemma") \
.setDictionary("lemmas_small.txt", "->", "\t")
sentimentDetector = SentimentDetector() \
.setInputCols(["lemma", "document"]) \
.setOutputCol("sentimentScore") \
.setDictionary("default-sentiment-dict.txt", ",", ReadAs.TEXT)
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
lemmatizer,
sentimentDetector,
])
data = spark.createDataFrame([
["The staff of the restaurant is nice"],
["I recommend others to avoid because it is too expensive"]
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("sentimentScore.result").show(truncate=False)
+----------+ # +------+ for enableScore set to True
|result | # |result|
+----------+ # +------+
|[positive]| # |[1.0] |
|[negative]| # |[-2.0]|
+----------+ # +------+
// In this example, the dictionary `default-sentiment-dict.txt` has the form of
//
// ...
// cool,positive
// superb,positive
// bad,negative
// uninspired,negative
// ...
//
// where each sentiment keyword is delimited by `","`.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotators.Lemmatizer
import com.johnsnowlabs.nlp.annotators.sda.pragmatic.SentimentDetector
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val lemmatizer = new Lemmatizer()
.setInputCols("token")
.setOutputCol("lemma")
.setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
val sentimentDetector = new SentimentDetector()
.setInputCols("lemma", "document")
.setOutputCol("sentimentScore")
.setDictionary("src/test/resources/sentiment-corpus/default-sentiment-dict.txt", ",", ReadAs.TEXT)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
lemmatizer,
sentimentDetector,
))
val data = Seq(
"The staff of the restaurant is nice",
"I recommend others to avoid because it is too expensive"
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("sentimentScore.result").show(false)
+----------+ // +------+ for enableScore set to true
|result | // |result|
+----------+ // +------+
|[positive]| // |[1.0] |
|[negative]| // |[-2.0]|
+----------+ // +------+
Stemmer
Returns hard stems of words, with the objective of retrieving the meaningful part of each word. For extended examples of usage, see the Examples.
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: Stemmer | Scala API: Stemmer | Source: Stemmer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
stemmer = Stemmer() \
.setInputCols(["token"]) \
.setOutputCol("stem")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
stemmer
])
data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
.toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("stem.result").show(truncate = False)
+-------------------------------------------------------------+
|result |
+-------------------------------------------------------------+
|[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
+-------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Stemmer, Tokenizer}
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val stemmer = new Stemmer()
.setInputCols("token")
.setOutputCol("stem")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
stemmer
))
val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("stem.result").show(truncate = false)
+-------------------------------------------------------------+
|result |
+-------------------------------------------------------------+
|[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
+-------------------------------------------------------------+
StopWordsCleaner
This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences.
By default, it uses stop words from MLlib's StopWordsRemover.
Stop words can also be defined by explicitly setting them with setStopWords(value: Array[String]), or loaded from pretrained models using pretrained of its companion object.
val stopWords = StopWordsCleaner.pretrained()
.setInputCols("token")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
// will load the default pretrained model `"stopwords_en"`.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and StopWordsCleanerTestSpec.
NOTE: If you need to set the stop words from a text file, you can first read and convert it into an array of strings and pass it to setStopWords as follows.
# your stop words text file, each line is one stop word
stopwords = sc.textFile("/tmp/stopwords/english.txt").collect()
# simply use it in StopWordsCleaner
stopWordsCleaner = StopWordsCleaner()\
.setInputCols("token")\
.setOutputCol("cleanTokens")\
.setStopWords(stopwords)\
.setCaseSensitive(False)
# or you can use pretrained models for StopWordsCleaner
stopWordsCleaner = StopWordsCleaner.pretrained() \
.setInputCols("token")\
.setOutputCol("cleanTokens")\
.setCaseSensitive(False)
// your stop words text file, each line is one stop word
val stopwords = sc.textFile("/tmp/stopwords/english.txt").collect()
// simply use it in StopWordsCleaner
val stopWordsCleaner = new StopWordsCleaner()
.setInputCols("token")
.setOutputCol("cleanTokens")
.setStopWords(stopwords)
.setCaseSensitive(false)
// or you can use pretrained models for StopWordsCleaner
val stopWordsCleaner = StopWordsCleaner.pretrained()
.setInputCols("token")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: StopWordsCleaner | Scala API: StopWordsCleaner | Source: StopWordsCleaner |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
stopWords = StopWordsCleaner() \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
tokenizer,
stopWords
])
data = spark.createDataFrame([
["This is my first sentence. This is my second."],
["This is my third sentence. This is my forth."]
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("cleanTokens.result").show(truncate=False)
+-------------------------------+
|result |
+-------------------------------+
|[first, sentence, ., second, .]|
|[third, sentence, ., forth, .] |
+-------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.StopWordsCleaner
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val stopWords = new StopWordsCleaner()
.setInputCols("token")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
stopWords
))
val data = Seq(
"This is my first sentence. This is my second.",
"This is my third sentence. This is my forth."
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("cleanTokens.result").show(false)
+-------------------------------+
|result |
+-------------------------------+
|[first, sentence, ., second, .]|
|[third, sentence, ., forth, .] |
+-------------------------------+
SymmetricDelete Spellchecker
Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.
Inspired by SymSpell.
For instantiated/pretrained models, see SymmetricDeleteModel.
See SymmetricDeleteModelTestSpec for further reference.
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: SymmetricDeleteApproach | Scala API: SymmetricDeleteApproach | Source: SymmetricDeleteApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, the dictionary `"words.txt"` has the form of
#
# ...
# gummy
# gummic
# gummier
# gummiest
# gummiferous
# ...
#
# This dictionary is then set to be the basis of the spell checker.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
spellChecker = SymmetricDeleteApproach() \
.setInputCols(["token"]) \
.setOutputCol("spell") \
.setDictionary("src/test/resources/spell/words.txt")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
pipelineModel = pipeline.fit(trainingData)
// In this example, the dictionary `"words.txt"` has the form of
//
// ...
// gummy
// gummic
// gummier
// gummiest
// gummiferous
// ...
//
// This dictionary is then set to be the basis of the spell checker.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.symmetric.SymmetricDeleteApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val spellChecker = new SymmetricDeleteApproach()
.setInputCols("token")
.setOutputCol("spell")
.setDictionary("src/test/resources/spell/words.txt")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
spellChecker
))
val pipelineModel = pipeline.fit(trainingData)
TextMatcher
Annotator to match exact phrases (by token) provided in a file against a Document.
A text file of predefined phrases must be provided with setEntities.
For extended examples of usage, see the Examples and the TextMatcherTestSpec.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: TextMatcher | Scala API: TextMatcher | Source: TextMatcher |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, the entities file is of the form
#
# ...
# dolore magna aliqua
# lorem ipsum dolor. sit
# laborum
# ...
#
# where each line represents an entity phrase to be extracted.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
entityExtractor = TextMatcher() \
.setInputCols(["document", "token"]) \
.setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
.setOutputCol("entity") \
.setCaseSensitive(False)
pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity) as result").show(truncate=False)
+------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []] |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []] |
+------------------------------------------------------------------------------------------+
// In this example, the entities file is of the form
//
// ...
// dolore magna aliqua
// lorem ipsum dolor. sit
// laborum
// ...
//
// where each line represents an entity phrase to be extracted.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.TextMatcher
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
val entityExtractor = new TextMatcher()
.setInputCols("document", "token")
.setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
.setOutputCol("entity")
.setCaseSensitive(false)
.setTokenizer(tokenizer.fit(data))
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity) as result").show(false)
+------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []] |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []] |
+------------------------------------------------------------------------------------------+
Token2Chunk
Converts TOKEN type Annotations to CHUNK type.
This can be useful if entities have already been extracted as TOKEN and following annotators require CHUNK types.
Input Annotator Types: TOKEN
Output Annotator Type: CHUNK
Python API: Token2Chunk | Scala API: Token2Chunk | Source: Token2Chunk |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
token2chunk = Token2Chunk() \
.setInputCols(["token"]) \
.setOutputCol("chunk")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
token2chunk
])
data = spark.createDataFrame([["One Two Three Four"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk) as result").show(truncate=False)
+------------------------------------------+
|result |
+------------------------------------------+
|[chunk, 0, 2, One, [sentence -> 0], []] |
|[chunk, 4, 6, Two, [sentence -> 0], []] |
|[chunk, 8, 12, Three, [sentence -> 0], []]|
|[chunk, 14, 17, Four, [sentence -> 0], []]|
+------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Token2Chunk, Tokenizer}
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val token2chunk = new Token2Chunk()
.setInputCols("token")
.setOutputCol("chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
token2chunk
))
val data = Seq("One Two Three Four").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk) as result").show(false)
+------------------------------------------+
|result |
+------------------------------------------+
|[chunk, 0, 2, One, [sentence -> 0], []] |
|[chunk, 4, 6, Two, [sentence -> 0], []] |
|[chunk, 8, 12, Three, [sentence -> 0], []]|
|[chunk, 14, 17, Four, [sentence -> 0], []]|
+------------------------------------------+
TokenAssembler
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.
Requires DOCUMENT and TOKEN type annotations as input.
For more extended examples on document pre-processing see the Examples.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: DOCUMENT
Python API: TokenAssembler | Scala API: TokenAssembler | Source: TokenAssembler |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# First, the text is tokenized and cleaned
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentences")
tokenizer = Tokenizer() \
.setInputCols(["sentences"]) \
.setOutputCol("token")
normalizer = Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized") \
.setLowercase(False)
stopwordsCleaner = StopWordsCleaner() \
.setInputCols(["normalized"]) \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
# Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
tokenAssembler = TokenAssembler() \
.setInputCols(["sentences", "cleanTokens"]) \
.setOutputCol("cleanText")
data = spark.createDataFrame([["Spark NLP is an open-source text processing library for advanced natural language processing."]]) \
.toDF("text")
pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
tokenizer,
normalizer,
stopwordsCleaner,
tokenAssembler
]).fit(data)
result = pipeline.transform(data)
result.select("cleanText").show(truncate=False)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner}
import com.johnsnowlabs.nlp.TokenAssembler
import org.apache.spark.ml.Pipeline
// First, the text is tokenized and cleaned
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
.setLowercase(false)
val stopwordsCleaner = new StopWordsCleaner()
.setInputCols("normalized")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
// Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
val tokenAssembler = new TokenAssembler()
.setInputCols("sentences", "cleanTokens")
.setOutputCol("cleanText")
val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.")
.toDF("text")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
normalizer,
stopwordsCleaner,
tokenAssembler
)).fit(data)
val result = pipeline.transform(data)
result.select("cleanText").show(false)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+
Tokenizer
Tokenizes raw text in document type columns into TokenizedSentence.
This class represents a non-fitted tokenizer. Fitting it will cause the internal RuleFactory to construct the rules for tokenizing from the input configuration.
Identifies tokens using open tokenization standards. A few rules can be customized if the defaults do not fit your needs.
For extended examples of usage, see the Examples and the Tokenizer test class.
Input Annotator Types: DOCUMENT
Output Annotator Type: TOKEN
Note: All these APIs receive regular expressions so please make sure that you escape special characters according to Java conventions.
Python API: Tokenizer | Scala API: Tokenizer | Source: Tokenizer |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
data = spark.createDataFrame([["I'd like to say we didn't expect that. Jane's boyfriend."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)
pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
result = pipeline.transform(data)
result.selectExpr("token.result").show(truncate=False)
+-----------------------------------------------------------------------+
|output |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
val data = Seq("I'd like to say we didn't expect that. Jane's boyfriend.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val tokenizer = new Tokenizer().setInputCols("document").setOutputCol("token").fit(data)
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer)).fit(data)
val result = pipeline.transform(data)
result.selectExpr("token.result").show(false)
+-----------------------------------------------------------------------+
|output |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+
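Since the rule parameters are regular expressions (see the note above), special characters must be escaped following Java conventions. The following is a minimal Python sketch, not part of the original example, showing a few of the customization setters; the parameter values are purely illustrative.
# Illustrative customization of the Tokenizer from the example above (values are assumptions).
# setSplitPattern takes a Java regular expression, so special characters need escaping.
customTokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setSplitPattern("\\s+") \
    .setContextChars(["(", ")", "?", "!", ".", ","]) \
    .setExceptions(["e-mail", "New York"]) \
    .fit(data)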
TypedDependencyParser
Labeled parser that finds a grammatical relation between two words in a sentence. Its input is either a CoNLL2009 or ConllU dataset.
For instantiated/pretrained models, see TypedDependencyParserModel.
Dependency parsers provide information about word relationships. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions.
The parser requires the dependent tokens beforehand, from e.g. the DependencyParser. The required training data can be set in two different ways (only one can be chosen for a particular model):
- Dataset in the CoNLL 2009 format, set with setConll2009
- Dataset in the CoNLL-U format, set with setConllU
Apart from that, no additional training data is needed.
See TypedDependencyParserApproachTestSpec for further reference on this API.
Input Annotator Types: TOKEN, POS, DEPENDENCY
Output Annotator Type: LABELED_DEPENDENCY
Python API: TypedDependencyParserApproach | Scala API: TypedDependencyParserApproach | Source: TypedDependencyParserApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
posTagger = PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
dependencyParser = DependencyParserModel.pretrained() \
.setInputCols(["sentence", "pos", "token"]) \
.setOutputCol("dependency")
typedDependencyParser = TypedDependencyParserApproach() \
.setInputCols(["dependency", "pos", "token"]) \
.setOutputCol("dependency_type") \
.setConllU("src/test/resources/parser/labeled/train_small.conllu.txt") \
.setNumberOfIterations(1)
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParser,
typedDependencyParser
])
# Additional training data is not needed, the dependency parser relies on CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("pos")
val dependencyParser = DependencyParserModel.pretrained()
.setInputCols("sentence", "pos", "token")
.setOutputCol("dependency")
val typedDependencyParser = new TypedDependencyParserApproach()
.setInputCols("dependency", "pos", "token")
.setOutputCol("dependency_type")
.setConllU("src/test/resources/parser/labeled/train_small.conllu.txt")
.setNumberOfIterations(1)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParser,
typedDependencyParser
))
// Additional training data is not needed, the dependency parser relies on CoNLL-U only.
val emptyDataSet = Seq.empty[String].toDF("text")
val pipelineModel = pipeline.fit(emptyDataSet)
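As a follow-up not shown in the original example, the fitted pipelineModel can be applied to real text. Below is a minimal Python sketch, assuming the pipelineModel trained above; the numeric struct field names of arrays_zip ("0", "1", "2") follow the convention used elsewhere on this page and may differ between Spark versions.
# Minimal usage sketch (assumption): pair each token with its head and dependency label.
data = spark.createDataFrame([["Dependencies represent relationships between tokens in a sentence"]]) \
    .toDF("text")
result = pipelineModel.transform(data)
result.selectExpr("explode(arrays_zip(token.result, dependency.result, dependency_type.result)) as cols") \
    .selectExpr("cols['0'] as token", "cols['1'] as head", "cols['2'] as label") \
    .show(truncate=False)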
ViveknSentiment
Trains a sentiment analyser inspired by the algorithm by Vivek Narayanan https://github.com/vivekn/sentiment/.
The algorithm is based on the paper “Fast and accurate sentiment classification using an enhanced Naive Bayes model”.
The analyzer requires sentence boundaries to give a score in context, and tokenization to make sure tokens are within bounds. Transitivity requirements are also needed.
The training data needs to consist of a column of normalized text and a label column (either "positive" or "negative").
For extended examples of usage, see the Examples and the ViveknSentimentTestSpec.
Input Annotator Types: TOKEN, DOCUMENT
Output Annotator Type: SENTIMENT
Python API: ViveknSentimentApproach | Scala API: ViveknSentimentApproach | Source: ViveknSentimentApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
document = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
token = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normal")
vivekn = ViveknSentimentApproach() \
.setInputCols(["document", "normal"]) \
.setSentimentCol("train_sentiment") \
.setOutputCol("result_sentiment")
finisher = Finisher() \
.setInputCols(["result_sentiment"]) \
.setOutputCols("final_sentiment")
pipeline = Pipeline().setStages([document, token, normalizer, vivekn, finisher])
training = spark.createDataFrame([
("I really liked this movie!", "positive"),
("The cast was horrible", "negative"),
("Never going to watch this again or recommend it to anyone", "negative"),
("It's a waste of time", "negative"),
("I loved the protagonist", "positive"),
("The music was really really good", "positive")
]).toDF("text", "train_sentiment")
pipelineModel = pipeline.fit(training)
data = spark.createDataFrame([
["I recommend this movie"],
["Dont waste your time!!!"]
]).toDF("text")
result = pipelineModel.transform(data)
result.select("final_sentiment").show(truncate=False)
+---------------+
|final_sentiment|
+---------------+
|[positive] |
|[negative] |
+---------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.Normalizer
import com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach
import com.johnsnowlabs.nlp.Finisher
import org.apache.spark.ml.Pipeline
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val token = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normal")
val vivekn = new ViveknSentimentApproach()
.setInputCols("document", "normal")
.setSentimentCol("train_sentiment")
.setOutputCol("result_sentiment")
val finisher = new Finisher()
.setInputCols("result_sentiment")
.setOutputCols("final_sentiment")
val pipeline = new Pipeline().setStages(Array(document, token, normalizer, vivekn, finisher))
val training = Seq(
("I really liked this movie!", "positive"),
("The cast was horrible", "negative"),
("Never going to watch this again or recommend it to anyone", "negative"),
("It's a waste of time", "negative"),
("I loved the protagonist", "positive"),
("The music was really really good", "positive")
).toDF("text", "train_sentiment")
val pipelineModel = pipeline.fit(training)
val data = Seq(
"I recommend this movie",
"Dont waste your time!!!"
).toDF("text")
val result = pipelineModel.transform(data)
result.select("final_sentiment").show(false)
+---------------+
|final_sentiment|
+---------------+
|[positive] |
|[negative] |
+---------------+
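A pretrained counterpart is also available as ViveknSentimentModel. The following is a minimal Python sketch (an assumption, not part of the original example), reusing the stages defined in the Python example above and assuming the default English model can be downloaded in your environment.
# Minimal sketch (assumption): swap the trainable approach for the public pretrained model.
pretrainedVivekn = ViveknSentimentModel.pretrained() \
    .setInputCols(["document", "normal"]) \
    .setOutputCol("result_sentiment")
pretrainedPipeline = Pipeline().setStages([document, token, normalizer, pretrainedVivekn, finisher])
pretrainedPipeline.fit(data).transform(data).select("final_sentiment").show(truncate=False)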
Word2Vec
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in the vocabulary. The vector representations can be used as features in natural language processing and machine learning algorithms.
We use the Word2Vec implementation from Spark ML, which trains the model with the skip-gram architecture and hierarchical softmax. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Word2VecModel.
Sources :
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Input Annotator Types: TOKEN
Output Annotator Type: WORD_EMBEDDINGS
Python API: Word2VecApproach | Scala API: Word2VecApproach | Source: Word2VecApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = Word2VecApproach() \
.setInputCols(["token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings
])
path = "sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")
pipelineModel = pipeline.fit(dataset)
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Word2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = new Word2VecApproach()
.setInputCols("token")
.setOutputCol("embeddings")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings
))
val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
.toDF("text")
val pipelineModel = pipeline.fit(dataset)
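A brief usage sketch, not part of the original snippet: once fitted, the pipeline produces WORD_EMBEDDINGS annotations whose vectors can be inspected per token. The Python code below assumes the pipelineModel fitted in the Python example above.
# Minimal usage sketch (assumption): inspect the learned per-token vectors.
sample = spark.createDataFrame([["Sherlock Holmes solved the case"]]).toDF("text")
transformed = pipelineModel.transform(sample)
transformed.selectExpr("explode(embeddings) as embedding") \
    .selectExpr("embedding.result as token", "embedding.embeddings as vector") \
    .show(truncate=False)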
WordEmbeddings
Word Embeddings lookup annotator that maps tokens to vectors.
For instantiated/pretrained models, see WordEmbeddingsModel.
A custom token lookup dictionary for embeddings can be set with setStoragePath. Each line of the provided file needs to contain a token, followed by its vector representation, delimited by spaces.
...
are 0.39658191506190343 0.630968081620067 0.5393722253731201 0.8428180123359783
were 0.7535235923631415 0.9699218875629833 0.10397182122983872 0.11833962569383116
stress 0.0492683418305907 0.9415954572751959 0.47624463167525755 0.16790967216778263
induced 0.1535748762292387 0.33498936903209897 0.9235178224122094 0.1158772920395934
...
If a token is not found in the dictionary, then the result will be a zero vector of the same dimension.
Statistics about the rate of converted tokens can be retrieved with WordEmbeddingsModel.withCoverageColumn and WordEmbeddingsModel.overallCoverage.
For extended examples of usage, see the Examples and the WordEmbeddingsTestSpec.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: WORD_EMBEDDINGS
Python API: WordEmbeddings | Scala API: WordEmbeddings | Source: WordEmbeddings |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# In this example, the file `random_embeddings_dim4.txt` has the form of the content above.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = WordEmbeddings() \
.setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT) \
.setStorageRef("glove_4d") \
.setDimension(4) \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])
data = spark.createDataFrame([["The patient was diagnosed with diabetes."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(truncate=False)
+----------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316] |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307] |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048] |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149] |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938] |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863] |
+----------------------------------------------------------------------------------+
// In this example, the file `random_embeddings_dim4.txt` has the form of the content above.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddings
import com.johnsnowlabs.nlp.util.io.ReadAs
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = new WordEmbeddings()
.setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT)
.setStorageRef("glove_4d")
.setDimension(4)
.setInputCols("document", "token")
.setOutputCol("embeddings")
val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
))
val data = Seq("The patient was diagnosed with diabetes.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(false)
+----------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316] |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307] |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048] |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149] |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938] |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863] |
+----------------------------------------------------------------------------------+
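The coverage utilities mentioned above can be applied to the transformed DataFrame. A minimal Python sketch, assuming the result DataFrame from the example; the output column name is illustrative.
# Minimal sketch (assumption): report how many tokens were found in the lookup dictionary.
from sparknlp.annotator import WordEmbeddingsModel

withCoverage = WordEmbeddingsModel.withCoverageColumn(result, "embeddings", "coverage")
withCoverage.select("coverage").show(truncate=False)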
WordSegmenter
Trains a WordSegmenter which tokenizes non-English or non-whitespace-separated texts.
Many languages are not whitespace-separated, and their sentences are a concatenation of many symbols, such as Korean, Japanese or Chinese. Without understanding the language, splitting the words into their corresponding tokens is impossible. The WordSegmenter is trained to understand these languages and split them into semantically correct parts.
This annotator is based on the paper Chinese Word Segmentation as Character Tagging [1]. Word segmentation is treated as a tagging problem. Each character is tagged as one of four labels: LL (left boundary), RR (right boundary), MM (middle) and LR (word by itself). The label depends on the position of the character within the word. Characters tagged LL combine with the character on their right; likewise, characters tagged RR combine with the character on their left. MM-tagged characters sit in the middle of a word and combine with either side. LR-tagged characters form words by themselves.
Example (from [1], Example 3(a) (raw), 3(b) (tagged), 3(c) (translation)):
- 上海 计划 到 本 世纪 末 实现 人均 国内 生产 总值 五千 美元
- 上/LL 海/RR 计/LL 划/RR 到/LR 本/LR 世/LL 纪/RR 末/LR 实/LL 现/RR 人/LL 均/RR 国/LL 内/RR 生/LL 产/RR 总/LL 值/RR 五/LL 千/RR 美/LL 元/RR
- Shanghai plans to reach the goal of 5,000 dollars in per capita GDP by the end of the century.
For instantiated/pretrained models, see WordSegmenterModel.
To train your own model, a training dataset consisting of Part-Of-Speech tags is required. The data has to be loaded into a dataframe, where the column is an Annotation of type "POS". This can be set with setPosColumn.
Tip: The helper class POS might be useful to read training data into data frames.
For extended examples of usage, see the Examples and the WordSegmenterTest.
References:
- [1] Xue, Nianwen. “Chinese Word Segmentation as Character Tagging.” International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing, 2003, pp. 29-48. ACLWeb, https://aclanthology.org/O03-4002.
Input Annotator Types: DOCUMENT
Output Annotator Type: TOKEN
Python API: WordSegmenterApproach | Scala API: WordSegmenterApproach | Source: WordSegmenterApproach |
Show Example
# In this example, `"chinese_train.utf8"` is in the form of
#
# 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
#
# and is loaded with the `POS` class to create a dataframe of `"POS"` type Annotations.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
wordSegmenter = WordSegmenterApproach() \
.setInputCols(["document"]) \
.setOutputCol("token") \
.setPosColumn("tags") \
.setNIterations(5)
pipeline = Pipeline().setStages([
documentAssembler,
wordSegmenter
])
trainingDataSet = POS().readDataset(
spark,
"src/test/resources/word-segmenter/chinese_train.utf8"
)
pipelineModel = pipeline.fit(trainingDataSet)
// In this example, `"chinese_train.utf8"` is in the form of
//
// 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
//
// and is loaded with the `POS` class to create a dataframe of `"POS"` type Annotations.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ws.WordSegmenterApproach
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.util.io.ResourceHelper
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val wordSegmenter = new WordSegmenterApproach()
.setInputCols("document")
.setOutputCol("token")
.setPosColumn("tags")
.setNIterations(5)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
wordSegmenter
))
val trainingDataSet = POS().readDataset(
ResourceHelper.spark,
"src/test/resources/word-segmenter/chinese_train.utf8"
)
val pipelineModel = pipeline.fit(trainingDataSet)
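A minimal usage sketch, not in the original snippet: once fitted, the model segments raw text into tokens. The Python code below assumes the pipelineModel fitted in the Python example above; the segmentation quality depends entirely on the training data.
# Minimal usage sketch (assumption): segment raw Chinese text with the fitted pipeline.
sample = spark.createDataFrame([["十四不是四十"]]).toDF("text")
pipelineModel.transform(sample).selectExpr("token.result").show(truncate=False)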
YakeKeywordExtraction
Yake is an unsupervised, corpus-independent, domain- and language-independent, single-document keyword extraction algorithm.
Extracting keywords from texts has become a challenge for individuals and organizations as the information grows in complexity and size. The need to automate this task so that text can be processed in a timely and adequate manner has led to the emergence of automatic keyword extraction tools. Yake is a novel feature-based system for multi-lingual keyword extraction, which supports texts of different sizes, domains or languages. Unlike other approaches, Yake does not rely on dictionaries or thesauri, nor is it trained on any corpus. Instead, it follows an unsupervised approach which builds upon features extracted from the text, thus making it applicable to documents written in different languages without the need for further knowledge. This can be beneficial for a large number of tasks and a plethora of situations where access to training corpora is either limited or restricted. The algorithm makes use of the position of a sentence and token. Therefore, to use the annotator, the text should first be sent through a Sentence Boundary Detector and then a tokenizer.
Note that each keyword will be given a keyword score greater than 0 (the lower the score, the better the keyword). Therefore, to filter the keywords, an upper bound for the score can be set with setThreshold.
For extended examples of usage, see the Examples and the YakeTestSpec.
Sources :
Paper abstract:
As the amount of generated information grows, reading and summarizing texts of large collections turns into a challenging task. Many documents do not come with descriptive terms, thus requiring humans to generate keywords on-the-fly. The need to automate this kind of task demands the development of keyword extraction systems with the ability to automatically identify keywords within the text. One approach is to resort to machine-learning algorithms. These, however, depend on large annotated text corpora, which are not always available. An alternative solution is to consider an unsupervised approach. In this article, we describe YAKE!, a light-weight unsupervised automatic keyword extraction method which rests on statistical text features extracted from single documents to select the most relevant keywords of a text. Our system does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, text size, language, or domain. To demonstrate the merits and significance of YAKE!, we compare it against ten state-of-the-art unsupervised approaches and one supervised method. Experimental results carried out on top of twenty datasets show that YAKE! significantly outperforms other unsupervised methods on texts of different sizes, languages, and domains.
Input Annotator Types: TOKEN
Output Annotator Type: CHUNK
Python API: YakeKeywordExtraction | Scala API: YakeKeywordExtraction | Source: YakeKeywordExtraction |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
token = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token") \
.setContextChars(["(", "]", "?", "!", ".", ","])
keywords = YakeKeywordExtraction() \
.setInputCols(["token"]) \
.setOutputCol("keywords") \
.setThreshold(0.6) \
.setMinNGrams(2) \
.setNKeywords(10)
pipeline = Pipeline().setStages([
documentAssembler,
sentenceDetector,
token,
keywords
])
data = spark.createDataFrame([[
"Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, NaRavikant, Google chie economist Hal Varian, Khosla Ventures and Yuri Milner"
]]).toDF("text")
result = pipeline.fit(data).transform(data)
# combine the result and score (contained in keywords.metadata)
scores = result \
.selectExpr("explode(arrays_zip(keywords.result, keywords.metadata)) as resultTuples") \
.selectExpr("resultTuples['0'] as keyword", "resultTuples['1'].score as score")
# Order ascending, as lower scores means higher importance
scores.orderBy("score").show(5, truncate = False)
+---------------------+-------------------+
|keyword |score |
+---------------------+-------------------+
|google cloud |0.32051516486864573|
|google cloud platform|0.37786450577630676|
|ceo anthony goldbloom|0.39922830978423146|
|san francisco |0.40224744669493756|
|anthony goldbloom |0.41584827825302534|
+---------------------+-------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Tokenizer}
import com.johnsnowlabs.nlp.annotators.keyword.yake.YakeKeywordExtraction
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val token = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
.setContextChars(Array("(", ")", "?", "!", ".", ","))
val keywords = new YakeKeywordExtraction()
.setInputCols("token")
.setOutputCol("keywords")
.setThreshold(0.6f)
.setMinNGrams(2)
.setNKeywords(10)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
token,
keywords
))
val data = Seq(
"Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, Google chief economist Hal Varian, Khosla Ventures and Yuri Milner"
).toDF("text")
val result = pipeline.fit(data).transform(data)
// combine the result and score (contained in keywords.metadata)
val scores = result
.selectExpr("explode(arrays_zip(keywords.result, keywords.metadata)) as resultTuples")
.select($"resultTuples.0" as "keyword", $"resultTuples.1.score")
// Order ascending, as lower scores means higher importance
scores.orderBy("score").show(5, truncate = false)
+---------------------+-------------------+
|keyword |score |
+---------------------+-------------------+
|google cloud |0.32051516486864573|
|google cloud platform|0.37786450577630676|
|ceo anthony goldbloom|0.39922830978423146|
|san francisco |0.40224744669493756|
|anthony goldbloom |0.41584827825302534|
+---------------------+-------------------+
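Besides setThreshold, the keywords can also be filtered after extraction. A minimal Python sketch using the scores DataFrame computed above; the cutoff value is illustrative, and the cast is needed because scores are stored as strings in the annotation metadata.
# Minimal sketch (assumption): keep only keywords below an illustrative score cutoff.
from pyspark.sql.functions import col

filtered = scores.filter(col("score").cast("double") < 0.45)
filtered.orderBy(col("score").cast("double")).show(truncate=False)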