Annotators#
Annotators are the spearhead of NLP functions in Spark NLP. Let's take the ClassifierDL
annotators as an example. There are two forms of annotators:
Annotator Approaches#
Annotator Approaches are those that represent a Spark ML Estimator and require a training stage.
They have a function called fit(data), which trains a model on the given data. Fitting produces
the second type of annotator: an annotator model, or transformer.
Example
First we need to declare all the prerequisite steps and the training data:
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
... .setInputCol("text") \
... .setOutputCol("document")
>>> useEmbeddings = UniversalSentenceEncoder.pretrained() \
... .setInputCols(["document"]) \
... .setOutputCol("sentence_embeddings")
In this example, the training data "sentiment.csv" has the following form:
text,label
This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
This was a terrible movie! The acting was bad really bad!,1
...
and will be loaded with Spark:
>>> smallCorpus = spark.read.option("header","True").csv("src/test/resources/classifier/sentiment.csv")
Then we declare the ClassifierDLApproach that is going to be trained in the pipeline.
Note that in this case the annotator also requires a label column, set with
setLabelColumn("label"), to classify the text.
>>> docClassifier = ClassifierDLApproach() \
... .setInputCols(["sentence_embeddings"]) \
... .setOutputCol("category") \
... .setLabelColumn("label") \
... .setBatchSize(64) \
... .setMaxEpochs(20) \
... .setLr(5e-3) \
... .setDropout(0.5)
>>> pipeline = Pipeline().setStages([
... documentAssembler,
... useEmbeddings,
... docClassifier
... ])
Finally, the pipeline is fit to the data and the annotator is trained:
>>> pipelineModel = pipeline.fit(smallCorpus)
The result is a PipelineModel that can be used with transform(data) to classify sentiment.
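For illustration, the trained model can be applied back to the training corpus (a minimal sketch; in practice you would transform a separate, held-out dataset). Selecting "category.result" extracts the predicted label from each annotation:

>>> result = pipelineModel.transform(smallCorpus)
>>> result.select("text", "category.result").show(truncate=False)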
Annotator Models#
Annotator Models are Spark models or transformers, meaning they have a transform(data)
function. This function takes a DataFrame as input and adds a new column containing the result
of the current annotation. All transformers are additive, meaning they append to the current
data and never replace or delete previous information.
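For example, transforming the training corpus with the PipelineModel from the previous example keeps the original text and label columns and appends one column per annotator (a sketch of the expected schema):

>>> annotated = pipelineModel.transform(smallCorpus)
>>> annotated.columns
['text', 'label', 'document', 'sentence_embeddings', 'category']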
Both forms of annotators can be included in a Pipeline. All annotators included in a Pipeline
will be automatically executed in the defined order and will transform the data accordingly.
A Pipeline is turned into a PipelineModel after the fit() stage. The pipeline can be saved to
disk and re-loaded at any time.
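For instance, a fitted PipelineModel can be persisted and restored with the standard Spark ML writer and loader (the path here is only a placeholder):

>>> from pyspark.ml import PipelineModel
>>> pipelineModel.write().overwrite().save("./sentiment_pipeline_model")
>>> loadedModel = PipelineModel.load("./sentiment_pipeline_model")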
Note#
The Model suffix is explicitly stated when the annotator is the result of a training process.
Some annotators, such as the Tokenizer, are transformers but do not contain the word Model,
since they are not trained annotators.
Example
First we need to declare all the prerequisite steps:
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
... .setInputCol("text") \
... .setOutputCol("document")
>>> sentence = SentenceDetector() \
... .setInputCols("document") \
... .setOutputCol("sentence")
>>> useEmbeddings = UniversalSentenceEncoder.pretrained() \
... .setInputCols("document") \
... .setOutputCol("sentence_embeddings")
Here we use a pretrained ClassifierDLModel. The classifier trained in the previous example
could also be inserted here instead.
>>> sarcasmDL = ClassifierDLModel.pretrained("classifierdl_use_sarcasm") \
... .setInputCols("sentence_embeddings") \
... .setOutputCol("sarcasm")
>>> pipeline = Pipeline().setStages([
... documentAssembler,
... sentence,
... useEmbeddings,
... sarcasmDL
... ])
Then we can create some data to classify and use transform(data) to get the results.
>>> data = spark.createDataFrame([
... ["I'm ready!"],
... ["If I could put into words how much I love waking up at 6 am on Mondays I would."]
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(arrays_zip(sentence, sarcasm)) as out") \
... .selectExpr("out.sentence.result as sentence", "out.sarcasm.result as sarcasm") \
... .show(truncate=False)
+-------------------------------------------------------------------------------+-------+
|sentence |sarcasm|
+-------------------------------------------------------------------------------+-------+
|I'm ready! |normal |
|If I could put into words how much I love waking up at 6 am on Mondays I would.|sarcasm|
+-------------------------------------------------------------------------------+-------+
Pretrained Models#
Model annotators have a static pretrained() method to retrieve the public pretrained version
of a model.
>>> import sparknlp
>>> from sparknlp.annotator import *
>>> classifierDL = ClassifierDLModel.pretrained() \
... .setInputCols(["sentence_embeddings"]) \
... .setOutputCol("classification")
Called without arguments, pretrained(name, language, extra_location) will download the default
pretrained model for the annotator. Some annotators offer more than one model, in which case
you may have to specify the name, language, or extra location to download a particular one.
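For example, one of the models listed below can be requested explicitly by name and language (values taken from the output of showPublicModels(); the output column name is arbitrary):

>>> spamClassifier = ClassifierDLModel.pretrained("classifierdl_use_spam", "en") \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCol("spam")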
For a complete list of available pretrained models, head to the Spark NLP Models page.
Alternatively, you can check for the pretrained models of a particular annotator using
ResourceDownloader.showPublicModels().
>>> from sparknlp.pretrained import ResourceDownloader
>>> ResourceDownloader.showPublicModels("ClassifierDLModel", "en")
+-------------------------+------+---------+
| Model                   | lang | version |
+-------------------------+------+---------+
| classifierdl_use_trec6  | en   | 2.5.0   |
| classifierdl_use_trec50 | en   | 2.5.0   |
| classifierdl_use_spam   | en   | 2.5.3   |
| ...                     | en   | ...     |
+-------------------------+------+---------+
Common Functions#
setInputCols(column_names)
Takes a list of column names of annotations required by this annotator. Those are generated by the annotators which precede the current annotator in the pipeline.
setOutputCol(column_name)
Defines the name of the column containing the result of the current annotator. Use this name as an input for other annotators down the pipeline requiring the outputs generated by the current annotator.
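As a brief illustration of how these two functions connect annotators, the output column name of one annotator is listed among the input columns of the next (a minimal sketch using the Tokenizer):

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")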
Available Annotators#
For all available Annotators, refer to the full API reference sparknlp.annotator.