Annotation

The basic result of a Spark NLP operation is an annotation. Its structure includes:

annotatorType: the type of annotator that generated the current annotation
begin: the index of the first character of the matched content, relative to the raw text
end: the index of the last character of the matched content (inclusive), relative to the raw text
result: the main output of the annotation
metadata: a map holding the content of the matched result and additional information
embeddings: (new in 2.0) contains vector mappings if required
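
As a rough illustration of the fields listed above, the following sketch models an annotation as a plain Python dataclass. `AnnotationSketch` is a hypothetical stand-in written for this example, not the class Spark NLP itself uses, and the field types are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotationSketch:
    """Hypothetical stand-in that mirrors the six annotation fields."""
    annotatorType: str                 # annotator that produced the annotation
    begin: int                         # first character of the match in the raw text
    end: int                           # last character of the match in the raw text
    result: str                        # main output, e.g. a part-of-speech tag
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)


# A part-of-speech annotation for the token "We" at the start of a sentence.
we = AnnotationSketch("pos", 0, 1, "PRP", {"word": "We"})
print(we.result, we.metadata["word"])
```

Leaving `embeddings` empty by default reflects that the field is only filled in when vectors are required.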

This object is generated automatically by annotators after a transform process. No manual work is required. However, it is important to understand the structure of an annotation clearly in order to use it efficiently.

For example, the annotations produced by a pretrained pipeline could look like this:

>>> import sparknlp
>>> spark = sparknlp.start()
>>> from sparknlp.pretrained import PretrainedPipeline
>>> explain_document_pipeline = PretrainedPipeline("explain_document_ml")
explain_document_ml download started this may take some time.
Approx size to download 9.1 MB
[OK!]
>>> data = spark.createDataFrame([["We are very happy about Spark NLP"]]).toDF("text")
>>> result = explain_document_pipeline.model.transform(data).selectExpr("explode(pos)")
>>> result.show(truncate=False)
+---------------------------------------+
|col                                    |
+---------------------------------------+
|[pos, 0, 1, PRP, [word -> We], []]     |
|[pos, 3, 5, VBP, [word -> are], []]    |
|[pos, 7, 10, RB, [word -> very], []]   |
|[pos, 12, 16, JJ, [word -> happy], []] |
|[pos, 18, 22, IN, [word -> about], []] |
|[pos, 24, 28, NNP, [word -> Spark], []]|
|[pos, 30, 32, NNP, [word -> NLP], []]  |
+---------------------------------------+
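
To make the offsets in this output concrete, here is a small sketch in plain Python, with no Spark session involved. It assumes each row can be read as a tuple in the order `(annotatorType, begin, end, result, metadata, embeddings)`; the `rows` list is copied by hand from the table above and `covered_text` is a hypothetical helper.

```python
text = "We are very happy about Spark NLP"

# Rows copied by hand from the output above:
# (annotatorType, begin, end, result, metadata, embeddings)
rows = [
    ("pos", 0, 1, "PRP", {"word": "We"}, []),
    ("pos", 3, 5, "VBP", {"word": "are"}, []),
    ("pos", 7, 10, "RB", {"word": "very"}, []),
    ("pos", 12, 16, "JJ", {"word": "happy"}, []),
    ("pos", 18, 22, "IN", {"word": "about"}, []),
    ("pos", 24, 28, "NNP", {"word": "Spark"}, []),
    ("pos", 30, 32, "NNP", {"word": "NLP"}, []),
]


def covered_text(raw, begin, end):
    """Return the text an annotation covers; `end` is inclusive, so add 1."""
    return raw[begin:end + 1]


# Pair each covered word with its part-of-speech tag (the `result` field).
tagged = [(covered_text(text, begin, end), tag)
          for _, begin, end, tag, _, _ in rows]
print(tagged)
```

Because `end` points at the last character of the match rather than one past it, the slice needs `end + 1` to recover the same word that appears in `metadata`.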