Spark NLP Annotation functions
The functions presented here help users manipulate annotations by providing both UDFs and DataFrame utilities that make them easier to work with.
Python
In Python, the functions are straightforward and come in both UDF and DataFrame variants.
map_annotations(f, output_type: DataType)
UDF that applies f(). Requires the output DataType from pyspark.sql.types.

map_annotations_strict(f)
UDF that applies an f() that returns a list of Annotations.

map_annotations_col(dataframe: DataFrame, f, column: str, output_column: str, annotatyon_type: str, output_type: DataType = Annotation.arrayType())
Applies f() to column from dataframe.

map_annotations_cols(dataframe: DataFrame, f, columns: str, output_column: str, annotatyon_type: str, output_type: DataType = Annotation.arrayType())
Applies f() to columns from dataframe.

filter_by_annotations_col(dataframe, f, column)
Applies a boolean filter f() to column from dataframe.

explode_annotations_col(dataframe: DataFrame, column, output_column)
Explodes the annotation column from dataframe into one row per annotation.
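The mapping functions above all take a user function f that receives a list of Annotation objects. The sketch below shows the shape such an f() must have, using a simplified stand-in for sparknlp.annotation.Annotation (an assumption made so the example is self-contained; in real code, import the class from sparknlp.annotation instead):

```python
from dataclasses import dataclass, field

# Simplified stand-in for sparknlp.annotation.Annotation (assumption:
# the real class carries annotator_type, begin, end, result, metadata
# and embeddings, but is constructed by Spark NLP, not by hand).
@dataclass
class Annotation:
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)
    embeddings: list = field(default_factory=list)

def uppercase_results(annotations):
    # The kind of f() passed to map_annotations_strict: it receives a
    # list of Annotations and must return a list of Annotations.
    return [
        Annotation(a.annotator_type, a.begin, a.end, a.result.upper(),
                   a.metadata, a.embeddings)
        for a in annotations
    ]

tokens = [Annotation("token", 0, 4, "spark"), Annotation("token", 6, 8, "nlp")]
print([a.result for a in uppercase_results(tokens)])  # → ['SPARK', 'NLP']
```

Functions passed to map_annotations instead may return any value, as long as the matching output DataType is supplied.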
Scala
In Scala, importing the inner functions brings in implicits that allow these functions to be applied directly on a DataFrame.
mapAnnotations(function: Seq[Annotation] => T, outputType: DataType)
mapAnnotationsStrict(function: Seq[Annotation] => Seq[Annotation])
mapAnnotationsCol[T: TypeTag](column: String, outputCol: String, annotatorType: String, function: Seq[Annotation] => T)
mapAnnotationsCol[T: TypeTag](cols: Seq[String], outputCol: String, annotatorType: String, function: Seq[Annotation] => T)
eachAnnotationsCol[T: TypeTag](column: String, function: Seq[Annotation] => Unit)
explodeAnnotationsCol[T: TypeTag](column: String, outputCol: String)
Imports:

Python:
from sparknlp.functions import *
from sparknlp.annotation import Annotation

Scala:
import com.johnsnowlabs.nlp.functions._
import com.johnsnowlabs.nlp.Annotation
Examples:
Complete usage examples can be seen here: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/234-release-candidate/jupyter/annotation/english/spark-nlp-basics/spark-nlp-basics-functions.ipynb
def my_annotation_map_function(annotations):
    return list(map(lambda a: Annotation(
        'my_own_type',
        a.begin,
        a.end,
        a.result,
        {'my_key': 'custom_annotation_data'},
        []), annotations))

result.select(
    map_annotations(my_annotation_map_function, Annotation.arrayType())('token')
).toDF("my output").show(truncate=False)
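explode_annotations_col turns one row holding a list of annotations into one row per annotation, mirroring Spark's explode. Its effect can be sketched in plain Python over dicts (a simplification for illustration; the real function operates on a Spark DataFrame column):

```python
def explode(rows, column):
    # Each row is a dict; `column` holds a list of annotation values.
    # One output row is produced per element, like Spark's explode().
    out = []
    for row in rows:
        for ann in row[column]:
            exploded = dict(row)      # copy the other columns unchanged
            exploded[column] = ann    # replace the list with one element
            out.append(exploded)
    return out

rows = [{"id": 1, "token": ["spark", "nlp"]}, {"id": 2, "token": ["rocks"]}]
print(explode(rows, "token"))
# → [{'id': 1, 'token': 'spark'}, {'id': 1, 'token': 'nlp'}, {'id': 2, 'token': 'rocks'}]
```

filter_by_annotations_col is analogous: the boolean f() decides, per row, whether the row is kept.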
val modified = data.mapAnnotationsCol("pos", "mod_pos", "pos", (_: Seq[Annotation]) => {
  "hello world"
})