sparknlp.base.embeddings_finisher

Contains classes for the EmbeddingsFinisher.

Module Contents

Classes

EmbeddingsFinisher

Extracts embeddings from Annotations into a more easily usable form.

class EmbeddingsFinisher

Extracts embeddings from Annotations into a more easily usable form.

This is useful, for example, for:

  • WordEmbeddings,

  • Transformer based embeddings such as BertEmbeddings,

  • SentenceEmbeddings and

  • ChunkEmbeddings, etc.

By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or vectors that are compatible with Spark ML functions such as LDA, K-means, Random Forest classifier, or any other function that requires a featureCol.
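To make the idea of "finishing" concrete, here is a minimal conceptual sketch in plain Python. It is not the library's implementation: `finish_embeddings` is a hypothetical helper, and the annotation records are simplified dictionaries with made-up values that only mirror the field names of a Spark NLP annotation.

```python
from typing import Any, Dict, List

# Simplified, hypothetical stand-in for one annotation record.
Annotation = Dict[str, Any]


def finish_embeddings(annotations: List[Annotation]) -> List[List[float]]:
    """Drop everything except the embedding values, returned as plain floats."""
    return [[float(x) for x in ann["embeddings"]] for ann in annotations]


# Made-up sample data; real embeddings have many more dimensions.
annotations = [
    {"annotatorType": "word_embeddings", "begin": 0, "end": 4,
     "result": "Spark", "metadata": {"token": "Spark"},
     "embeddings": [0.5, -0.25, 1.0]},
    {"annotatorType": "word_embeddings", "begin": 6, "end": 8,
     "result": "NLP", "metadata": {"token": "NLP"},
     "embeddings": [0.0, 2.0, -1.5]},
]

finished = finish_embeddings(annotations)
print(finished)
```

Each annotation contributes one list of floats, which is the shape that downstream numeric code can consume directly.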

For more extended examples see the `Examples <JohnSnowLabs/spark-nlp>`__.

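====================== ======================
Input Annotation types Output Annotation type
====================== ======================
EMBEDDINGS             NONE
====================== ======================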

Parameters:
inputCols

Names of input annotation columns containing embeddings

outputCols

Names of finished output columns

cleanAnnotations

Whether to remove all the existing annotation columns, by default False

outputAsVector

Whether to output the embeddings as Vectors instead of arrays, by default False

See also

Finisher

for finishing Strings

Examples

First extract embeddings.

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...    .setInputCol("text") \
...    .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...    .setInputCols("document") \
...    .setOutputCol("token")
>>> normalizer = Normalizer() \
...    .setInputCols("token") \
...    .setOutputCol("normalized")
>>> stopwordsCleaner = StopWordsCleaner() \
...    .setInputCols("normalized") \
...    .setOutputCol("cleanTokens") \
...    .setCaseSensitive(False)
>>> gloveEmbeddings = WordEmbeddingsModel.pretrained() \
...    .setInputCols("document", "cleanTokens") \
...    .setOutputCol("embeddings") \
...    .setCaseSensitive(False)
>>> embeddingsFinisher = EmbeddingsFinisher() \
...    .setInputCols("embeddings") \
...    .setOutputCols("finished_sentence_embeddings") \
...    .setOutputAsVector(True) \
...    .setCleanAnnotations(False)
>>> data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]) \
...    .toDF("text")
>>> pipeline = Pipeline().setStages([
...    documentAssembler,
...    tokenizer,
...    normalizer,
...    stopwordsCleaner,
...    gloveEmbeddings,
...    embeddingsFinisher
... ]).fit(data)
>>> result = pipeline.transform(data)

Show results.

>>> resultWithSize = result.selectExpr("explode(finished_sentence_embeddings) as embeddings")
>>> resultWithSize.show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+--------------------------------------------------------------------------------+
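The exploded output above holds one vector per token. If a single feature vector per document is wanted, one common option (among several) is to average the token vectors. The following is a minimal sketch in plain Python under that assumption; `mean_pool` is a hypothetical helper, not part of Spark NLP, and the input values are made up.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equally sized token vectors into one document vector."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Made-up 3-dimensional vectors standing in for finished token embeddings.
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
document_vector = mean_pool(token_vectors)
print(document_vector)  # [2.0, 3.0, 4.0]
```

Averaging discards word order, so it suits clustering or simple classification better than tasks that depend on sequence.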

setInputCols(*value)

Sets names of input annotation columns containing embeddings.

Parameters:
*value : str

Input columns for the annotator

setOutputCols(*value)

Sets names of finished output columns.

Parameters:
*value : List[str]

Output columns for the annotator

setCleanAnnotations(value)

Sets whether to remove all the existing annotation columns, by default False.

Parameters:
value : bool

Whether to remove all the existing annotation columns

setOutputAsVector(value)

Sets whether to output the embeddings as Vectors instead of arrays, by default False.

Parameters:
value : bool

Whether to output the embeddings as Vectors instead of arrays

getInputCols()

Gets the names of the input annotation columns.

getOutputCols()

Gets the names of the output columns.