sparknlp.base.embeddings_finisher
Contains classes for the EmbeddingsFinisher.
Module Contents
Classes
EmbeddingsFinisher: Extracts embeddings from Annotations into a more easily usable form.
- class EmbeddingsFinisher[source]
Extracts embeddings from Annotations into a more easily usable form.
This is useful, for example, for WordEmbeddings, transformer-based embeddings such as BertEmbeddings, SentenceEmbeddings, ChunkEmbeddings, etc.
By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or Vectors which are compatible with Spark ML functions such as LDA, K-Means, Random Forest classifier or any other function that requires a featureCol.
For more extended examples see the `Examples <JohnSnowLabs/spark-nlp>`__.
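Conceptually, the finisher pulls the embeddings field out of each annotation struct into a plain column. The following is a dependency-free Python sketch of that extraction, not Spark NLP code; the annotation dicts mirror the Annotation schema's field names, but the values are made up for illustration:

```python
# Toy annotations mimicking the Spark NLP Annotation schema; field names
# mirror the real schema, but the values are illustrative only.
annotations = [
    {"annotatorType": "word_embeddings", "result": "Spark", "embeddings": [0.1, 0.2]},
    {"annotatorType": "word_embeddings", "result": "NLP", "embeddings": [0.3, 0.4]},
]

def finish_embeddings(annotations):
    """Pull the raw embeddings array out of each annotation.

    This sketches the array-of-floats output; with setOutputAsVector(True)
    the real finisher emits Spark ML Vectors instead of plain arrays.
    """
    return [a["embeddings"] for a in annotations]

print(finish_embeddings(annotations))  # [[0.1, 0.2], [0.3, 0.4]]
```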
Input Annotation types: EMBEDDINGS
Output Annotation type: NONE
- Parameters:
- inputCols
Names of input annotation columns containing embeddings
- outputCols
Names of finished output columns
- cleanAnnotations
Whether to remove all the existing annotation columns, by default False
- outputAsVector
Whether to output the embeddings as Vectors instead of arrays, by default False
See also
Finisher
for finishing Strings
Examples
First extract embeddings.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> normalizer = Normalizer() \
...     .setInputCols("token") \
...     .setOutputCol("normalized")
>>> stopwordsCleaner = StopWordsCleaner() \
...     .setInputCols("normalized") \
...     .setOutputCol("cleanTokens") \
...     .setCaseSensitive(False)
>>> gloveEmbeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols("document", "cleanTokens") \
...     .setOutputCol("embeddings") \
...     .setCaseSensitive(False)
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols("embeddings") \
...     .setOutputCols("finished_sentence_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]) \
...     .toDF("text")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     normalizer,
...     stopwordsCleaner,
...     gloveEmbeddings,
...     embeddingsFinisher
... ]).fit(data)
>>> result = pipeline.transform(data)
Show results.
>>> resultWithSize = result.selectExpr("explode(finished_sentence_embeddings) as embeddings")
>>> resultWithSize.show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+--------------------------------------------------------------------------------+
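Note that the finished column holds one vector per token, while many Spark ML estimators expect a single feature vector per row. A common follow-up is average pooling of the token vectors into one sentence-level vector; a plain-Python sketch with toy numbers (not the GloVe values above) is shown below. Inside a Spark NLP pipeline itself, SentenceEmbeddings performs this kind of pooling for you.

```python
# Toy per-token vectors; values are illustrative only.
token_vectors = [
    [0.2, 0.4, 0.0],
    [0.6, 0.0, 0.6],
]

def average_vectors(vectors):
    """Element-wise mean of equal-length vectors (average pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

print(average_vectors(token_vectors))  # [0.4, 0.2, 0.3]
```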
- setInputCols(*value)[source]
Sets names of input annotation columns containing embeddings.
- Parameters:
- *value : str
Input columns for the annotator
- setOutputCols(*value)[source]
Sets names of finished output columns.
- Parameters:
- *value : List[str]
Output columns for the annotator
- setCleanAnnotations(value)[source]
Sets whether to remove all the existing annotation columns, by default False.
- Parameters:
- value : bool
Whether to remove all the existing annotation columns