sparknlp.annotator.embeddings.chunk_embeddings

Contains classes for ChunkEmbeddings.

Module Contents

Classes

ChunkEmbeddings
    This annotator utilizes WordEmbeddings, BertEmbeddings, etc. to generate chunk embeddings from Chunker, NGramGenerator, or NerConverter outputs.

class ChunkEmbeddings

This annotator utilizes WordEmbeddings, BertEmbeddings, etc. to generate chunk embeddings from Chunker, NGramGenerator, or NerConverter outputs.

For extended examples of usage, see the Examples.

Input Annotation types: CHUNK, WORD_EMBEDDINGS

Output Annotation type: WORD_EMBEDDINGS

Parameters:

poolingStrategy
    Choose how you would like to aggregate Word Embeddings to Chunk Embeddings, by default AVERAGE. Possible values: AVERAGE, SUM (see the arithmetic sketch below).

skipOOV
    Whether to discard default vectors for OOV words from the aggregation/pooling.
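
The pooling itself is plain element-wise arithmetic over the word vectors that fall inside a chunk. Here is a minimal sketch of the two strategies using NumPy with made-up two-token vectors; the values and the token_vectors name are illustrative only, not part of the Spark NLP API:

>>> import numpy as np
>>> # Toy embeddings for the two tokens of a chunk such as "This is"
>>> token_vectors = np.array([
...     [0.2, -0.4, 0.6],
...     [0.4,  0.0, 0.2],
... ])
>>> token_vectors.mean(axis=0)  # poolingStrategy="AVERAGE"
array([ 0.3, -0.2,  0.4])
>>> token_vectors.sum(axis=0)   # poolingStrategy="SUM"
array([ 0.6, -0.4,  0.8])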

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

Extract the Embeddings from the NGrams

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> nGrams = NGramGenerator() \
...     .setInputCols(["token"]) \
...     .setOutputCol("chunk") \
...     .setN(2)
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings") \
...     .setCaseSensitive(False)

Convert the NGram chunks into Word Embeddings

>>> chunkEmbeddings = ChunkEmbeddings() \
...     .setInputCols(["chunk", "embeddings"]) \
...     .setOutputCol("chunk_embeddings") \
...     .setPoolingStrategy("AVERAGE")
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       sentence,
...       tokenizer,
...       nGrams,
...       embeddings,
...       chunkEmbeddings
...     ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(chunk_embeddings) as result") \
...     .select("result.annotatorType", "result.result", "result.embeddings") \
...     .show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
|  annotatorType|    result|                                                                      embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings|   This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings|      is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+
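
To consume the pooled vectors outside of the annotation structs, for example as Spark ML vectors for a downstream estimator, the result can typically be post-processed with EmbeddingsFinisher. A short sketch continuing from the pipeline above (the column names are carried over from the example):

>>> from sparknlp.base import EmbeddingsFinisher
>>> finisher = EmbeddingsFinisher() \
...     .setInputCols(["chunk_embeddings"]) \
...     .setOutputCols(["finished_embeddings"]) \
...     .setOutputAsVector(True)
>>> finished = finisher.transform(result)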

setPoolingStrategy(strategy)

Sets how to aggregate Word Embeddings to Chunk Embeddings, by default AVERAGE.

Possible values: AVERAGE, SUM

Parameters:
strategy : str
    Aggregation strategy.
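
For example, to switch the annotator to summation pooling (an illustrative one-liner; setters return the annotator itself, so calls can be chained as in the pipeline example above):

>>> chunkEmbeddings = ChunkEmbeddings().setPoolingStrategy("SUM")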


setSkipOOV(value)

Sets whether to discard default vectors for OOV words from the aggregation/pooling.

Parameters:
value : bool
    Whether to discard default vectors for OOV words from the aggregation/pooling.
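
For example, to exclude the default vectors that out-of-vocabulary tokens receive from the pooled result (an illustrative one-liner):

>>> chunkEmbeddings = ChunkEmbeddings().setSkipOOV(True)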