sparknlp.annotator.embeddings.chunk_embeddings
Contains classes for ChunkEmbeddings
Module Contents
Classes
- class ChunkEmbeddings
This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs.
For extended examples of usage, see the Examples.
Input Annotation types: CHUNK, WORD_EMBEDDINGS
Output Annotation type: WORD_EMBEDDINGS
- Parameters:
- poolingStrategy
How to aggregate Word Embeddings into Chunk Embeddings, by default AVERAGE. Possible values: AVERAGE, SUM
- skipOOV
Whether to discard default vectors for OOV words from the aggregation/pooling.
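To make the two parameters concrete, here is a minimal pure-Python sketch of what pooling a chunk's word vectors could look like. It is an illustration only, not the annotator's actual implementation: the function name pool_chunk, the (vector, is_oov) token representation, and the empty-list result for a chunk with no usable vectors are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical token representation: (word vector, is_oov flag).
# The flag stands in for however the embeddings annotator marks
# out-of-vocabulary words; it is an assumption of this sketch.
Token = Tuple[Sequence[float], bool]


def pool_chunk(
    tokens: Sequence[Token],
    strategy: str = "AVERAGE",
    skip_oov: bool = False,
) -> List[float]:
    """Aggregate the word vectors of one chunk into a single vector."""
    # Optionally drop the default vectors of out-of-vocabulary words.
    vectors = [vec for vec, is_oov in tokens if not (skip_oov and is_oov)]
    if not vectors:
        return []  # nothing left to pool (behaviour chosen for this sketch)

    # Element-wise sum across all remaining word vectors.
    summed = [sum(dim) for dim in zip(*vectors)]

    if strategy == "SUM":
        return summed
    if strategy == "AVERAGE":
        return [value / len(vectors) for value in summed]
    raise ValueError(f"Unknown pooling strategy: {strategy}")


# A chunk of two known words and one OOV word with an all-zero default vector.
chunk = [([1.0, 2.0], False), ([3.0, 4.0], False), ([0.0, 0.0], True)]

print(pool_chunk(chunk, "SUM"))                     # [4.0, 6.0]
print(pool_chunk(chunk, "AVERAGE", skip_oov=True))  # [2.0, 3.0]
```

In this sketch the zero vector of the OOV word leaves SUM unchanged but pulls AVERAGE toward zero, which is the kind of effect skipOOV is meant to avoid.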
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
Extract the Embeddings from the NGrams
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> nGrams = NGramGenerator() \
...     .setInputCols(["token"]) \
...     .setOutputCol("chunk") \
...     .setN(2)
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings") \
...     .setCaseSensitive(False)
Convert the NGram chunks into Word Embeddings
>>> chunkEmbeddings = ChunkEmbeddings() \
...     .setInputCols(["chunk", "embeddings"]) \
...     .setOutputCol("chunk_embeddings") \
...     .setPoolingStrategy("AVERAGE")
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       sentence,
...       tokenizer,
...       nGrams,
...       embeddings,
...       chunkEmbeddings
...     ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(chunk_embeddings) as result") \
...     .select("result.annotatorType", "result.result", "result.embeddings") \
...     .show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
|  annotatorType|    result|                                                                      embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings|   This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings|      is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+