Description
GloVe (Global Vectors) is a model for distributed word representation. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. It outperformed many common Word2vec models on the word analogy task. One benefit of GloVe is that it is the result of directly modeling relationships, instead of getting them as a side effect of training a language model.
How to use
...
embeddings = WordEmbeddings.pretrained("glove_6B_300", "xx") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love Spark NLP']], ["text"]))
val embeddings = WordEmbeddings.pretrained("glove_6B_300", "xx")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love Spark NLP").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["""I love Spark NLP"""]
glove_df = nlu.load('xx.embed.glove.6B_300').predict(text)
glove_df
Results
token | glove_embeddings |
-------|----------------------------------------------------|
I | [0.1941000074148178, 0.22603000700473785, -0.4...] |
love | [0.13948999345302582, 0.534529983997345, -0.25...] |
Spark | [0.20353999733924866, 0.6292600035667419, 0.27...] |
NLP | [0.059436000883579254, 0.18411000072956085, -0...] |
Model Information
Model Name: | glove_6B_300 |
Type: | embeddings |
Compatibility: | Spark NLP 2.4.0+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [sentence, token] |
Output Labels: | [word_embeddings] |
Language: | [xx] |
Dimension: | 300 |
Case sensitive: | false |
Data Source
The model is imported from https://nlp.stanford.edu/projects/glove/