sparknlp.annotator.embeddings.doc2vec#

Contains classes for Doc2Vec.

Module Contents#

Classes#

Doc2VecApproach

Trains a Word2Vec model that creates vector representations of words in a text corpus.

Doc2VecModel

Word2Vec model that creates vector representations of words in a text corpus.

class Doc2VecApproach[source]#

Trains a Word2Vec model that creates vector representations of words in a text corpus.

The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.

We use the Word2Vec implementation in Spark ML, which uses the skip-gram model and a hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.

For instantiated/pretrained models, see Doc2VecModel.

For available pretrained models please see the Models Hub.

Input Annotation types: TOKEN

Output Annotation type: SENTENCE_EMBEDDINGS

Parameters:
vectorSize

The dimension of codes after transforming from words (> 0), by default 100

windowSize

The window size (context words from [-window, window]) (> 0), by default 5

numPartitions

Number of partitions for sentences of words (> 0), by default 1

minCount

The minimum number of times a token must appear to be included in the word2vec model’s vocabulary (>= 0), by default 1

maxSentenceLength

Maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to this size (> 0), by default 1000

stepSize

Step size (learning rate) to be used for each iteration of optimization (> 0), by default 0.025

maxIter

Maximum number of iterations (>= 0), by default 1

seed

Random seed, by default 44
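These parameters map directly to the setter methods listed below; as a rough configuration sketch (the values here are illustrative, not tuned recommendations):

>>> embeddings = Doc2VecApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings") \
...     .setVectorSize(300) \
...     .setWindowSize(5) \
...     .setNumPartitions(4) \
...     .setMaxIter(4) \
...     .setMinCount(5) \
...     .setSeed(42)

Note that maxIter is kept no larger than numPartitions, per the guidance on setMaxIter below.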

References

For the original C implementation, see https://code.google.com/p/word2vec/

For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = Doc2VecApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       tokenizer,
...       embeddings
...     ])
>>> path = "sherlockholmes.txt"
>>> dataset = spark.read.text(path).toDF("text")
>>> pipelineModel = pipeline.fit(dataset)
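The fitted pipeline can then annotate data; a minimal follow-up sketch, assuming the pipelineModel fitted above:

>>> result = pipelineModel.transform(dataset)
>>> result.select("embeddings").show(1, truncate=80)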
setVectorSize(vectorSize)[source]#

Sets vector size (default: 100).

setWindowSize(windowSize)[source]#

Sets window size (default: 5).

setStepSize(stepSize)[source]#

Sets initial learning rate (default: 0.025).

setNumPartitions(numPartitions)[source]#

Sets number of partitions (default: 1). Use a small number for accuracy.

setMaxIter(numIterations)[source]#

Sets the number of iterations (default: 1), which should be smaller than or equal to the number of partitions.

setSeed(seed)[source]#

Sets random seed.

setMinCount(minCount)[source]#

Sets minCount, the minimum number of times a token must appear to be included in the word2vec model’s vocabulary (default: 1).

setMaxSentenceLength(maxSentenceLength)[source]#

Sets the maximum length (in words) of each sentence in the input data (default: 1000). Any sentence longer than this threshold will be divided into chunks of up to this size.

class Doc2VecModel(classname='com.johnsnowlabs.nlp.embeddings.Doc2VecModel', java_model=None)[source]#

Word2Vec model that creates vector representations of words in a text corpus.

The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.

We use the Word2Vec implementation in Spark ML, which uses the skip-gram model and a hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.

This is the instantiated model of the Doc2VecApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = Doc2VecModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")

The default model is “doc2vec_gigaword_300”, if no name is provided.

Input Annotation types: TOKEN

Output Annotation type: SENTENCE_EMBEDDINGS

Parameters:
vectorSize

The dimension of codes after transforming from words (> 0), by default 100

References

For the original C implementation, see https://code.google.com/p/word2vec/

For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = Doc2VecModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.06222493574023247,0.011579325422644615,0.009919632226228714,0.109361454844...|
+--------------------------------------------------------------------------------+
setVectorSize(vectorSize)[source]#

Sets vector size (default: 100).

static pretrained(name='doc2vec_gigaword_300', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “doc2vec_gigaword_300”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP’s repositories otherwise.

Returns:
Doc2VecModel

The restored model
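For example, the default model can also be requested explicitly by name and language (equivalent to calling pretrained() with no arguments):

>>> embeddings = Doc2VecModel.pretrained("doc2vec_gigaword_300", lang="en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")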

getVectors()[source]#

Returns the vector representation of the words as a dataframe with two fields, word and vector.
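A brief inspection sketch, assuming a pretrained model, an active Spark session, and that getVectors() returns a Spark DataFrame with the two fields described above:

>>> model = Doc2VecModel.pretrained()
>>> model.getVectors().select("word", "vector").show(5)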