class Doc2VecApproach extends AnnotatorApproach[Doc2VecModel] with HasStorageRef with HasEnableCachingProperties with HasProtectedParams
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns a vector representation for each word in the vocabulary. These vectors can be used as features in natural language processing and machine learning algorithms.
This annotator uses the Word2Vec implementation from Spark ML, which trains a skip-gram model with hierarchical softmax. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Doc2VecModel.
Sources:
For the original C implementation, see https://code.google.com/p/word2vec/
For the research papers, see "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality".
Example
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Doc2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = new Doc2VecApproach()
  .setInputCols("token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings
  ))

val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
  .toDF("text")
val pipelineModel = pipeline.fit(dataset)
Parameters
A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
- val enableCaching: BooleanParam
  Whether to enable caching of DataFrames or RDDs during training.
  Definition Classes: HasEnableCachingProperties
- val maxIter: IntParam
  Maximum number of iterations (>= 0) (Default: 1).
- val maxSentenceLength: IntParam
  Sets the maximum length (in words) of each sentence in the input data (Default: 1000). Any sentence longer than this threshold will be divided into chunks of up to maxSentenceLength size.
- val minCount: IntParam
  The minimum number of times a token must appear to be included in the word2vec model's vocabulary (Default: 5).
- val numPartitions: IntParam
  Number of partitions for sentences of words (Default: 1).
- val seed: IntParam
  Random seed for shuffling the dataset (Default: 44).
- val stepSize: DoubleParam
  Step size to be used for each iteration of optimization (> 0) (Default: 0.025).
- val storageRef: Param[String]
  Unique identifier for storage (Default: this.uid).
  Definition Classes: HasStorageRef
- val vectorSize: ProtectedParam[Int]
  The dimension of the vector that each word is transformed into (Default: 100).
- val windowSize: IntParam
  The window size (context words from [-window, window]) (Default: 5).
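To make the documented meaning of maxSentenceLength, minCount and windowSize more concrete, here is a minimal plain-Scala sketch. The object and helper names are hypothetical and are not part of Spark NLP or Spark ML. The sketch only mirrors the parameter descriptions above; the actual training code may differ in detail.

```scala
// Hypothetical helpers that mimic the documented parameter semantics.
// They are illustrations only, not Spark NLP or Spark ML code.
object Doc2VecParamSketch {

  // maxSentenceLength: split a long token sequence into chunks
  // of at most maxLen tokens each.
  def chunk(tokens: Seq[String], maxLen: Int): Seq[Seq[String]] =
    tokens.grouped(maxLen).toSeq

  // minCount: keep only tokens that occur at least minCount times
  // across the whole corpus.
  def vocabulary(corpus: Seq[Seq[String]], minCount: Int): Set[String] =
    corpus.flatten
      .groupBy(identity)
      .collect { case (token, occurrences) if occurrences.size >= minCount => token }
      .toSet

  // windowSize: the context of the token at index i is every token whose
  // index lies in [i - window, i + window], excluding i itself.
  def context(tokens: Seq[String], i: Int, window: Int): Seq[String] = {
    val start = math.max(0, i - window)
    val end = math.min(tokens.length, i + window + 1)
    (start until end).filter(_ != i).map(j => tokens(j))
  }
}
```

For instance, `Doc2VecParamSketch.chunk(Seq("a", "b", "c", "d", "e"), 2)` splits five tokens into chunks of two, two and one.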
Annotator types
Required input and expected output annotator types
- val inputAnnotatorTypes: Array[AnnotatorType]
  Input Annotator Types: TOKEN
  Definition Classes: Doc2VecApproach → HasInputAnnotationCols
- val outputAnnotatorType: String
  Output Annotator Types: SENTENCE_EMBEDDINGS
  Definition Classes: Doc2VecApproach → HasOutputAnnotatorType
Members
- implicit class ProtectedParam[T] extends Param[T]
  Definition Classes: HasProtectedParams
- type AnnotatorType = String
  Definition Classes: HasOutputAnnotatorType
- def beforeTraining(spark: SparkSession): Unit
  Definition Classes: Doc2VecApproach → AnnotatorApproach
- final def clear(param: Param[_]): Doc2VecApproach.this.type
  Definition Classes: Params
- final def copy(extra: ParamMap): Estimator[Doc2VecModel]
  Definition Classes: AnnotatorApproach → Estimator → PipelineStage → Params
- def createDatabaseConnection(database: Name): RocksDBConnection
  Definition Classes: HasStorageRef
- val description: String
  Definition Classes: Doc2VecApproach → AnnotatorApproach
- def explainParam(param: Param[_]): String
  Definition Classes: Params
- def explainParams(): String
  Definition Classes: Params
- final def extractParamMap(): ParamMap
  Definition Classes: Params
- final def extractParamMap(extra: ParamMap): ParamMap
  Definition Classes: Params
- val features: ArrayBuffer[Feature[_, _, _]]
  Definition Classes: HasFeatures
- final def fit(dataset: Dataset[_]): Doc2VecModel
  Definition Classes: AnnotatorApproach → Estimator
- def fit(dataset: Dataset[_], paramMaps: Seq[ParamMap]): Seq[Doc2VecModel]
  Definition Classes: Estimator
  Annotations: @Since( "2.0.0" )
- def fit(dataset: Dataset[_], paramMap: ParamMap): Doc2VecModel
  Definition Classes: Estimator
  Annotations: @Since( "2.0.0" )
- def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): Doc2VecModel
  Definition Classes: Estimator
  Annotations: @Since( "2.0.0" ) @varargs()
- final def get[T](param: Param[T]): Option[T]
  Definition Classes: Params
- final def getDefault[T](param: Param[T]): Option[T]
  Definition Classes: Params
- def getInputCols: Array[String]
  returns: the input annotation columns currently used
  Definition Classes: HasInputAnnotationCols
- def getLazyAnnotator: Boolean
  Definition Classes: CanBeLazy
- final def getOrDefault[T](param: Param[T]): T
  Definition Classes: Params
- final def getOutputCol: String
  Gets the name of the annotation column that will be generated.
  Definition Classes: HasOutputAnnotationCol
- def getParam(paramName: String): Param[Any]
  Definition Classes: Params
- def getStorageRef: String
  Definition Classes: HasStorageRef
- final def hasDefault[T](param: Param[T]): Boolean
  Definition Classes: Params
- def hasParam(paramName: String): Boolean
  Definition Classes: Params
- final def isDefined(param: Param[_]): Boolean
  Definition Classes: Params
- final def isSet(param: Param[_]): Boolean
  Definition Classes: Params
- val lazyAnnotator: BooleanParam
  Definition Classes: CanBeLazy
- def onTrained(model: Doc2VecModel, spark: SparkSession): Unit
  Definition Classes: AnnotatorApproach
- val optionalInputAnnotatorTypes: Array[String]
  Definition Classes: HasInputAnnotationCols
- lazy val params: Array[Param[_]]
  Definition Classes: Params
- def save(path: String): Unit
  Definition Classes: MLWritable
  Annotations: @Since( "1.6.0" ) @throws( ... )
- def set[T](param: ProtectedParam[T], value: T): Doc2VecApproach.this.type
  Sets the value for a protected Param.
  If the parameter was already set, it will not be set again. Default values do not count as a set value and can be overridden.
  T: Type of the parameter
  param: Protected parameter to set
  value: Value for the parameter
  returns: This object
  Definition Classes: HasProtectedParams
- final def set[T](param: Param[T], value: T): Doc2VecApproach.this.type
  Definition Classes: Params
- final def setInputCols(value: String*): Doc2VecApproach.this.type
  Definition Classes: HasInputAnnotationCols
- def setInputCols(value: Array[String]): Doc2VecApproach.this.type
  Overrides the required annotator input columns if they differ from the default.
  Definition Classes: HasInputAnnotationCols
- def setLazyAnnotator(value: Boolean): Doc2VecApproach.this.type
  Definition Classes: CanBeLazy
- final def setOutputCol(value: String): Doc2VecApproach.this.type
  Overrides the annotation column name used when transforming.
  Definition Classes: HasOutputAnnotationCol
- def setStorageRef(value: String): Doc2VecApproach.this.type
  Definition Classes: HasStorageRef
- def toString(): String
  Definition Classes: Identifiable → AnyRef → Any
- def train(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): Doc2VecModel
  Definition Classes: Doc2VecApproach → AnnotatorApproach
- final def transformSchema(schema: StructType): StructType
  Requirement for pipeline transformation validation. It is called on fit().
  Definition Classes: AnnotatorApproach → PipelineStage
- val uid: String
  Definition Classes: Doc2VecApproach → Identifiable
- def validateStorageRef(dataset: Dataset[_], inputCols: Array[String], annotatorType: String): Unit
  Definition Classes: HasStorageRef
- def write: MLWriter
  Definition Classes: ParamsAndFeaturesWritable → DefaultParamsWritable → MLWritable
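The protected `set` overload listed under Members is documented as ignoring a second explicit set while still letting a default be overridden. The following plain-Scala sketch illustrates that rule with a hypothetical `ProtectedValue` class. It is an assumption-based stand-in for explanation only and does not reproduce the HasProtectedParams implementation.

```scala
// Hypothetical stand-in for a protected parameter; not the Spark NLP class.
final class ProtectedValue[T](default: T) {
  // Holds the explicitly set value, if any. The default is kept separately
  // so that it never counts as "already set".
  private var explicit: Option[T] = None

  // Only the first explicit set takes effect; later calls are ignored.
  def set(value: T): this.type = {
    if (explicit.isEmpty) explicit = Some(value)
    this
  }

  // Falls back to the default when nothing was set explicitly.
  def get: T = explicit.getOrElse(default)
}

object ProtectedValueDemo {
  def run(): (Int, Int) = {
    val untouched = new ProtectedValue[Int](100)
    val configured = new ProtectedValue[Int](100)
    configured.set(300) // overrides the default
    configured.set(50)  // ignored, a value was already set
    (untouched.get, configured.get)
  }
}
```

Under these assumptions, `ProtectedValueDemo.run()` returns `(100, 300)`: the untouched value keeps its default, and the second `set` call has no effect.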
Parameter setters
- def setEnableCaching(value: Boolean): Doc2VecApproach.this.type
  Definition Classes: HasEnableCachingProperties
- def setMaxIter(value: Int): Doc2VecApproach.this.type
- def setMaxSentenceLength(value: Int): Doc2VecApproach.this.type
- def setMinCount(value: Int): Doc2VecApproach.this.type
- def setNumPartitions(value: Int): Doc2VecApproach.this.type
- def setSeed(value: Int): Doc2VecApproach.this.type
- def setStepSize(value: Double): Doc2VecApproach.this.type
- def setVectorSize(value: Int): Doc2VecApproach.this.type
- def setWindowSize(value: Int): Doc2VecApproach.this.type
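As a configuration sketch, the setters above can be chained on a new Doc2VecApproach, since each one returns the annotator itself. The snippet assumes Spark NLP is on the classpath, and the chosen values are arbitrary illustrations rather than recommended settings.

```scala
import com.johnsnowlabs.nlp.annotator.Doc2VecApproach

// Arbitrary example values; tune them for your own corpus.
val embeddings = new Doc2VecApproach()
  .setInputCols("token")        // expects TOKEN annotations
  .setOutputCol("embeddings")   // produces SENTENCE_EMBEDDINGS
  .setVectorSize(200)           // default is 100
  .setWindowSize(3)             // default is 5
  .setMinCount(2)               // default is 5
  .setMaxIter(5)                // default is 1
  .setStepSize(0.05)            // default is 0.025
  .setMaxSentenceLength(500)    // default is 1000
  .setNumPartitions(4)          // default is 1
  .setSeed(42)                  // default is 44
  .setEnableCaching(true)
```

The resulting `embeddings` stage can take the place of the unconfigured Doc2VecApproach in the pipeline from the Example section.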