class Word2VecApproach extends AnnotatorApproach[Word2VecModel] with HasStorageRef with HasEnableCachingProperties with HasProtectedParams
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
We use Word2Vec implemented in Spark ML. It uses skip-gram model in our implementation and a hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Word2VecModel.
Sources :
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Example
import spark.implicits._ import com.johnsnowlabs.nlp.annotator.{Tokenizer, Word2VecApproach} import com.johnsnowlabs.nlp.base.DocumentAssembler import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = new Word2VecApproach() .setInputCols("token") .setOutputCol("embeddings") val pipeline = new Pipeline() .setStages(Array( documentAssembler, tokenizer, embeddings )) val path = "src/test/resources/spell/sherlockholmes.txt" val dataset = spark.sparkContext.textFile(path) .toDF("text") val pipelineModel = pipeline.fit(dataset)
- Grouped
- Alphabetic
- By Inheritance
- Word2VecApproach
- HasProtectedParams
- HasEnableCachingProperties
- HasStorageRef
- ParamsAndFeaturesWritable
- HasFeatures
- AnnotatorApproach
- CanBeLazy
- DefaultParamsWritable
- MLWritable
- HasOutputAnnotatorType
- HasOutputAnnotationCol
- HasInputAnnotationCols
- Estimator
- PipelineStage
- Logging
- Params
- Serializable
- Serializable
- Identifiable
- AnyRef
- Any
- Hide All
- Show All
- Public
- All
Type Members
-
implicit
class
ProtectedParam[T] extends Param[T]
- Definition Classes
- HasProtectedParams
-
type
AnnotatorType = String
- Definition Classes
- HasOutputAnnotatorType
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
$[T](param: Param[T]): T
- Attributes
- protected
- Definition Classes
- Params
-
def
$$[T](feature: StructFeature[T]): T
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[K, V](feature: MapFeature[K, V]): Map[K, V]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[T](feature: SetFeature[T]): Set[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[T](feature: ArrayFeature[T]): Array[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
_fit(dataset: Dataset[_], recursiveStages: Option[PipelineModel]): Word2VecModel
- Attributes
- protected
- Definition Classes
- AnnotatorApproach
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
beforeTraining(spark: SparkSession): Unit
- Definition Classes
- Word2VecApproach → AnnotatorApproach
-
final
def
checkSchema(schema: StructType, inputAnnotatorType: String): Boolean
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
final
def
clear(param: Param[_]): Word2VecApproach.this.type
- Definition Classes
- Params
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
final
def
copy(extra: ParamMap): Estimator[Word2VecModel]
- Definition Classes
- AnnotatorApproach → Estimator → PipelineStage → Params
-
def
copyValues[T <: Params](to: T, extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
-
def
createDatabaseConnection(database: Name): RocksDBConnection
- Definition Classes
- HasStorageRef
-
final
def
defaultCopy[T <: Params](extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
-
val
description: String
- Definition Classes
- Word2VecApproach → AnnotatorApproach
-
val
enableCaching: BooleanParam
Whether to enable caching DataFrames or RDDs during the training
Whether to enable caching DataFrames or RDDs during the training
- Definition Classes
- HasEnableCachingProperties
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
explainParam(param: Param[_]): String
- Definition Classes
- Params
-
def
explainParams(): String
- Definition Classes
- Params
-
final
def
extractParamMap(): ParamMap
- Definition Classes
- Params
-
final
def
extractParamMap(extra: ParamMap): ParamMap
- Definition Classes
- Params
-
val
features: ArrayBuffer[Feature[_, _, _]]
- Definition Classes
- HasFeatures
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
fit(dataset: Dataset[_]): Word2VecModel
- Definition Classes
- AnnotatorApproach → Estimator
-
def
fit(dataset: Dataset[_], paramMaps: Seq[ParamMap]): Seq[Word2VecModel]
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" )
-
def
fit(dataset: Dataset[_], paramMap: ParamMap): Word2VecModel
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" )
-
def
fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): Word2VecModel
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" ) @varargs()
-
def
get[T](feature: StructFeature[T]): Option[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[K, V](feature: MapFeature[K, V]): Option[Map[K, V]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[T](feature: SetFeature[T]): Option[Set[T]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[T](feature: ArrayFeature[T]): Option[Array[T]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
get[T](param: Param[T]): Option[T]
- Definition Classes
- Params
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
getDefault[T](param: Param[T]): Option[T]
- Definition Classes
- Params
-
def
getEnableCaching: Boolean
- Definition Classes
- HasEnableCachingProperties
-
def
getInputCols: Array[String]
- returns
input annotations columns currently used
- Definition Classes
- HasInputAnnotationCols
-
def
getLazyAnnotator: Boolean
- Definition Classes
- CanBeLazy
- def getMaxIter: Int
- def getMaxSentenceLength: Int
- def getMinCount: Int
- def getNumPartitions: Int
-
final
def
getOrDefault[T](param: Param[T]): T
- Definition Classes
- Params
-
final
def
getOutputCol: String
Gets annotation column name going to generate
Gets annotation column name going to generate
- Definition Classes
- HasOutputAnnotationCol
-
def
getParam(paramName: String): Param[Any]
- Definition Classes
- Params
- def getSeed: Int
- def getStepSize: Double
-
def
getStorageRef: String
- Definition Classes
- HasStorageRef
- def getVectorSize: Int
- def getWindowSize: Int
-
final
def
hasDefault[T](param: Param[T]): Boolean
- Definition Classes
- Params
-
def
hasParam(paramName: String): Boolean
- Definition Classes
- Params
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
val
inputAnnotatorTypes: Array[AnnotatorType]
Input Annotator Types: TOKEN
Input Annotator Types: TOKEN
- Definition Classes
- Word2VecApproach → HasInputAnnotationCols
-
final
val
inputCols: StringArrayParam
columns that contain annotations necessary to run this annotator AnnotatorType is used both as input and output columns if not specified
columns that contain annotations necessary to run this annotator AnnotatorType is used both as input and output columns if not specified
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
final
def
isDefined(param: Param[_]): Boolean
- Definition Classes
- Params
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
final
def
isSet(param: Param[_]): Boolean
- Definition Classes
- Params
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
val
lazyAnnotator: BooleanParam
- Definition Classes
- CanBeLazy
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
val
maxIter: IntParam
Param for maximum number of iterations (>= 0) (Default:
1
) -
val
maxSentenceLength: IntParam
Sets the maximum length (in words) of each sentence in the input data (Default:
1000
).Sets the maximum length (in words) of each sentence in the input data (Default:
1000
). Any sentence longer than this threshold will be divided into chunks of up tomaxSentenceLength
size. -
val
minCount: IntParam
The minimum number of times a token must appear to be included in the word2vec model's vocabulary (Default:
5
). -
def
msgHelper(schema: StructType): String
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
val
numPartitions: IntParam
Number of partitions for sentences of words (Default:
1
). -
def
onTrained(model: Word2VecModel, spark: SparkSession): Unit
- Definition Classes
- AnnotatorApproach
-
def
onWrite(path: String, spark: SparkSession): Unit
- Attributes
- protected
- Definition Classes
- ParamsAndFeaturesWritable
-
val
optionalInputAnnotatorTypes: Array[String]
- Definition Classes
- HasInputAnnotationCols
-
val
outputAnnotatorType: String
Output Annotator Types: WORD_EMBEDDINGS
Output Annotator Types: WORD_EMBEDDINGS
- Definition Classes
- Word2VecApproach → HasOutputAnnotatorType
-
final
val
outputCol: Param[String]
- Attributes
- protected
- Definition Classes
- HasOutputAnnotationCol
-
lazy val
params: Array[Param[_]]
- Definition Classes
- Params
-
def
save(path: String): Unit
- Definition Classes
- MLWritable
- Annotations
- @Since( "1.6.0" ) @throws( ... )
-
val
seed: IntParam
Random seed for shuffling the dataset (Default:
44
) -
def
set[T](param: ProtectedParam[T], value: T): Word2VecApproach.this.type
Sets the value for a protected Param.
Sets the value for a protected Param.
If the parameter was already set, it will not be set again. Default values do not count as a set value and can be overridden.
- T
Type of the parameter
- param
Protected parameter to set
- value
Value for the parameter
- returns
This object
- Definition Classes
- HasProtectedParams
-
def
set[T](feature: StructFeature[T], value: T): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[K, V](feature: MapFeature[K, V], value: Map[K, V]): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[T](feature: SetFeature[T], value: Set[T]): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[T](feature: ArrayFeature[T], value: Array[T]): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
set(paramPair: ParamPair[_]): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
set(param: String, value: Any): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
set[T](param: Param[T], value: T): Word2VecApproach.this.type
- Definition Classes
- Params
-
def
setDefault[T](feature: StructFeature[T], value: () ⇒ T): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[K, V](feature: MapFeature[K, V], value: () ⇒ Map[K, V]): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[T](feature: SetFeature[T], value: () ⇒ Set[T]): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[T](feature: ArrayFeature[T], value: () ⇒ Array[T]): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
setDefault(paramPairs: ParamPair[_]*): Word2VecApproach.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
setDefault[T](param: Param[T], value: T): Word2VecApproach.this.type
- Attributes
- protected[org.apache.spark.ml]
- Definition Classes
- Params
-
def
setEnableCaching(value: Boolean): Word2VecApproach.this.type
- Definition Classes
- HasEnableCachingProperties
-
final
def
setInputCols(value: String*): Word2VecApproach.this.type
- Definition Classes
- HasInputAnnotationCols
-
def
setInputCols(value: Array[String]): Word2VecApproach.this.type
Overrides required annotators column if different than default
Overrides required annotators column if different than default
- Definition Classes
- HasInputAnnotationCols
-
def
setLazyAnnotator(value: Boolean): Word2VecApproach.this.type
- Definition Classes
- CanBeLazy
- def setMaxIter(value: Int): Word2VecApproach.this.type
- def setMaxSentenceLength(value: Int): Word2VecApproach.this.type
- def setMinCount(value: Int): Word2VecApproach.this.type
- def setNumPartitions(value: Int): Word2VecApproach.this.type
-
final
def
setOutputCol(value: String): Word2VecApproach.this.type
Overrides annotation column name when transforming
Overrides annotation column name when transforming
- Definition Classes
- HasOutputAnnotationCol
- def setSeed(value: Int): Word2VecApproach.this.type
- def setStepSize(value: Double): Word2VecApproach.this.type
-
def
setStorageRef(value: String): Word2VecApproach.this.type
- Definition Classes
- HasStorageRef
- def setVectorSize(value: Int): Word2VecApproach.this.type
- def setWindowSize(value: Int): Word2VecApproach.this.type
-
val
stepSize: DoubleParam
Param for Step size to be used for each iteration of optimization (> 0) (Default:
0.025
). -
val
storageRef: Param[String]
Unique identifier for storage (Default:
this.uid
)Unique identifier for storage (Default:
this.uid
)- Definition Classes
- HasStorageRef
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- Identifiable → AnyRef → Any
-
def
train(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): Word2VecModel
- Definition Classes
- Word2VecApproach → AnnotatorApproach
-
final
def
transformSchema(schema: StructType): StructType
requirement for pipeline transformation validation.
requirement for pipeline transformation validation. It is called on fit()
- Definition Classes
- AnnotatorApproach → PipelineStage
-
def
transformSchema(schema: StructType, logging: Boolean): StructType
- Attributes
- protected
- Definition Classes
- PipelineStage
- Annotations
- @DeveloperApi()
-
val
uid: String
- Definition Classes
- Word2VecApproach → Identifiable
-
def
validate(schema: StructType): Boolean
takes a Dataset and checks to see if all the required annotation types are present.
takes a Dataset and checks to see if all the required annotation types are present.
- schema
to be validated
- returns
True if all the required types are present, else false
- Attributes
- protected
- Definition Classes
- AnnotatorApproach
-
def
validateStorageRef(dataset: Dataset[_], inputCols: Array[String], annotatorType: String): Unit
- Definition Classes
- HasStorageRef
-
val
vectorSize: ProtectedParam[Int]
The dimension of the code that you want to transform from words (Default:
100
). -
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
val
windowSize: IntParam
The window size (context words from [-window, window]) (Default:
5
) -
def
write: MLWriter
- Definition Classes
- ParamsAndFeaturesWritable → DefaultParamsWritable → MLWritable
Inherited from HasProtectedParams
Inherited from HasEnableCachingProperties
Inherited from HasStorageRef
Inherited from ParamsAndFeaturesWritable
Inherited from HasFeatures
Inherited from AnnotatorApproach[Word2VecModel]
Inherited from CanBeLazy
Inherited from DefaultParamsWritable
Inherited from MLWritable
Inherited from HasOutputAnnotatorType
Inherited from HasOutputAnnotationCol
Inherited from HasInputAnnotationCols
Inherited from Estimator[Word2VecModel]
Inherited from PipelineStage
Inherited from Logging
Inherited from Params
Inherited from Serializable
Inherited from Serializable
Inherited from Identifiable
Inherited from AnyRef
Inherited from Any
Parameters
A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
Annotator types
Required input and expected output annotator types