sparknlp.annotator.embeddings.word_embeddings
Contains classes for WordEmbeddings.
Module Contents
Classes
| WordEmbeddings      | Word Embeddings lookup annotator that maps tokens to vectors. |
| WordEmbeddingsModel | Word Embeddings lookup annotator that maps tokens to vectors. |
- class WordEmbeddings
Word Embeddings lookup annotator that maps tokens to vectors.
For instantiated/pretrained models, see WordEmbeddingsModel.

A custom token lookup dictionary for embeddings can be set with setStoragePath(). Each line of the provided file needs to have a token, followed by its vector representation, delimited by spaces:

...
are 0.39658191506190343 0.630968081620067 0.5393722253731201 0.8428180123359783
were 0.7535235923631415 0.9699218875629833 0.10397182122983872 0.11833962569383116
stress 0.0492683418305907 0.9415954572751959 0.47624463167525755 0.16790967216778263
induced 0.1535748762292387 0.33498936903209897 0.9235178224122094 0.1158772920395934
...
If a token is not found in the dictionary, the result will be a zero vector of the same dimension. Statistics about the rate of converted tokens can be retrieved with WordEmbeddingsModel.withCoverageColumn() and WordEmbeddingsModel.overallCoverage().

For extended examples of usage, see the Examples.
| Input Annotation types | Output Annotation type |
| DOCUMENT, TOKEN        | WORD_EMBEDDINGS        |
- Parameters:
- writeBufferSize
Buffer size limit before dumping to disk storage while writing, by default 10000
- readCacheSize
Cache size for items retrieved from storage. Increasing it improves performance at the cost of higher memory consumption
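Both parameters can be tuned with their setters before fitting. A minimal sketch, assuming the usual set&lt;Param&gt; setter convention (setWriteBufferSize, setReadCacheSize); the values are illustrative, not recommendations:

>>> embeddings = WordEmbeddings() \
...     .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT) \
...     .setStorageRef("glove_4d") \
...     .setDimension(4) \
...     .setWriteBufferSize(20000) \
...     .setReadCacheSize(50000) \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")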
See also
SentenceEmbeddings
to combine embeddings into a sentence-level representation
Examples
In this example, the file random_embeddings_dim4.txt has the form of the content above.

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddings() \
...     .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT) \
...     .setStorageRef("glove_4d") \
...     .setDimension(4) \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         tokenizer,
...         embeddings,
...         embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["The patient was diagnosed with diabetes."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(truncate=False)
+----------------------------------------------------------------------------------+
|result                                                                            |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316]     |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307]    |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048]    |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149]    |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938]    |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863]     |
+----------------------------------------------------------------------------------+
- class WordEmbeddingsModel(classname='com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel', java_model=None)
Word Embeddings lookup annotator that maps tokens to vectors.

This is the instantiated model of WordEmbeddings.

Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
The default model is "glove_100d", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples.
| Input Annotation types | Output Annotation type |
| DOCUMENT, TOKEN        | WORD_EMBEDDINGS        |
- Parameters:
- dimension
Number of embedding dimensions
- readCacheSize
Cache size for items retrieved from storage. Increasing it improves performance at the cost of higher memory consumption
See also
SentenceEmbeddings
to combine embeddings into a sentence-level representation
Notes
There are also two convenience functions to retrieve the embeddings coverage with respect to the transformed dataset:

- withCoverageColumn(): Adds a custom column with word coverage stats for the embedded field. This creates a new column with statistics for each row.
- overallCoverage(): Calculates overall word coverage for the whole data in the embedded field. This returns a single coverage object considering all rows in the field.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         tokenizer,
...         embeddings,
...         embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.570580005645752,0.44183000922203064,0.7010200023651123,-0.417129993438720...|
|[-0.542639970779419,0.4147599935531616,1.0321999788284302,-0.4024400115013122...|
|[-0.2708599865436554,0.04400600120425224,-0.020260000601410866,-0.17395000159...|
|[0.6191999912261963,0.14650000631809235,-0.08592499792575836,-0.2629800140857...|
|[-0.3397899866104126,0.20940999686717987,0.46347999572753906,-0.6479200124740...|
+--------------------------------------------------------------------------------+
- setReadCacheSize(v)
Sets the cache size for items retrieved from storage. Increasing it improves performance at the cost of higher memory consumption.
- Parameters:
- v : int
Cache size for items retrieved from storage
- static overallCoverage(dataset, embeddings_col)
Calculates overall word coverage for the whole data in the embedded field.
This returns a single coverage object considering all rows in the field.
- Parameters:
- dataset : pyspark.sql.DataFrame
The dataset with the embeddings column
- embeddings_col : str
Name of the embeddings column
- Returns:
- CoverageResult
CoverageResult object with extracted information
Examples
>>> wordsOverallCoverage = WordEmbeddingsModel.overallCoverage(
...     resultDF, "embeddings"
... ).percentage
1.0
- static withCoverageColumn(dataset, embeddings_col, output_col='coverage')
Adds a custom column with word coverage stats for the embedded field. This creates a new column with statistics for each row.
- Parameters:
- dataset : pyspark.sql.DataFrame
The dataset with the embeddings column
- embeddings_col : str
Name of the embeddings column
- output_col : str, optional
Name for the resulting column, by default 'coverage'
- Returns:
- pyspark.sql.DataFrame
DataFrame with the calculated coverage
Examples
>>> wordsCoverage = WordEmbeddingsModel.withCoverageColumn(resultDF, "embeddings", "cov_embeddings")
>>> wordsCoverage.select("text", "cov_embeddings").show(truncate=False)
+-------------------+--------------+
|text               |cov_embeddings|
+-------------------+--------------+
|This is a sentence.|[5, 5, 1.0]   |
+-------------------+--------------+
- static pretrained(name='glove_100d', lang='en', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "glove_100d"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- WordEmbeddingsModel
The restored model
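For example, the default English GloVe model can also be requested explicitly by name and language:

>>> embeddings = WordEmbeddingsModel.pretrained("glove_100d", lang="en") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")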
- static loadStorage(path, spark, storage_ref)
Loads the model from storage.
- Parameters:
- path : str
Path to the model
- spark : pyspark.sql.SparkSession
The current SparkSession
- storage_ref : str
Identifiers for the model parameters
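A minimal, hypothetical invocation; the path is a placeholder, and the storage_ref must match the storageRef of the stage that originally wrote the storage:

>>> WordEmbeddingsModel.loadStorage(
...     "/tmp/glove_4d_storage",  # hypothetical path to previously saved storage
...     spark,
...     "glove_4d"  # must match the storageRef used when the storage was written
... )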