Description
This model is trained using Word2Vec approach on a corpora of 140 Million tokens, has a vocabulary of 100k unique tokens, and gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words.
These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification.
How to use
...
embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['مجھے سپارک این ایل پی پسند ہے۔']], ["text"]))
val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("مجھے سپارک این ایل پی پسند ہے۔").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["مجھے سپارک این ایل پی پسند ہے۔"]
urduvec_df = nlu.load('ur.embed.urdu_vec_140M_300d').predict(text, output_level="token")
urduvec_df
Results
The model gives 300 dimensional Word2Vec feature vector outputs per token.
|Embeddings vector | Tokens
|----------------------------------------------------|---------
| [0.15994004905223846, -0.2213257998228073, 0.0... | مجھے
| [-0.16085924208164215, -0.12259697169065475, -... | سپارک
| [-0.07977486401796341, -0.528775691986084, 0.3... | این
| [-0.24136857688426971, -0.15272589027881622, 0... | ایل
| [-0.23666366934776306, -0.16016320884227753, 0... | پی
| [0.07911433279514313, 0.05598200485110283, 0.0... | پسند
Model Information
Model Name: | urduvec_140M_300d |
Type: | embeddings |
Compatibility: | Spark NLP 2.7.0+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [document, token] |
Output Labels: | [word_embeddings] |
Language: | ur |
Case sensitive: | false |
Dimension: | 300 |
Data Source
The model is imported from http://www.lrec-conf.org/proceedings/lrec2018/pdf/148.pdf