Description
Language-agnostic BERT sentence embedding model supporting 109 languages:

The language-agnostic BERT sentence embedding (LaBSE) model encodes text into high-dimensional vectors. The model is trained and optimized to produce similar representations exclusively for bilingual sentence pairs that are translations of each other, so it can be used to mine for translations of a sentence in a larger corpus.
The details are described in the paper “Language-agnostic BERT Sentence Embedding” (July 2020).
How to use
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, BertSentenceEmbeddings
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
    .setInputCols("sentence") \
    .setOutputCol("sentence_embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I love NLP"], ["Many thanks"]], ["text"]))
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, BertSentenceEmbeddings}
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document_assembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence_detector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")

val embeddings = BertSentenceEmbeddings.pretrained("labse", "xx")
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I love NLP", "Many thanks").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["I love NLP", "Many thanks"]
embeddings_df = nlu.load('xx.embed_sentence.labse').predict(text, output_level='sentence')
embeddings_df
Results
| sentence    | xx_embed_sentence_labse_embeddings                  |
|-------------|-----------------------------------------------------|
| I love NLP  | [-0.060951583087444305, -0.011645414866507053,...   |
| Many thanks | [0.002173778833821416, -0.05513454228639603, 0...   |
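Because translation pairs are mapped to nearby vectors, translation mining reduces to nearest-neighbor search over the sentence embeddings by cosine similarity. A minimal sketch with NumPy (the 3-dimensional vectors below are illustrative placeholders, not real 768-dimensional LaBSE outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Placeholder vectors standing in for LaBSE sentence embeddings.
queries = np.array([[0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.2]])
corpus = np.array([[0.0, 0.9, 0.3],   # close to queries[1]
                   [1.0, 0.0, 0.1],   # close to queries[0]
                   [0.3, 0.3, 0.9]])

sims = cosine_similarity(queries, corpus)           # shape (2, 3)
best = sims.argmax(axis=1)  # best translation candidate per query
print(best)  # [1 0]
```

In practice, the rows of `queries` and `corpus` would come from the `sentence_embeddings` column produced by the pipeline above, and for large corpora an approximate nearest-neighbor index would replace the dense similarity matrix.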
Model Information
| Model Name:     | labse                 |
| Type:           | embeddings            |
| Compatibility:  | Spark NLP 2.6.0+      |
| License:        | Open Source           |
| Edition:        | Official              |
| Input Labels:   | [sentence]            |
| Output Labels:  | [sentence_embeddings] |
| Language:       | [xx]                  |
| Dimension:      | 768                   |
| Case sensitive: | true                  |
Data Source
The model is imported from https://tfhub.dev/google/LaBSE/1