Description
This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search.
The model is intended as an encoder for sentences and short paragraphs. Given an input text, it outputs a vector that captures the text's semantic content. That vector can then be used for information retrieval, clustering, or sentence-similarity tasks.
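As an illustration of the sentence-similarity use case, the sketch below compares vectors with cosine similarity, a common (but not the only) choice for sentence embeddings. The function name `cosine_similarity` and the 4-dimensional toy vectors are assumptions made for this example; real embeddings from this model would have 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional stand-ins for real 768-dimensional embeddings.
vec_a = [1.0, 0.0, 1.0, 0.0]
vec_b = [1.0, 0.0, 1.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 1.0, 0.0, 1.0]  # orthogonal to vec_a

sim_same = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)
print(sim_same, sim_orthogonal)
```

Scores near 1 indicate vectors pointing in nearly the same direction, and scores near 0 indicate unrelated directions.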
By default, input text longer than 384 word pieces is truncated.
How to use
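# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import MPNetEmbeddings
from pyspark.ml import Pipeline
# Start (or reuse) a Spark session with Spark NLP available.
spark = sparknlp.start()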
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
mpnet_loaded = MPNetEmbeddings.load("all_mpnet_base_v2_openvino")\
.setInputCols(["document"])\
.setOutputCol("mpnet_embeddings")
pipeline = Pipeline(
stages = [
document_assembler,
mpnet_loaded
])
data = spark.createDataFrame([
['William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist.']
]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(mpnet_embeddings.embeddings) as embeddings").show()
// Scala
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.explode
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val mpnetEmbeddings = MPNetEmbeddings.load("all_mpnet_base_v2_openvino")
.setInputCols("document")
.setOutputCol("mpnet_embeddings")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
mpnetEmbeddings
))
val data = Seq(
"William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist."
).toDF("text")
val model = pipeline.fit(data)
val result = model.transform(data)
result.select(explode($"mpnet_embeddings.embeddings").alias("embeddings")).show(false)
Results
+--------------------+
| embeddings|
+--------------------+
|[-0.020282388, 0....|
+--------------------+
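Once vectors like the one in the table have been collected, a semantic-search step could be sketched as follows: normalize each vector to unit length, then rank documents by dot product with the query. The helper names `l2_normalize` and `top_k`, the document ids, and the 3-dimensional toy vectors are all hypothetical and only stand in for real 768-dimensional embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def top_k(query, corpus, k=2):
    """Return the k (score, doc_id) pairs with the highest dot product
    between the unit-length query and each unit-length document vector."""
    q = l2_normalize(query)
    scored = []
    for doc_id, vec in corpus.items():
        d = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    scored.sort(reverse=True)  # highest score first
    return scored[:k]

# Hypothetical toy vectors; real ones would come from the embeddings column.
corpus = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query = [1.0, 0.0, 0.0]

ranking = top_k(query, corpus, k=2)
print([doc_id for _, doc_id in ranking])
```

With these toy vectors the ranking is `doc_a` followed by `doc_c`. For large corpora an approximate nearest-neighbor index would usually replace the linear scan shown here.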
Model Information
| Model Name: | all_mpnet_base_v2_openvino |
| Compatibility: | Spark NLP 6.0.0+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [document] |
| Output Labels: | [mpnet_embeddings] |
| Language: | en |
| Size: | 406.5 MB |