General Text Embeddings (GTE) English (gte_large)

Description

General Text Embeddings (GTE) model. Towards General Text Embeddings with Multi-stage Contrastive Learning

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer three different sizes of models, including GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including information retrieval, semantic textual similarity, text reranking, etc.

Model Name Model Size (GB) Dimension Sequence Length Average (56) Clustering (11) Pair Classification (3) Reranking (4) Retrieval (15) STS (10) Summarization (1) Classification (12)
gte-large 0.67 1024 512 63.13 46.84 85.00 59.13 52.22 83.35 31.66 73.33
gte-base 0.22 768 512 62.39 46.2 84.57 58.61 51.14 82.3 31.17 73.01
e5-large-v2 1.34 1024 512 62.25 44.49 86.03 56.61 50.56 82.05 30.19 75.24
e5-base-v2 0.44 768 512 61.5 43.80 85.73 55.91 50.29 81.05 30.28 73.84
gte-small 0.07 384 512 61.36 44.89 83.54 57.7 49.46 82.07 30.42 72.31
text-embedding-ada-002 - 1536 8192 60.99 45.9 84.89 56.32 49.25 80.97 30.8 70.93
e5-small-v2 0.13 384 512 59.93 39.92 84.67 54.32 49.04 80.39 31.16 72.94
sentence-t5-xxl 9.73 768 512 59.51 43.72 85.06 56.42 42.24 82.63 30.08 73.42
all-mpnet-base-v2 0.44 768 514 57.78 43.69 83.04 59.36 43.81 80.28 27.49 65.07
sgpt-bloom-7b1-msmarco 28.27 4096 2048 57.59 38.93 81.9 55.65 48.22 77.74 33.6 66.19
all-MiniLM-L12-v2 0.13 384 512 56.53 41.81 82.41 58.44 42.69 79.8 27.9 63.21
all-MiniLM-L6-v2 0.09 384 512 56.26 42.35 82.37 58.04 41.95 78.9 30.81 63.05
contriever-base-msmarco 0.44 768 512 56.00 41.1 82.54 53.14 41.88 76.51 30.36 66.68
sentence-t5-base 0.22 768 512 55.27 40.21 85.18 53.09 33.63 81.14 31.39 69.81

Download Copy S3 URI

How to use

                
document = DocumentAssembler()\ 
    .setInputCol("text")\ 
    .setOutputCol("document")

tokenizer = Tokenizer()\ 
    .setInputCols(["document"])\ 
    .setOutputCol("token") 

embeddings = BertEmbeddings.pretrained("gte_large", "en")\ 
    .setInputCols(["document", "token"])\ 
    .setOutputCol("embeddings")


val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer() 
    .setInputCols("document") 
    .setOutputCol("token")
    
val embeddings = BertEmbeddings.pretrained("gte_large", "en")
    .setInputCols("document", "token")
    .setOutputCol("embeddings")

Model Information

Model Name: gte_large
Compatibility: Spark NLP 5.0.2+
License: Open Source
Edition: Official
Input Labels: [document, token]
Output Labels: [embeddings]
Language: en
Size: 794.2 MB
Case sensitive: true

References

GTE models are from Dingkun Long