Description
FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks such as retrieval, classification, clustering, and semantic search. It can also be used in vector databases for LLMs.
bge is short for BAAI general embedding.
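To make the retrieval use case concrete, here is a minimal sketch of ranking passages by cosine similarity to a query vector. The three-dimensional vectors are toy values standing in for real embeddings, and the names `cosine_similarity` and `rank_passages` are hypothetical helpers, not part of FlagEmbedding or Spark NLP.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_id, score) pairs, most similar passage first."""
    scored = [
        (pid, cosine_similarity(query_vec, vec))
        for pid, vec in passage_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (hypothetical values, not real model output).
query = [1.0, 0.0, 1.0]
passages = {
    "p1": [0.9, 0.1, 0.8],
    "p2": [0.0, 1.0, 0.0],
    "p3": [0.5, 0.5, 0.5],
}

ranking = rank_passages(query, passages)
print([pid for pid, _ in ranking])  # ['p1', 'p3', 'p2']
```

With real embeddings the same ranking step applies; only the vector dimension changes.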
Model | Language | Description | query instruction for retrieval* |
---|---|---|---|
BAAI/bge-large-en | English | ranks 1st on the MTEB leaderboard | Represent this sentence for searching relevant passages: |
BAAI/bge-base-en | English | ranks 2nd on the MTEB leaderboard | Represent this sentence for searching relevant passages: |
BAAI/bge-small-en | English | a small-scale model but with competitive performance | Represent this sentence for searching relevant passages: |
BAAI/bge-large-zh | Chinese | ranks 1st on the C-MTEB benchmark | 为这个句子生成表示以用于检索相关文章: |
BAAI/bge-large-zh-noinstruct | Chinese | trained without an instruction; ranks 2nd on the C-MTEB benchmark | |
BAAI/bge-base-zh | Chinese | a base-scale model with ability similar to bge-large-zh | 为这个句子生成表示以用于检索相关文章: |
BAAI/bge-small-zh | Chinese | a small-scale model but with competitive performance | 为这个句子生成表示以用于检索相关文章: |
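The last column of the table lists a per-model instruction for retrieval queries. The sketch below shows one way to apply it, assuming the instruction is simply prepended to the query text and that passages are embedded unchanged; neither detail is stated in the table. `QUERY_INSTRUCTIONS` and `prepare_query` are hypothetical names, and the trailing space after the English instruction is also an assumption.

```python
# Instruction strings copied from the table; an empty string means
# the model takes no instruction. The trailing space after the English
# instruction is an assumption made so the query reads naturally.
EN_INSTRUCTION = "Represent this sentence for searching relevant passages: "
ZH_INSTRUCTION = "为这个句子生成表示以用于检索相关文章:"

QUERY_INSTRUCTIONS = {
    "BAAI/bge-large-en": EN_INSTRUCTION,
    "BAAI/bge-base-en": EN_INSTRUCTION,
    "BAAI/bge-small-en": EN_INSTRUCTION,
    "BAAI/bge-large-zh": ZH_INSTRUCTION,
    "BAAI/bge-large-zh-noinstruct": "",
    "BAAI/bge-base-zh": ZH_INSTRUCTION,
    "BAAI/bge-small-zh": ZH_INSTRUCTION,
}


def prepare_query(model_name, query):
    """Prefix a retrieval query with the instruction listed for the model.

    Unknown model names fall back to no instruction.
    """
    return QUERY_INSTRUCTIONS.get(model_name, "") + query


print(prepare_query("BAAI/bge-large-en", "what is a panda?"))
```

Only the query side is changed here; the text passed to the embedding model for passages stays as it is.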
How to use
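# Python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings

# Wrap the raw "text" column as a Spark NLP document annotation
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
# Split each document into tokens
tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")
# Load the pretrained bge_large model and compute embeddings for the tokens
embeddings = BertEmbeddings.pretrained("bge_large", "en")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")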
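// Scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings

// Wrap the raw "text" column as a Spark NLP document annotation
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
// Split each document into tokens
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
// Load the pretrained bge_large model and compute embeddings for the tokens
val embeddings = BertEmbeddings.pretrained("bge_large", "en")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")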
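The pipeline above takes document and token annotations as input, so its embeddings are produced per token. To obtain a single vector for a whole text, the token vectors have to be combined in some way. The sketch below uses a plain average, which is one common choice and an assumption here, not necessarily the pooling the model was trained with. `mean_pool` is a hypothetical helper and the vectors are toy values.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Toy two-dimensional token vectors (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens))  # [3.0, 4.0]
```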
Model Information
Model Name: | bge_large |
Compatibility: | Spark NLP 5.0.2+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [document, token] |
Output Labels: | [embeddings] |
Language: | en |
Size: | 794.2 MB |
Case sensitive: | true |
References
The bge models were released by BAAI (Beijing Academy of Artificial Intelligence).