BGE (BAAI) - Chinese Embedding Model
Basic Information
- Product Name: BGE (BAAI General Embedding)
- Developer: BAAI (Beijing Academy of Artificial Intelligence)
- Country/Region: China (Beijing)
- Official Website: https://bge.baai.ac.cn/
- GitHub: https://github.com/FlagOpen/FlagEmbedding
- Hugging Face: https://huggingface.co/BAAI
- Type: Open-source embedding model series
- License: MIT
- Latest Flagship: BGE-M3
Product Description
BGE (BAAI General Embedding) is an open-source embedding model series developed by the Beijing Academy of Artificial Intelligence and a leader among Chinese embedding models. Its flagship, BGE-M3, is the first embedding model to support three retrieval modes simultaneously (dense, multi-vector, and sparse retrieval); it covers 100+ languages and accepts inputs of up to 8192 tokens, and it has achieved state-of-the-art results on the multilingual MIRACL and cross-lingual MKQA benchmarks.
Core Features
- BGE-M3 (Flagship Model):
- Three-in-one retrieval: dense retrieval + multi-vector retrieval + sparse retrieval
- Support for 100+ languages
- Maximum input length of 8192 tokens
- Based on XLM-RoBERTa architecture
- SOTA in MIRACL and MKQA benchmarks
- Trained via self-knowledge distillation
- BGE-VL (March 2025): Multimodal embedding model supporting visual search
- BGE Series: bge-large-zh, bge-base-zh, bge-small-zh, and other Chinese-specific models
- FlagEmbedding Toolkit: Comprehensive retrieval and retrieval-augmented LLM tools
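The practical payoff of BGE-M3's three-in-one design is hybrid retrieval: dense vectors capture semantics while sparse token weights capture exact lexical matches, and the two scores are fused. Below is a minimal, model-free sketch of that fusion step; the toy vectors, token weights, and the 0.6/0.4 weighting are illustrative assumptions (in practice the representations come from the model, e.g. via the FlagEmbedding toolkit):

```python
import math

def dense_score(q, d):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def sparse_score(q, d):
    # Dot product over sparse token->weight maps (lexical overlap).
    return sum(w * d[t] for t, w in q.items() if t in d)

def hybrid_score(qd, dd, qs, ds, w_dense=0.6, w_sparse=0.4):
    # Weighted fusion of the semantic (dense) and lexical (sparse) signals.
    return w_dense * dense_score(qd, dd) + w_sparse * sparse_score(qs, ds)

# Toy query and document representations (real ones come from BGE-M3).
q_dense, q_sparse = [0.1, 0.9], {"beijing": 0.8, "embedding": 0.5}
doc_a = ([0.2, 0.8], {"embedding": 0.7})  # semantically and lexically close
doc_b = ([0.9, 0.1], {"weather": 0.9})    # unrelated on both signals

score_a = hybrid_score(q_dense, doc_a[0], q_sparse, doc_a[1])
score_b = hybrid_score(q_dense, doc_b[0], q_sparse, doc_b[1])
print(score_a > score_b)  # doc_a should rank higher
```

The weights let a deployment tune how much exact keyword matching matters relative to semantic similarity, which is the flexibility the three-in-one design is meant to provide.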
Business Model
- Completely Open Source and Free: MIT license
- Multi-platform Availability: Hugging Face, Ollama, NVIDIA NIM, DeepInfra, etc.
- No API Fees: Can be run locally or on your own infrastructure for free
Target Users
- Developers of Chinese NLP and RAG applications
- Builders of multilingual search systems
- Teams requiring locally deployed embedding models
- Academic researchers and open-source communities
- Technical teams pursuing hybrid retrieval (dense + sparse)
Competitive Advantages
- Industry-leading Chinese embedding quality
- Unique three-in-one retrieval capability (dense + multi-vector + sparse)
- Support for 100+ languages, strong multilingual and cross-lingual capabilities
- Completely open source and free, with an active community
- Supports ultra-long inputs (8192 tokens)
- Easy local deployment via Ollama
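The Ollama route mentioned above can be sketched in two commands, assuming Ollama is installed and the model is published in the Ollama library under the name `bge-m3` (the example input string is arbitrary):

```shell
# Pull the model from the Ollama library (model name assumed to be bge-m3).
ollama pull bge-m3

# Request an embedding through Ollama's local REST API.
curl http://localhost:11434/api/embed \
  -d '{"model": "bge-m3", "input": "北京智源人工智能研究院"}'
```

The response is a JSON document containing the embedding vector, which can then be stored in any vector database without any external API calls or fees.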
Model Matrix
| Model | Parameters | Dimensions | Language | Features |
|---|---|---|---|---|
| BGE-M3 | ~568M | 1024 (dense) | 100+ | Three-in-one retrieval flagship |
| BGE-VL | - | - | Multilingual | Multimodal visual search |
| bge-large-zh | 326M | 1024 | Chinese | Large Chinese-specific model |
| bge-base-zh | 102M | 768 | Chinese | Balanced Chinese model |
| bge-small-zh | 24M | 512 | Chinese | Lightweight Chinese model |
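The trade-off in the matrix above (parameters vs. embedding dimension) can be encoded as a small lookup, for example to pick the lightest Chinese model that still meets a required output dimension. The figures are taken from the table; the helper itself is an illustrative sketch, not part of the FlagEmbedding API:

```python
# Chinese-specific model stats from the matrix: name -> (params in millions, dims).
BGE_ZH_MODELS = {
    "bge-large-zh": (326, 1024),
    "bge-base-zh": (102, 768),
    "bge-small-zh": (24, 512),
}

def pick_model(min_dims):
    # Return the smallest (fewest-parameter) model whose dimension meets min_dims.
    candidates = [(params, name)
                  for name, (params, dims) in BGE_ZH_MODELS.items()
                  if dims >= min_dims]
    return min(candidates)[1] if candidates else None

print(pick_model(512))   # smallest model with >= 512 dims
print(pick_model(1024))  # only the large model qualifies
```

A larger dimension generally costs more storage per vector in the index, so teams often start with the smallest model that meets their quality bar and scale up only if retrieval accuracy demands it.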
Relationship with the OpenClaw Ecosystem
BGE is the preferred embedding model for Chinese RAG capabilities within the OpenClaw ecosystem. For Chinese users, BGE-M3's embedding quality and 100+ language coverage let OpenClaw agents handle Chinese document retrieval with high quality. BGE's open-source license and local-deployment options also align with OpenClaw's privacy-focused design philosophy, and its three-in-one retrieval gives OpenClaw more flexible retrieval strategy options.
External References
Learn more from these authoritative sources: