BGE (BAAI) - Chinese Embedding Model
Basic Information
- Product Name: BGE (BAAI General Embedding)
- Developer: BAAI (Beijing Academy of Artificial Intelligence)
- Country/Region: China (Beijing)
- Official Website: https://bge.baai.ac.cn/
- GitHub: https://github.com/FlagOpen/FlagEmbedding
- Hugging Face: https://huggingface.co/BAAI
- Type: Open-source embedding model series
- License: MIT
- Latest Flagship: BGE-M3
Product Description
BGE (BAAI General Embedding) is an open-source embedding model series developed by the Beijing Academy of Artificial Intelligence and a leader among Chinese embedding models. Its flagship, BGE-M3, is the first embedding model to support three retrieval modes simultaneously (dense, multi-vector, and sparse retrieval); it covers 100+ languages and accepts inputs of up to 8192 tokens, and it has achieved state-of-the-art results on the multilingual MIRACL and cross-lingual MKQA benchmarks.
Core Features
- BGE-M3 (Flagship Model):
- Three-in-one retrieval: dense retrieval + multi-vector retrieval + sparse retrieval
- Support for 100+ languages
- Maximum input length of 8192 tokens
- Based on XLM-RoBERTa architecture
- SOTA in MIRACL and MKQA benchmarks
- Trained via self-knowledge distillation
- BGE-VL (March 2025): Multimodal embedding model supporting visual search
- BGE Series: bge-large-zh, bge-base-zh, bge-small-zh, and other Chinese-specific models
- FlagEmbedding Toolkit: Comprehensive retrieval and retrieval-augmented LLM tools
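The practical payoff of BGE-M3's three-in-one design is hybrid retrieval: dense vectors capture semantics while sparse token weights capture exact lexical matches, and the two scores are fused. Below is a minimal, model-free sketch of that fusion step; the toy vectors, token weights, and the 0.6/0.4 weighting are illustrative assumptions (in practice the representations come from the model, e.g. via the FlagEmbedding toolkit):

```python
import math

def dense_score(q, d):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def sparse_score(q, d):
    # Dot product over sparse token->weight maps (lexical overlap).
    return sum(w * d[t] for t, w in q.items() if t in d)

def hybrid_score(qd, dd, qs, ds, w_dense=0.6, w_sparse=0.4):
    # Weighted fusion of the semantic (dense) and lexical (sparse) signals.
    return w_dense * dense_score(qd, dd) + w_sparse * sparse_score(qs, ds)

# Toy query and document representations (real ones come from BGE-M3).
q_dense, q_sparse = [0.1, 0.9], {"beijing": 0.8, "embedding": 0.5}
doc_a = ([0.2, 0.8], {"embedding": 0.7})  # semantically and lexically close
doc_b = ([0.9, 0.1], {"weather": 0.9})    # unrelated on both signals

score_a = hybrid_score(q_dense, doc_a[0], q_sparse, doc_a[1])
score_b = hybrid_score(q_dense, doc_b[0], q_sparse, doc_b[1])
print(score_a > score_b)  # doc_a should rank higher
```

The weights let a deployment tune how much exact keyword matching matters relative to semantic similarity, which is the flexibility the three-in-one design is meant to provide.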
Business Model
- Completely Open Source and Free: MIT license
- Multi-platform Availability: Hugging Face, Ollama, NVIDIA NIM, DeepInfra, etc.
- No API Fees: Can be run locally or on your own infrastructure for free
Target Users
- Developers of Chinese NLP and RAG applications
- Builders of multilingual search systems
- Teams requiring locally deployed embedding models
- Academic researchers and open-source communities
- Technical teams pursuing hybrid retrieval (dense + sparse)
Competitive Advantages
- Industry-leading Chinese embedding quality
- Unique three-in-one retrieval capability (dense + multi-vector + sparse)
- Support for 100+ languages, strong multilingual and cross-lingual capabilities
- Completely open source and free, with an active community
- Supports ultra-long inputs (8192 tokens)
- Easy local deployment via Ollama
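The Ollama route mentioned above can be sketched in two commands, assuming Ollama is installed and the model is published in the Ollama library under the name `bge-m3` (the example input string is arbitrary):

```shell
# Pull the model from the Ollama library (model name assumed to be bge-m3).
ollama pull bge-m3

# Request an embedding through Ollama's local REST API.
curl http://localhost:11434/api/embed \
  -d '{"model": "bge-m3", "input": "北京智源人工智能研究院"}'
```

The response is a JSON document containing the embedding vector, which can then be stored in any vector database without any external API calls or fees.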
Model Matrix
| Model | Parameters | Dimensions | Language | Features |
|---|---|---|---|---|
| BGE-M3 | ~568M | 1024 (dense) | 100+ | Three-in-one retrieval flagship |
| BGE-VL | - | - | Multilingual | Multimodal visual search |
| bge-large-zh | 326M | 1024 | Chinese | Large Chinese-specific model |
| bge-base-zh | 102M | 768 | Chinese | Balanced Chinese model |
| bge-small-zh | 24M | 512 | Chinese | Lightweight Chinese model |
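The trade-off in the matrix above (parameters vs. embedding dimension) can be encoded as a small lookup, for example to pick the lightest Chinese model that still meets a required output dimension. The figures are taken from the table; the helper itself is an illustrative sketch, not part of the FlagEmbedding API:

```python
# Chinese-specific model stats from the matrix: name -> (params in millions, dims).
BGE_ZH_MODELS = {
    "bge-large-zh": (326, 1024),
    "bge-base-zh": (102, 768),
    "bge-small-zh": (24, 512),
}

def pick_model(min_dims):
    # Return the smallest (fewest-parameter) model whose dimension meets min_dims.
    candidates = [(params, name)
                  for name, (params, dims) in BGE_ZH_MODELS.items()
                  if dims >= min_dims]
    return min(candidates)[1] if candidates else None

print(pick_model(512))   # smallest model with >= 512 dims
print(pick_model(1024))  # only the large model qualifies
```

A larger dimension generally costs more storage per vector in the index, so teams often start with the smallest model that meets their quality bar and scale up only if retrieval accuracy demands it.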
Relationship with the OpenClaw Ecosystem
BGE is the preferred embedding model for Chinese RAG capabilities within the OpenClaw ecosystem. For Chinese users, BGE-M3's embedding quality and 100+ language coverage let OpenClaw agents handle Chinese document retrieval with high quality. BGE's open-source license and local-deployment options also align with OpenClaw's privacy-focused design philosophy, and its three-in-one retrieval gives OpenClaw more flexible retrieval strategy options.
External References
Learn more from these authoritative sources: