Nomic Embed - Open Source Embeddings
Basic Information
- Company/Brand: Nomic AI
- Country/Region: USA (New York)
- Official Website: https://www.nomic.ai
- Hugging Face: https://huggingface.co/nomic-ai
- Type: Fully Open Source Embedding Model
- Latest Version: nomic-embed-text-v2-moe
- Open Source License: Apache License 2.0
Product Description
Nomic Embed is a series of fully open-source embedding models from Nomic AI, built around the principle of full reproducibility: model weights, training data, and training code are all released so that research results can be independently verified. The latest model, Nomic Embed Text V2, is the first general-purpose text embedding model to use a Mixture of Experts (MoE) architecture, trained on 1.6 billion contrastive pairs spanning approximately 100 languages.
Core Features/Characteristics
Nomic Embed Text V2 (MoE)
- MoE Architecture: First MoE general-purpose text embedding model
- Efficient Inference: Alternating MoE layers with top-2 routing over 8 experts, activating only 305M of its 475M parameters at inference time
- Multilingual: Supports approximately 100 languages
- Matryoshka Representation: Embeddings can be truncated from 768 down to 256 dimensions
- Strong Performance: Excels on BEIR and MIRACL benchmarks, comparable to models with twice the parameters
- Fully Open Source: Weights, training data, and training code are all publicly available
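The features above can be sketched in code. This is a minimal usage sketch, not an official snippet: it assumes the Hugging Face model id `nomic-ai/nomic-embed-text-v2-moe`, that the model loads via `sentence-transformers` with `trust_remote_code=True`, and that the `search_document:` / `search_query:` task prefixes documented for Nomic's v1.x models also apply to V2 (verify against the model card before use).

```python
import numpy as np

def with_prefix(texts, task="search_document"):
    """Prepend the task prefix Nomic Embed inputs are assumed to expect."""
    return [f"{task}: {t}" for t in texts]

def embed(texts, task="search_document", dim=768):
    # Heavy import kept local so with_prefix() is importable without the model.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(
        "nomic-ai/nomic-embed-text-v2-moe",  # assumed HF model id
        trust_remote_code=True,
    )
    emb = model.encode(with_prefix(texts, task))
    emb = emb[:, :dim]  # Matryoshka truncation: 768 -> e.g. 256
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)
```

Queries and documents get different prefixes (e.g. `embed(docs)` vs. `embed(queries, task="search_query")`), which is the usual pattern for retrieval-tuned embedding models.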
Nomic Embed Text V1
- First Fully Reproducible: The first fully reproducible open-source English embedding model with an 8192 context length
- Surpasses OpenAI: Outperformed OpenAI's text-embedding-ada-002 and text-embedding-3-small at the time of release
- 8192 Context Length: Supports long documents
Technical Specifications
| Feature | V2 (MoE) | V1 |
|---|---|---|
| Parameters | 475M (305M activated) | 137M |
| Max Dimensions | 768 | 768 |
| Matryoshka Dimensions | 256 | 256 |
| Context Length | 8192 | 8192 |
| Language Support | ~100 languages | English |
| Architecture | MoE | Dense |
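The Matryoshka dimensions in the table mean the leading coordinates of an embedding form a usable lower-dimensional embedding on their own. A minimal sketch of the truncate-and-renormalize step, using random unit vectors in place of real model output:

```python
import numpy as np

# Stand-in for real model output: two unit-norm 768-d embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

def truncate(emb, dim=256):
    """Keep the leading `dim` coordinates and re-normalize to unit length."""
    cut = emb[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

small = truncate(full)      # shape (2, 256), rows are unit vectors
cos = small[0] @ small[1]   # cosine similarity in the reduced space
```

Re-normalizing after the cut matters: cosine similarity assumes unit vectors, and the truncated slice of a unit vector is generally shorter than 1.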
Business Model
- Fully Free and Open Source: Apache License 2.0
- Nomic Atlas: Visualization and data exploration platform (commercial product)
- Ollama Integration: Can be run locally directly via Ollama
- Free Deployment: No commercial usage restrictions
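The Ollama integration mentioned above can be sketched against Ollama's local embeddings endpoint. Assumptions: a local Ollama server on its default port 11434, and the model tag `nomic-embed-text` (the V1.5 model in the Ollama library; a V2 tag may differ) already pulled via `ollama pull nomic-embed-text`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # default local port

def build_request(text, model="nomic-embed-text"):
    """JSON body for Ollama's embeddings endpoint."""
    return json.dumps({"model": model, "prompt": text}).encode()

def embed(text):
    # Requires a running Ollama server; no external API or network needed.
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

Because everything runs against localhost, no text ever leaves the machine, which is the privacy argument for this deployment style.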
Target Users
- Open-source community developers
- Researchers requiring full transparency and reproducibility
- Privacy-sensitive scenarios requiring local deployment
- Small teams with limited resources but in need of high-quality embeddings
- Developers of multilingual retrieval systems
Competitive Advantages
- One of the very few high-quality embedding models that is fully open source (weights, training data, and training code)
- Innovative MoE architecture with high inference efficiency (only ~64% of parameters activated)
- Apache 2.0 license with no commercial restrictions
- Performance comparable to models with twice the parameters
- Supports one-click local running via Ollama
- Multilingual capabilities across approximately 100 languages
- Reproducible training process ensures transparency
Limitations
- Still lags behind commercial models (Cohere, OpenAI) on some benchmarks
- V1 only supports English
- Lacks commercial-grade API and technical support
- Smaller community compared to models like BGE
Relationship with the OpenClaw Ecosystem
Nomic Embed is an ideal choice for OpenClaw's local embeddings. Its fully open-source philosophy aligns with OpenClaw's open-source positioning. Through Ollama it can be run locally with a single command, so no API calls or network connection are needed and user data never leaves the device. The MoE architecture's efficient inference makes it practical to run on personal hardware, and the Apache 2.0 license ensures OpenClaw can freely integrate and distribute it.