Nomic Embed - Open Source Embeddings

Fully Open Source Embedding Models for AI Processing & RAG

Basic Information

Product Description

Nomic Embed is a series of fully open-source embedding models from Nomic AI, built around the core principle of full reproducibility: model weights, training data, and training code are all released publicly so that results can be independently verified and reproduced. The latest Nomic Embed Text V2 is the first general-purpose text embedding model to use a Mixture of Experts (MoE) architecture, trained on 1.6 billion contrastive pairs spanning approximately 100 languages.

Core Features/Characteristics

Nomic Embed Text V2 (MoE)

  • MoE Architecture: First MoE general-purpose text embedding model
  • Efficient Inference: Alternating MoE layers with top-2 routing over 8 experts, activating only 305M of the model's 475M parameters during inference
  • Multilingual: Supports approximately 100 languages
  • Matryoshka Representation: Supports dimensionality reduction from 768 to 256
  • Strong Performance: Excels on BEIR and MIRACL benchmarks, comparable to models with twice the parameters
  • Fully Open Source: Weights, training data, and training code are all publicly available
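To make the top-2-of-8 routing concrete, here is a minimal sketch of how such a gate selects experts for one token. All names and shapes are illustrative assumptions for exposition, not Nomic's actual implementation.

```python
import numpy as np

def top2_moe_route(token_hidden, gate_weights, num_experts=8, top_k=2):
    """Sketch of top-k gating: score all experts, keep the top 2, renormalize.

    Only the selected experts run, which is why just ~64% of parameters
    are active per token in a model like Nomic Embed Text V2.
    """
    logits = token_hidden @ gate_weights           # (num_experts,) routing scores
    top_idx = np.argsort(logits)[-top_k:]          # indices of the 2 best experts
    top_logits = logits[top_idx]
    probs = np.exp(top_logits - top_logits.max())  # softmax over selected experts
    probs /= probs.sum()
    weights = np.zeros(num_experts)
    weights[top_idx] = probs                       # sparse: only 2 nonzero entries
    return weights

rng = np.random.default_rng(0)
hidden = rng.normal(size=64)        # toy hidden state
gates = rng.normal(size=(64, 8))    # toy router matrix
w = top2_moe_route(hidden, gates)   # mixing weights; exactly 2 are nonzero
```

The token's output would then be the weighted sum of the two selected experts' outputs; the other six experts are skipped entirely.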

Nomic Embed Text V1

  • First Fully Reproducible: The first fully reproducible open-source English embedding model with an 8192 context length
  • Surpasses OpenAI: Outperformed OpenAI's text-embedding-ada-002 and text-embedding-3-small at the time of release
  • 8192 Context Length: Supports long documents

Technical Specifications

Feature                 V2 (MoE)                 V1
Parameters              475M (305M activated)    137M
Max Dimensions          768                      768
Trimmable Dimensions    256                      256
Context Length          8192                     8192
Language Support        ~100 languages           English
Architecture            MoE                      Dense
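The "trimmable dimensions" row refers to Matryoshka truncation: the first 256 of the 768 components can be kept and re-normalized to get a smaller, still-useful embedding. A minimal sketch with a random stand-in vector (the real vectors would come from the model):

```python
import numpy as np

def truncate_matryoshka(embedding, dim=256):
    """Keep the first `dim` components and re-normalize to unit length."""
    v = np.asarray(embedding, dtype=np.float64)[:dim]
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
full = rng.normal(size=768)
full /= np.linalg.norm(full)       # stand-in for a 768-dim unit embedding
small = truncate_matryoshka(full)  # 256-dim, unit length again
```

Because Matryoshka-trained models concentrate the most important information in the leading dimensions, the truncated vector preserves most retrieval quality at a third of the storage cost.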

Business Model

  • Fully Free and Open Source: Apache License 2.0
  • Nomic Atlas: Visualization and data exploration platform (commercial product)
  • Ollama Integration: Can be run locally directly via Ollama
  • Free Deployment: No commercial usage restrictions

Target Users

  • Open-source community developers
  • Researchers requiring full transparency and reproducibility
  • Privacy-sensitive scenarios requiring local deployment
  • Small teams with limited resources but in need of high-quality embeddings
  • Developers of multilingual retrieval systems

Competitive Advantages

  • One of very few high-quality embedding models that are fully open source (weights + data + code)
  • Innovative MoE architecture with efficient inference (only ~64% of parameters activated)
  • Apache 2.0 license with no commercial restrictions
  • Performance comparable to models with twice the parameters
  • Supports one-click local running via Ollama
  • Multilingual capabilities across approximately 100 languages
  • Reproducible training process ensures transparency

Limitations

  • Still lags behind commercial models (Cohere, OpenAI) on some benchmarks
  • V1 only supports English
  • Lacks commercial-grade API and technical support
  • Smaller community compared to models like BGE

Relationship with the OpenClaw Ecosystem

Nomic Embed is an ideal choice for OpenClaw's local embeddings. Its fully open-source philosophy aligns with OpenClaw's open-source positioning. Via Ollama it runs locally with a single command, requiring no API calls or network connection, which keeps user data private. The MoE architecture's efficient inference makes it practical on personal devices, and the Apache 2.0 license ensures that OpenClaw can freely integrate and redistribute it.
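As a sketch of what the local integration looks like, the snippet below builds the request body for Ollama's embeddings endpoint. The endpoint path and field names follow Ollama's documented REST API but may differ across Ollama versions; the helper function name is hypothetical.

```python
import json

# Default local Ollama endpoint; requests never leave the machine.
OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_embedding_request(text, model="nomic-embed-text"):
    """Build the URL and JSON body for an Ollama embedding request."""
    return OLLAMA_URL, json.dumps({"model": model, "prompt": text})

url, body = build_embedding_request("What is retrieval-augmented generation?")
# POST `body` to `url` with any HTTP client; Ollama's response JSON
# contains an "embedding" field holding the vector.
```

After `ollama pull nomic-embed-text`, this is all a client needs; no API key or external service is involved.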
