INSTRUCTOR - Instruction-Tuned Text Embeddings
Basic Information
- Product Name: INSTRUCTOR
- Development Team: xlang-ai (The University of Hong Kong)
- Country/Region: China (Hong Kong)
- Official Website: https://instructor-embedding.github.io/
- GitHub: https://github.com/xlang-ai/instructor-embedding
- Paper: "One Embedder, Any Task: Instruction-Finetuned Text Embeddings" (Findings of ACL 2023)
- Type: Open-source instruction-tuned embedding model
- License: Apache-2.0
Product Description
INSTRUCTOR is a text embedding method that customizes embeddings through natural language instructions. Unlike traditional embedding models, which map a given text to the same vector regardless of how it will be used, INSTRUCTOR takes a task instruction (such as "generate vectors for information retrieval" or "generate vectors for sentiment classification") together with each input text. The resulting embedding is tailored to the stated task and domain without any additional fine-tuning. This is the design philosophy behind "one embedder, any task."
Core Features
- Instruction-Driven Embedding: Customize embedding behavior through natural language instructions
- Zero-Shot Task Adaptation: Adapt to different tasks and domains without additional training
- 330-Task Training: Trained on 330 different tasks using contrastive loss
- Extensive Evaluation: Tested on 70 embedding evaluation tasks (66 of which were unseen during training)
- Efficient Parameter Utilization: Uses an order of magnitude fewer parameters than the previous best model while reporting state-of-the-art results on those tasks
- Multi-Task Support: Classification, information retrieval, semantic similarity, text generation evaluation, etc.
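The contrastive objective mentioned in the list above can be illustrated with a minimal sketch. It assumes an in-batch, softmax-style (InfoNCE-like) loss over cosine similarities with a temperature; the temperature value and the toy vectors are illustrative assumptions, not the settings used to train INSTRUCTOR.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive among all candidates.

    The loss is small when the anchor is much closer to the positive
    than to any negative, and large otherwise.
    """
    candidates = [positive] + list(negatives)
    logits = [cosine(anchor, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_denominator - logits[0]

# Toy embeddings (hypothetical values, 2 dimensions for readability)
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

good = contrastive_loss(anchor, positive, negatives)
# Swap in a dissimilar vector as the "positive" to see the loss grow
bad = contrastive_loss(anchor, negatives[0], [positive, negatives[1]])
```

Training pushes the loss toward the `good` case: the instruction-conditioned embedding of an input ends up near its paired text and away from the other texts in the batch.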
Model Matrix
| Model | Parameters | Dimensions | Features |
|---|---|---|---|
| instructor-xl | ~1.5B | 768 | Highest accuracy |
| instructor-large | ~335M | 768 | Balanced choice |
| instructor-base | ~110M | 768 | Lightweight and efficient |
Business Model
- Completely Open Source and Free: Apache-2.0 license
- Available on Hugging Face: Directly downloadable from Hugging Face
- Framework Integration: Usable from mainstream frameworks such as LangChain and Haystack
Target Users
- Developers needing embeddings optimized for specific tasks
- Multi-task NLP application developers
- Academic researchers
- Teams that do not want to train dedicated embedding models for each task
Competitive Advantages
- Instruction-driven embedding approach: a single model adapts to many tasks
- Average improvement of 3.4% over the previous best results across the 70 evaluation tasks
- Zero-shot adaptation to new tasks without retraining
- Parameter-efficient: an order of magnitude fewer parameters than the previous best model
- Strong academic background (ACL 2023 paper)
Usage Example
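from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
# Different tasks use different instructions; each input is an [instruction, text] pair
query = [["Represent the query for retrieval:", "What is RAG?"]]
doc = [["Represent the document for retrieval:", "RAG combines retrieval with generation..."]]
# encode() returns one embedding vector per [instruction, text] pair
query_embedding = model.encode(query)
doc_embedding = model.encode(doc)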
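Once query and document embeddings are available, retrieval typically reduces to ranking documents by cosine similarity. The sketch below is a minimal, dependency-free illustration of that step. `build_pairs` and `rank_by_similarity` are hypothetical helper names, and the short vectors are placeholders standing in for real model output.

```python
import math

def build_pairs(instruction, texts):
    """Pair one instruction with each text, giving [instruction, text] inputs."""
    return [[instruction, text] for text in texts]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_index, score) tuples sorted from most to least similar."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

pairs = build_pairs(
    "Represent the document for retrieval:",
    ["RAG combines retrieval with generation...", "Bananas are rich in potassium."],
)

# Placeholder vectors (assumption): real embeddings would come from the model
query_vec = [0.9, 0.1, 0.0]
doc_vecs = [[0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]
ranking = rank_by_similarity(query_vec, doc_vecs)
```

With the model loaded, the placeholder vectors would be replaced by the output of `model.encode(...)` on the query pair and on `pairs`.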
Relationship with OpenClaw Ecosystem
INSTRUCTOR's instruction-driven embedding approach is highly suitable for OpenClaw's multi-task agent scenarios. When OpenClaw agents handle different types of tasks (searching documents, classifying emails, semantic matching, etc.), they can use different instructions to generate embeddings optimized for specific tasks with the same INSTRUCTOR model, without deploying and switching multiple dedicated models.
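The per-task instruction switching described above can be sketched as a simple lookup from task type to instruction. The task names, the instruction wording, and the `make_input` helper are all hypothetical; OpenClaw's actual agent API is not described here and is not assumed.

```python
# Hypothetical task-to-instruction table; the wording is illustrative only
TASK_INSTRUCTIONS = {
    "document_search": "Represent the document for retrieval:",
    "email_classification": "Represent the email for classification:",
    "semantic_matching": "Represent the sentence for semantic similarity:",
}

def make_input(task, text):
    """Build the [instruction, text] pair for a task; fail loudly on unknown tasks."""
    try:
        instruction = TASK_INSTRUCTIONS[task]
    except KeyError:
        raise ValueError(f"No instruction registered for task: {task}") from None
    return [instruction, text]

# Inputs for different tasks can share one batch and one model instance
batch = [
    make_input("document_search", "RAG combines retrieval with generation..."),
    make_input("email_classification", "Your invoice for March is attached."),
]
# With the model loaded once: embeddings = model.encode(batch)
```

Keeping the instructions in one table means that supporting a new task type is a one-line change rather than a new model deployment.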