# Embedding Models
Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.
Choose, deploy, and optimize embedding models for high-quality vector retrieval.
## Model Landscape

### Commercial Embedding APIs
| Model | Dimensions | Max Tokens | Cost (per 1M tokens) | Strengths |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 (configurable) | 8191 | ~$0.02 | Cheapest, good quality |
| OpenAI text-embedding-3-large | 3072 (configurable) | 8191 | ~$0.13 | Highest quality from OpenAI |
| Cohere embed-v3 | 1024 | 512 | ~$0.10 | Input types, multilingual, compression |
| Voyage voyage-3 | 1024 | 32000 | ~$0.06 | Long context, code-optimized variant |
| Google text-embedding-005 | 768 | 2048 | ~$0.00 (free tier) | Good for GCP-native stacks |
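To compare providers at your corpus size, a quick back-of-envelope indexing cost estimate using the per-1M-token rates from the table above (prices change; verify against current provider pricing):

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """One-time cost to embed a corpus at a per-1M-token rate."""
    return total_tokens / 1_000_000 * price_per_million

# Example: 100K documents averaging 500 tokens each = 50M tokens
corpus_tokens = 100_000 * 500
for model, price in [("text-embedding-3-small", 0.02),
                     ("text-embedding-3-large", 0.13),
                     ("embed-v3", 0.10)]:
    print(f"{model}: ${embedding_cost_usd(corpus_tokens, price):.2f}")
# text-embedding-3-small: $1.00
# text-embedding-3-large: $6.50
# embed-v3: $5.00
```

Note that indexing cost is usually dwarfed by re-indexing cost over a system's lifetime, which is why the caching section below matters.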
### Open-Source Models
| Model | Dimensions | MTEB Score | Parameters | Notes |
|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 64.2 | 335M | Top open-source English |
| BAAI/bge-m3 | 1024 | — | 568M | Multilingual, multi-granularity |
| intfloat/e5-large-v2 | 1024 | 62.7 | 335M | Strong general-purpose |
| intfloat/multilingual-e5-large | 1024 | — | 560M | 100+ languages |
| nomic-ai/nomic-embed-text-v1.5 | 768 | 62.3 | 137M | Small, Matryoshka support |
| Alibaba-NLP/gte-large-en-v1.5 | 1024 | 65.4 | 434M | Strong English benchmark |
## Using Commercial APIs

### OpenAI Embeddings

```python
from openai import OpenAI

client = OpenAI()

def embed_texts(texts, model="text-embedding-3-small", dimensions=None):
    """Embed a batch of texts."""
    kwargs = {"input": texts, "model": model}
    if dimensions:
        kwargs["dimensions"] = dimensions  # Matryoshka: reduce dims
    response = client.embeddings.create(**kwargs)
    return [item.embedding for item in response.data]

# Single text
embedding = embed_texts(["How does authentication work?"])[0]

# Reduced dimensions for cost/speed (Matryoshka)
small_embedding = embed_texts(
    ["How does authentication work?"],
    dimensions=512,  # Down from 1536, ~5% quality loss
)[0]
```
### Cohere Embeddings

```python
import cohere

co = cohere.Client()

# Cohere supports input_type for better quality
query_embedding = co.embed(
    texts=["How does auth work?"],
    model="embed-english-v3.0",
    input_type="search_query",  # Use for queries
    embedding_types=["float"],
).embeddings.float[0]

doc_embeddings = co.embed(
    texts=["Authentication uses JWT tokens...", "OAuth2 flow..."],
    model="embed-english-v3.0",
    input_type="search_document",  # Use for documents
    embedding_types=["float"],
).embeddings.float
```
### Voyage AI

```python
import voyageai

vo = voyageai.Client()

# General embedding
result = vo.embed(
    ["How does authentication work?"],
    model="voyage-3",
    input_type="query",
)
query_embedding = result.embeddings[0]

# Code-specific model
code_result = vo.embed(
    ["def authenticate(user, password):"],
    model="voyage-code-3",
    input_type="document",
)
```
## Using Open-Source Models

### With sentence-transformers

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# BGE models need a query prefix for retrieval
queries = ["Represent this sentence for searching relevant passages: How does auth work?"]
docs = ["Authentication uses JWT tokens issued by the identity provider."]

query_embeddings = model.encode(queries, normalize_embeddings=True)
doc_embeddings = model.encode(docs, normalize_embeddings=True)

# Cosine similarity (vectors are normalized, so dot product works)
similarity = np.dot(query_embeddings[0], doc_embeddings[0])
```
### With Hugging Face Transformers (direct)

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large-v2")
model = AutoModel.from_pretrained("intfloat/e5-large-v2")

def embed_e5(texts, prefix="passage: "):
    """E5 models need a 'query: ' or 'passage: ' prefix."""
    texts = [prefix + t for t in texts]
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
    with torch.no_grad():
        outputs = model(**encoded)
    # Mean pooling over non-padding tokens
    attention_mask = encoded["attention_mask"]
    token_embeddings = outputs.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    # L2-normalize so dot product == cosine similarity
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings.numpy()

query_emb = embed_e5(["How does auth work?"], prefix="query: ")
doc_emb = embed_e5(["Authentication uses JWT tokens."], prefix="passage: ")
```
## Dimensionality Choices

### Matryoshka Representations

Models like OpenAI text-embedding-3 and Nomic Embed support truncating dimensions without retraining.

```python
# Full dimensions vs. reduced (uses embed_texts from the OpenAI section above)
full = embed_texts(["test"], model="text-embedding-3-small")                     # 1536 dims
half = embed_texts(["test"], model="text-embedding-3-small", dimensions=768)     # 768 dims
quarter = embed_texts(["test"], model="text-embedding-3-small", dimensions=384)  # 384 dims
```
| Dimensions | Storage per 1M docs | Quality Impact | Use Case |
|---|---|---|---|
| 1536 | ~6 GB | Baseline | High-quality production |
| 768 | ~3 GB | ~2-3% drop | Good balance |
| 384 | ~1.5 GB | ~5-8% drop | Cost-constrained, large corpus |
| 256 | ~1 GB | ~10-15% drop | Rough filtering only |
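The storage figures above assume float32 vectors (4 bytes per dimension) and exclude index overhead; the arithmetic is easy to reproduce for your own corpus size:

```python
def index_storage_gb(num_docs: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (float32 by default), ignoring index overhead."""
    return num_docs * dims * bytes_per_value / 1e9

print(index_storage_gb(1_000_000, 1536))  # 6.144  (~6 GB, matching the table)
print(index_storage_gb(1_000_000, 384))   # 1.536  (~1.5 GB)
```

Product-quantized or int8-compressed indexes shrink these numbers further, at an additional small recall cost.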
## Batch Processing

### Efficient Batch Embedding

```python
import asyncio
import time
from typing import List

from openai import AsyncOpenAI

def batch_embed(texts: List[str], batch_size: int = 100, model: str = "text-embedding-3-small"):
    """Embed texts in batches with rate limiting (uses embed_texts from above)."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = embed_texts(batch, model=model)
        all_embeddings.extend(embeddings)
        if i + batch_size < len(texts):
            time.sleep(0.1)  # Basic rate limiting
    return all_embeddings

# For large corpora: async batching
async_client = AsyncOpenAI()

async def async_embed_batch(texts: List[str], batch_size: int = 100):
    """Parallel async embedding."""
    async def embed_one_batch(batch):
        response = await async_client.embeddings.create(
            input=batch, model="text-embedding-3-small"
        )
        return [item.embedding for item in response.data]

    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    # Process 5 batches concurrently
    results = []
    for i in range(0, len(batches), 5):
        group = batches[i:i + 5]
        group_results = await asyncio.gather(*[embed_one_batch(b) for b in group])
        for r in group_results:
            results.extend(r)
    return results
```
## Embedding Cache

```python
import hashlib
import json
import sqlite3
from typing import List, Optional

class EmbeddingCache:
    """SQLite-backed embedding cache to avoid re-embedding identical text."""

    def __init__(self, db_path: str = "embedding_cache.db", model: str = "text-embedding-3-small"):
        self.model = model
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS cache (
                text_hash TEXT PRIMARY KEY,
                model TEXT,
                embedding BLOB
            )
        """)

    def _hash(self, text: str) -> str:
        # Key on model + text so switching models never serves stale vectors
        return hashlib.sha256(f"{self.model}:{text}".encode()).hexdigest()

    def get(self, text: str) -> Optional[List[float]]:
        row = self.conn.execute(
            "SELECT embedding FROM cache WHERE text_hash = ?",
            (self._hash(text),)
        ).fetchone()
        if row:
            return json.loads(row[0])
        return None

    def put(self, text: str, embedding: List[float]):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (text_hash, model, embedding) VALUES (?, ?, ?)",
            (self._hash(text), self.model, json.dumps(embedding))
        )
        self.conn.commit()

    def embed_with_cache(self, texts: List[str]) -> List[List[float]]:
        results = [None] * len(texts)
        uncached_indices = []
        for i, text in enumerate(texts):
            cached = self.get(text)
            if cached is not None:
                results[i] = cached
            else:
                uncached_indices.append(i)
        if uncached_indices:
            uncached_texts = [texts[i] for i in uncached_indices]
            new_embeddings = embed_texts(uncached_texts, model=self.model)
            for idx, emb in zip(uncached_indices, new_embeddings):
                results[idx] = emb
                self.put(texts[idx], emb)
        return results
```
## Fine-Tuning Embeddings

When off-the-shelf models underperform on your domain (medical, legal, niche technical).

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Prepare training data: (query, positive_passage, negative_passage)
train_examples = [
    InputExample(texts=["What is OAuth2?", "OAuth2 is an authorization framework...", "Python is a programming language..."]),
    InputExample(texts=["JWT expiry", "JSON Web Tokens have a configurable expiry...", "CSS Grid layouts allow..."]),
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-embeddings",
)
```

**When to fine-tune:**

- Domain-specific jargon not in general models (medical codes, legal citations)
- Retrieval recall below 70% on your eval set with the best off-the-shelf model
- You have at least 1000 query-passage pairs for training

**When NOT to fine-tune:**

- General knowledge domains (try better chunking first)
- Fewer than 500 training examples
- The bottleneck is generation, not retrieval
## Anti-Patterns

1. **Mixing embedding models** -- Never embed queries with one model and documents with another. The vector spaces are incompatible.
2. **Ignoring query/document prefixes** -- BGE needs "Represent this sentence...", E5 needs "query:"/"passage:", Cohere needs input_type. Omitting these degrades quality by 5-15%.
3. **Re-embedding unchanged documents** -- Always cache embeddings. Re-indexing a 100K-document corpus costs time and money for zero benefit.
4. **Using max dimensions when unnecessary** -- Matryoshka models let you trade 2-5% quality for 50-75% storage savings. Always benchmark reduced dimensions.
5. **Embedding very long text without chunking** -- Models have token limits (512-8192). Text beyond the limit is silently truncated, losing information.
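A cheap pre-flight check catches silent truncation before indexing. This sketch uses the rough ~4-characters-per-token heuristic for English text; swap in the model's actual tokenizer (e.g. tiktoken for OpenAI models) for exact counts:

```python
def flag_overlong(texts, max_tokens=512, chars_per_token=4):
    """Return indices of texts likely to exceed the model's token limit.

    chars_per_token=4 is a rough English-text heuristic, not an exact count;
    use the model's real tokenizer when precision matters.
    """
    limit_chars = max_tokens * chars_per_token
    return [i for i, t in enumerate(texts) if len(t) > limit_chars]

docs = ["short text", "x" * 5000]
print(flag_overlong(docs, max_tokens=512))  # [1] -- second doc likely exceeds 512 tokens
```

Flagged documents should go back through the chunker rather than being embedded as-is.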
## Selection Checklist

- [ ] Benchmark 2-3 models on your eval set before committing
- [ ] Match query/document prefixes to model requirements
- [ ] Choose dimensions based on corpus size and quality needs
- [ ] Implement an embedding cache before production indexing
- [ ] Test multilingual support if your corpus is not English-only
- [ ] Verify token limits match your chunk sizes
- [ ] Calculate monthly embedding cost at expected query volume
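The benchmarking step can be as simple as a recall@k harness run against each candidate model's output. A minimal pure-NumPy sketch, assuming L2-normalized embeddings and one known-relevant document per query (both assumptions, adapt to your eval set):

```python
import numpy as np

def recall_at_k(query_embs, doc_embs, relevant_doc_idx, k=5):
    """Fraction of queries whose relevant doc appears in the top-k by cosine sim.

    Assumes embeddings are L2-normalized, so dot product == cosine similarity.
    """
    sims = query_embs @ doc_embs.T            # (n_queries, n_docs) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]   # top-k doc indices per query
    hits = [rel in topk[i] for i, rel in enumerate(relevant_doc_idx)]
    return sum(hits) / len(hits)

# Sanity check: when each query embedding IS its relevant doc, recall@1 is perfect
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
print(recall_at_k(docs, docs, list(range(100)), k=1))  # 1.0
```

Run the same query/doc pairs through each candidate model and compare recall@k side by side before committing to one.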
## Related Skills

- **advanced-rag** -- Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.
- **chunking-strategies** -- Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.
- **rag-evaluation** -- Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.
- **rag-fundamentals** -- Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.
- **rag-production** -- Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.
- **rag-with-langchain** -- Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.