
embedding-models

Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.


Embedding Models

Choose, deploy, and optimize embedding models for high-quality vector retrieval.


Model Landscape

Commercial Embedding APIs

| Model | Dimensions | Max Tokens | Cost (per 1M tokens) | Strengths |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 (configurable) | 8191 | ~$0.02 | Cheapest, good quality |
| OpenAI text-embedding-3-large | 3072 (configurable) | 8191 | ~$0.13 | Highest quality from OpenAI |
| Cohere embed-v3 | 1024 | 512 | ~$0.10 | Input types, multilingual, compression |
| Voyage voyage-3 | 1024 | 32000 | ~$0.06 | Long context, code-optimized variant |
| Google text-embedding-005 | 768 | 2048 | ~$0.00 (free tier) | Good for GCP-native stacks |

Open-Source Models

| Model | Dimensions | MTEB Score | Parameters | Notes |
|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 64.2 | 335M | Top open-source English |
| BAAI/bge-m3 | 1024 | n/a | 568M | Multilingual, multi-granularity |
| intfloat/e5-large-v2 | 1024 | 62.7 | 335M | Strong general-purpose |
| intfloat/multilingual-e5-large | 1024 | n/a | 560M | 100+ languages |
| nomic-ai/nomic-embed-text-v1.5 | 768 | 62.3 | 137M | Small, Matryoshka support |
| Alibaba/gte-large-en-v1.5 | 1024 | 65.4 | 434M | Strong English benchmark |

Using Commercial APIs

OpenAI Embeddings

```python
from openai import OpenAI

client = OpenAI()

def embed_texts(texts, model="text-embedding-3-small", dimensions=None):
    """Embed a batch of texts."""
    kwargs = {"input": texts, "model": model}
    if dimensions:
        kwargs["dimensions"] = dimensions  # Matryoshka: reduce dims
    response = client.embeddings.create(**kwargs)
    return [item.embedding for item in response.data]

# Single text
embedding = embed_texts(["How does authentication work?"])[0]

# Reduced dimensions for cost/speed (Matryoshka)
small_embedding = embed_texts(
    ["How does authentication work?"],
    dimensions=512  # Down from 1536, ~5% quality loss
)[0]
```

Cohere Embeddings

```python
import cohere

co = cohere.Client()

# Cohere supports input_type for better quality
query_embedding = co.embed(
    texts=["How does auth work?"],
    model="embed-english-v3.0",
    input_type="search_query",      # Use for queries
    embedding_types=["float"],
).embeddings.float[0]

doc_embeddings = co.embed(
    texts=["Authentication uses JWT tokens...", "OAuth2 flow..."],
    model="embed-english-v3.0",
    input_type="search_document",   # Use for documents
    embedding_types=["float"],
).embeddings.float
```

Voyage AI

```python
import voyageai

vo = voyageai.Client()

# General embedding
result = vo.embed(
    ["How does authentication work?"],
    model="voyage-3",
    input_type="query"
)
query_embedding = result.embeddings[0]

# Code-specific model
code_result = vo.embed(
    ["def authenticate(user, password):"],
    model="voyage-code-3",
    input_type="document"
)
```

Using Open-Source Models

With sentence-transformers

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# BGE models need a query prefix for retrieval
queries = ["Represent this sentence for searching relevant passages: How does auth work?"]
docs = ["Authentication uses JWT tokens issued by the identity provider."]

query_embeddings = model.encode(queries, normalize_embeddings=True)
doc_embeddings = model.encode(docs, normalize_embeddings=True)

# Cosine similarity (normalized vectors, so dot product works)
similarity = np.dot(query_embeddings[0], doc_embeddings[0])
```
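Once query and document vectors are normalized, retrieval reduces to ranking by dot product. A minimal top-k sketch, with small synthetic vectors standing in for real model output:

```python
import numpy as np

def top_k(query_emb, doc_embs, k=2):
    """Rank documents by cosine similarity (dot product on normalized vectors)."""
    scores = doc_embs @ query_emb
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Synthetic unit-length embeddings: 4 docs x 3 dims
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.6, 0.8, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([0.8, 0.6, 0.0])

print(top_k(query, docs))  # doc 2 ranks first (score 0.96), then doc 0 (0.8)
```

In production the same ranking is done by a vector database, but the math is identical.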

With Hugging Face Transformers (direct)

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large-v2")
model = AutoModel.from_pretrained("intfloat/e5-large-v2")

def embed_e5(texts, prefix="passage: "):
    """E5 models need 'query: ' or 'passage: ' prefix."""
    texts = [prefix + t for t in texts]
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
    with torch.no_grad():
        outputs = model(**encoded)
    # Mean pooling
    attention_mask = encoded["attention_mask"]
    token_embeddings = outputs.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    # Normalize
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings.numpy()

query_emb = embed_e5(["How does auth work?"], prefix="query: ")
doc_emb = embed_e5(["Authentication uses JWT tokens."], prefix="passage: ")
```

Dimensionality Choices

Matryoshka Representations

Models like OpenAI text-embedding-3 and Nomic Embed support truncating dimensions without retraining.

```python
# Full dimensions vs. reduced
full = embed_texts(["test"], model="text-embedding-3-small")          # 1536 dims
half = embed_texts(["test"], model="text-embedding-3-small", dimensions=768)  # 768 dims
quarter = embed_texts(["test"], model="text-embedding-3-small", dimensions=384)  # 384 dims
```
| Dimensions | Storage per 1M docs | Quality Impact | Use Case |
|---|---|---|---|
| 1536 | ~6 GB | Baseline | High-quality production |
| 768 | ~3 GB | ~2-3% drop | Good balance |
| 384 | ~1.5 GB | ~5-8% drop | Cost-constrained, large corpus |
| 256 | ~1 GB | ~10-15% drop | Rough filtering only |
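The storage column follows directly from float32 vector size: dimensions x 4 bytes x document count, for raw vectors only (ANN index structures such as HNSW graphs add overhead on top). A quick sanity check of the table's figures:

```python
def storage_gb(dims: int, num_docs: int, bytes_per_value: int = 4) -> float:
    """Raw float32 vector storage in GB, excluding index overhead."""
    return dims * bytes_per_value * num_docs / 1e9

for dims in (1536, 768, 384, 256):
    print(f"{dims} dims -> {storage_gb(dims, 1_000_000):.2f} GB per 1M docs")
```

This prints 6.14, 3.07, 1.54, and 1.02 GB respectively, matching the table's approximations.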

Batch Processing

Efficient Batch Embedding

```python
import time
from typing import List

def batch_embed(texts: List[str], batch_size: int = 100, model: str = "text-embedding-3-small"):
    """Embed texts in batches with rate limiting."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = embed_texts(batch, model=model)
        all_embeddings.extend(embeddings)
        if i + batch_size < len(texts):
            time.sleep(0.1)  # Basic rate limiting
    return all_embeddings

# For large corpora: async batching
import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def async_embed_batch(texts: List[str], batch_size: int = 100):
    """Parallel async embedding."""
    async def embed_one_batch(batch):
        response = await async_client.embeddings.create(
            input=batch, model="text-embedding-3-small"
        )
        return [item.embedding for item in response.data]

    batches = [texts[i:i+batch_size] for i in range(0, len(texts), batch_size)]
    # Process 5 batches concurrently
    results = []
    for i in range(0, len(batches), 5):
        group = batches[i:i+5]
        group_results = await asyncio.gather(*[embed_one_batch(b) for b in group])
        for r in group_results:
            results.extend(r)
    return results
```

Embedding Cache

```python
import hashlib
import json
import sqlite3
from typing import List, Optional

class EmbeddingCache:
    """SQLite-backed embedding cache to avoid re-embedding identical text."""

    def __init__(self, db_path: str = "embedding_cache.db", model: str = "text-embedding-3-small"):
        self.model = model
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS cache (
                text_hash TEXT PRIMARY KEY,
                model TEXT,
                embedding BLOB
            )
        """)

    def _hash(self, text: str) -> str:
        return hashlib.sha256(f"{self.model}:{text}".encode()).hexdigest()

    def get(self, text: str) -> Optional[List[float]]:
        row = self.conn.execute(
            "SELECT embedding FROM cache WHERE text_hash = ?",
            (self._hash(text),)
        ).fetchone()
        if row:
            return json.loads(row[0])
        return None

    def put(self, text: str, embedding: List[float]):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (text_hash, model, embedding) VALUES (?, ?, ?)",
            (self._hash(text), self.model, json.dumps(embedding))
        )
        self.conn.commit()

    def embed_with_cache(self, texts: List[str]) -> List[List[float]]:
        results = [None] * len(texts)
        uncached_indices = []

        for i, text in enumerate(texts):
            cached = self.get(text)
            if cached is not None:
                results[i] = cached
            else:
                uncached_indices.append(i)

        if uncached_indices:
            uncached_texts = [texts[i] for i in uncached_indices]
            new_embeddings = embed_texts(uncached_texts, model=self.model)
            for idx, emb in zip(uncached_indices, new_embeddings):
                results[idx] = emb
                self.put(texts[idx], emb)

        return results
```

Fine-Tuning Embeddings

When off-the-shelf models underperform on your domain (medical, legal, niche technical).

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Prepare training data: (query, positive_passage, negative_passage)
train_examples = [
    InputExample(texts=["What is OAuth2?", "OAuth2 is an authorization framework...", "Python is a programming language..."]),
    InputExample(texts=["JWT expiry", "JSON Web Tokens have a configurable expiry...", "CSS Grid layouts allow..."]),
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-embeddings",
)
```

When to fine-tune:

- Domain-specific jargon not in general models (medical codes, legal citations)
- Retrieval recall below 70% on your eval set with best off-the-shelf model
- You have at least 1000 query-passage pairs for training

When NOT to fine-tune:

- General knowledge domains (try better chunking first)
- Fewer than 500 training examples
- The bottleneck is generation, not retrieval

Anti-Patterns

1. **Mixing embedding models** -- Never embed queries with one model and documents with another. The vector spaces are incompatible.
2. **Ignoring query/document prefixes** -- BGE needs "Represent this sentence...", E5 needs "query:"/"passage:", Cohere needs input_type. Omitting these degrades quality by 5-15%.
3. **Re-embedding unchanged documents** -- Always cache embeddings. Re-indexing a 100K document corpus costs time and money for zero benefit.
4. **Using max dimensions when unnecessary** -- Matryoshka models let you trade 2-5% quality for 50-75% storage savings. Always benchmark reduced dimensions.
5. **Embedding very long text without chunking** -- Models have token limits (512-8192). Text beyond the limit is silently truncated, losing information.
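For anti-pattern 5, a cheap pre-flight check can catch chunks at risk of silent truncation. This sketch uses a rough ~4 characters-per-token heuristic for English text (an assumption, not a tokenizer); for exact counts, run the model's own tokenizer:

```python
def truncation_risks(texts, max_tokens=512, chars_per_token=4):
    """Return indices of texts likely to exceed the model's token limit.

    chars_per_token=4 is a rough English-text estimate, not a real
    tokenizer; swap in the model's tokenizer for exact counts.
    """
    limit_chars = max_tokens * chars_per_token
    return [i for i, t in enumerate(texts) if len(t) > limit_chars]

chunks = ["Authentication uses JWT tokens.", "x" * 5000]
print(truncation_risks(chunks, max_tokens=512))  # -> [1]
```

Running this over a corpus before indexing flags chunks that need re-splitting rather than letting the embedding API drop their tails silently.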


Selection Checklist

- [ ] Benchmark 2-3 models on your eval set before committing
- [ ] Match query/document prefixes to model requirements
- [ ] Choose dimensions based on corpus size and quality needs
- [ ] Implement embedding cache before production indexing
- [ ] Test multilingual support if your corpus is not English-only
- [ ] Verify token limits match your chunk sizes
- [ ] Calculate monthly embedding cost at expected query volume
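For the last checklist item, the arithmetic is simple: total tokens embedded per month (new documents plus queries) times the per-million-token price. The volumes and the $0.02/1M price below are illustrative; check your provider's current pricing:

```python
def monthly_embedding_cost(new_doc_tokens, query_count, avg_query_tokens, price_per_m_tokens):
    """Estimate monthly spend: total tokens embedded x price per 1M tokens."""
    total_tokens = new_doc_tokens + query_count * avg_query_tokens
    return total_tokens / 1e6 * price_per_m_tokens

# Example: 50M new document tokens, 2M queries averaging ~20 tokens, at $0.02/1M
cost = monthly_embedding_cost(50_000_000, 2_000_000, 20, 0.02)
print(f"${cost:.2f}/month")  # $1.80/month
```

Note that with a working embedding cache, the document term covers only new or changed text, which is usually what keeps this number small.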


Related Skills

- **advanced-rag** -- Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.
- **chunking-strategies** -- Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.
- **rag-evaluation** -- Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.
- **rag-fundamentals** -- Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.
- **rag-production** -- Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.
- **rag-with-langchain** -- Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.