
embedding-models

Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.


Embedding Models

Choose, deploy, and optimize embedding models for high-quality vector retrieval.


Model Landscape

Commercial Embedding APIs

| Model | Dimensions | Max Tokens | Cost (per 1M tokens) | Strengths |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 (configurable) | 8191 | ~$0.02 | Cheapest, good quality |
| OpenAI text-embedding-3-large | 3072 (configurable) | 8191 | ~$0.13 | Highest quality from OpenAI |
| Cohere embed-v3 | 1024 | 512 | ~$0.10 | Input types, multilingual, compression |
| Voyage voyage-3 | 1024 | 32000 | ~$0.06 | Long context, code-optimized variant |
| Google text-embedding-005 | 768 | 2048 | ~$0.00 (free tier) | Good for GCP-native stacks |

Open-Source Models

| Model | Dimensions | MTEB Score | Parameters | Notes |
|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 64.2 | 335M | Top open-source English |
| BAAI/bge-m3 | 1024 | n/a | 568M | Multilingual, multi-granularity |
| intfloat/e5-large-v2 | 1024 | 62.7 | 335M | Strong general-purpose |
| intfloat/multilingual-e5-large | 1024 | n/a | 560M | 100+ languages |
| nomic-ai/nomic-embed-text-v1.5 | 768 | 62.3 | 137M | Small, Matryoshka support |
| Alibaba/gte-large-en-v1.5 | 1024 | 65.4 | 434M | Strong English benchmark |

Using Commercial APIs

OpenAI Embeddings

```python
from openai import OpenAI

client = OpenAI()

def embed_texts(texts, model="text-embedding-3-small", dimensions=None):
    """Embed a batch of texts."""
    kwargs = {"input": texts, "model": model}
    if dimensions:
        kwargs["dimensions"] = dimensions  # Matryoshka: reduce dims
    response = client.embeddings.create(**kwargs)
    return [item.embedding for item in response.data]

# Single text
embedding = embed_texts(["How does authentication work?"])[0]

# Reduced dimensions for cost/speed (Matryoshka)
small_embedding = embed_texts(
    ["How does authentication work?"],
    dimensions=512  # Down from 1536, ~5% quality loss
)[0]
```

Cohere Embeddings

```python
import cohere

co = cohere.Client()

# Cohere supports input_type for better quality
query_embedding = co.embed(
    texts=["How does auth work?"],
    model="embed-english-v3.0",
    input_type="search_query",      # Use for queries
    embedding_types=["float"],
).embeddings.float[0]

doc_embeddings = co.embed(
    texts=["Authentication uses JWT tokens...", "OAuth2 flow..."],
    model="embed-english-v3.0",
    input_type="search_document",   # Use for documents
    embedding_types=["float"],
).embeddings.float
```

Voyage AI

```python
import voyageai

vo = voyageai.Client()

# General embedding
result = vo.embed(
    ["How does authentication work?"],
    model="voyage-3",
    input_type="query"
)
query_embedding = result.embeddings[0]

# Code-specific model
code_result = vo.embed(
    ["def authenticate(user, password):"],
    model="voyage-code-3",
    input_type="document"
)
```

Using Open-Source Models

With sentence-transformers

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# BGE models need a query prefix for retrieval
queries = ["Represent this sentence for searching relevant passages: How does auth work?"]
docs = ["Authentication uses JWT tokens issued by the identity provider."]

query_embeddings = model.encode(queries, normalize_embeddings=True)
doc_embeddings = model.encode(docs, normalize_embeddings=True)

# Cosine similarity (normalized vectors, so dot product works)
similarity = np.dot(query_embeddings[0], doc_embeddings[0])
```
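Once query and document vectors are normalized, retrieval reduces to ranking by dot product. A minimal top-k sketch, with small synthetic vectors standing in for real model output:

```python
import numpy as np

def top_k(query_emb, doc_embs, k=2):
    """Rank documents by cosine similarity (dot product on normalized vectors)."""
    scores = doc_embs @ query_emb
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Synthetic unit-length embeddings: 4 docs x 3 dims
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.6, 0.8, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([0.8, 0.6, 0.0])

print(top_k(query, docs))  # doc 2 ranks first (score 0.96), then doc 0 (0.8)
```

In production the same ranking is done by a vector database, but the math is identical.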

With Hugging Face Transformers (direct)

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large-v2")
model = AutoModel.from_pretrained("intfloat/e5-large-v2")

def embed_e5(texts, prefix="passage: "):
    """E5 models need 'query: ' or 'passage: ' prefix."""
    texts = [prefix + t for t in texts]
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
    with torch.no_grad():
        outputs = model(**encoded)
    # Mean pooling
    attention_mask = encoded["attention_mask"]
    token_embeddings = outputs.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    # Normalize
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings.numpy()

query_emb = embed_e5(["How does auth work?"], prefix="query: ")
doc_emb = embed_e5(["Authentication uses JWT tokens."], prefix="passage: ")
```

Dimensionality Choices

Matryoshka Representations

Models like OpenAI text-embedding-3 and Nomic Embed support truncating dimensions without retraining.

```python
# Full dimensions vs. reduced
full = embed_texts(["test"], model="text-embedding-3-small")          # 1536 dims
half = embed_texts(["test"], model="text-embedding-3-small", dimensions=768)  # 768 dims
quarter = embed_texts(["test"], model="text-embedding-3-small", dimensions=384)  # 384 dims
```
| Dimensions | Storage per 1M docs | Quality Impact | Use Case |
|---|---|---|---|
| 1536 | ~6 GB | Baseline | High-quality production |
| 768 | ~3 GB | ~2-3% drop | Good balance |
| 384 | ~1.5 GB | ~5-8% drop | Cost-constrained, large corpus |
| 256 | ~1 GB | ~10-15% drop | Rough filtering only |
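The storage column follows directly from float32 vector size: dimensions x 4 bytes x document count, for raw vectors only (ANN index structures such as HNSW graphs add overhead on top). A quick sanity check of the table's figures:

```python
def storage_gb(dims: int, num_docs: int, bytes_per_value: int = 4) -> float:
    """Raw float32 vector storage in GB, excluding index overhead."""
    return dims * bytes_per_value * num_docs / 1e9

for dims in (1536, 768, 384, 256):
    print(f"{dims} dims -> {storage_gb(dims, 1_000_000):.2f} GB per 1M docs")
```

This prints 6.14, 3.07, 1.54, and 1.02 GB respectively, matching the table's approximations.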

Batch Processing

Efficient Batch Embedding

```python
import time
from typing import List

def batch_embed(texts: List[str], batch_size: int = 100, model: str = "text-embedding-3-small"):
    """Embed texts in batches with rate limiting."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = embed_texts(batch, model=model)
        all_embeddings.extend(embeddings)
        if i + batch_size < len(texts):
            time.sleep(0.1)  # Basic rate limiting
    return all_embeddings

# For large corpora: async batching
import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def async_embed_batch(texts: List[str], batch_size: int = 100):
    """Parallel async embedding."""
    async def embed_one_batch(batch):
        response = await async_client.embeddings.create(
            input=batch, model="text-embedding-3-small"
        )
        return [item.embedding for item in response.data]

    batches = [texts[i:i+batch_size] for i in range(0, len(texts), batch_size)]
    # Process 5 batches concurrently
    results = []
    for i in range(0, len(batches), 5):
        group = batches[i:i+5]
        group_results = await asyncio.gather(*[embed_one_batch(b) for b in group])
        for r in group_results:
            results.extend(r)
    return results
```

Embedding Cache

```python
import hashlib
import json
import sqlite3
from typing import List, Optional

class EmbeddingCache:
    """SQLite-backed embedding cache to avoid re-embedding identical text."""

    def __init__(self, db_path: str = "embedding_cache.db", model: str = "text-embedding-3-small"):
        self.model = model
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS cache (
                text_hash TEXT PRIMARY KEY,
                model TEXT,
                embedding BLOB
            )
        """)

    def _hash(self, text: str) -> str:
        return hashlib.sha256(f"{self.model}:{text}".encode()).hexdigest()

    def get(self, text: str) -> Optional[List[float]]:
        row = self.conn.execute(
            "SELECT embedding FROM cache WHERE text_hash = ?",
            (self._hash(text),)
        ).fetchone()
        if row:
            return json.loads(row[0])
        return None

    def put(self, text: str, embedding: List[float]):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (text_hash, model, embedding) VALUES (?, ?, ?)",
            (self._hash(text), self.model, json.dumps(embedding))
        )
        self.conn.commit()

    def embed_with_cache(self, texts: List[str]) -> List[List[float]]:
        results = [None] * len(texts)
        uncached_indices = []

        for i, text in enumerate(texts):
            cached = self.get(text)
            if cached is not None:
                results[i] = cached
            else:
                uncached_indices.append(i)

        if uncached_indices:
            uncached_texts = [texts[i] for i in uncached_indices]
            new_embeddings = embed_texts(uncached_texts, model=self.model)
            for idx, emb in zip(uncached_indices, new_embeddings):
                results[idx] = emb
                self.put(texts[idx], emb)

        return results
```

Fine-Tuning Embeddings

When off-the-shelf models underperform on your domain (medical, legal, niche technical).

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Prepare training data: (query, positive_passage, negative_passage)
train_examples = [
    InputExample(texts=["What is OAuth2?", "OAuth2 is an authorization framework...", "Python is a programming language..."]),
    InputExample(texts=["JWT expiry", "JSON Web Tokens have a configurable expiry...", "CSS Grid layouts allow..."]),
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-embeddings",
)
```

When to fine-tune:

- Domain-specific jargon not in general models (medical codes, legal citations)
- Retrieval recall below 70% on your eval set with best off-the-shelf model
- You have at least 1000 query-passage pairs for training

When NOT to fine-tune:

- General knowledge domains (try better chunking first)
- Fewer than 500 training examples
- The bottleneck is generation, not retrieval

Anti-Patterns

1. **Mixing embedding models** -- Never embed queries with one model and documents with another. The vector spaces are incompatible.
2. **Ignoring query/document prefixes** -- BGE needs "Represent this sentence...", E5 needs "query:"/"passage:", Cohere needs input_type. Omitting these degrades quality by 5-15%.
3. **Re-embedding unchanged documents** -- Always cache embeddings. Re-indexing a 100K document corpus costs time and money for zero benefit.
4. **Using max dimensions when unnecessary** -- Matryoshka models let you trade 2-5% quality for 50-75% storage savings. Always benchmark reduced dimensions.
5. **Embedding very long text without chunking** -- Models have token limits (512-8192). Text beyond the limit is silently truncated, losing information.
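For anti-pattern 5, a cheap pre-flight check can catch chunks at risk of silent truncation. This sketch uses a rough ~4 characters-per-token heuristic for English text (an assumption, not a tokenizer); for exact counts, run the model's own tokenizer:

```python
def truncation_risks(texts, max_tokens=512, chars_per_token=4):
    """Return indices of texts likely to exceed the model's token limit.

    chars_per_token=4 is a rough English-text estimate, not a real
    tokenizer; swap in the model's tokenizer for exact counts.
    """
    limit_chars = max_tokens * chars_per_token
    return [i for i, t in enumerate(texts) if len(t) > limit_chars]

chunks = ["Authentication uses JWT tokens.", "x" * 5000]
print(truncation_risks(chunks, max_tokens=512))  # -> [1]
```

Running this over a corpus before indexing flags chunks that need re-splitting rather than letting the embedding API drop their tails silently.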


Selection Checklist

- [ ] Benchmark 2-3 models on your eval set before committing
- [ ] Match query/document prefixes to model requirements
- [ ] Choose dimensions based on corpus size and quality needs
- [ ] Implement embedding cache before production indexing
- [ ] Test multilingual support if your corpus is not English-only
- [ ] Verify token limits match your chunk sizes
- [ ] Calculate monthly embedding cost at expected query volume
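For the last checklist item, the arithmetic is simple: total tokens embedded per month (new documents plus queries) times the per-million-token price. The volumes and the $0.02/1M price below are illustrative; check your provider's current pricing:

```python
def monthly_embedding_cost(new_doc_tokens, query_count, avg_query_tokens, price_per_m_tokens):
    """Estimate monthly spend: total tokens embedded x price per 1M tokens."""
    total_tokens = new_doc_tokens + query_count * avg_query_tokens
    return total_tokens / 1e6 * price_per_m_tokens

# Example: 50M new document tokens, 2M queries averaging ~20 tokens, at $0.02/1M
cost = monthly_embedding_cost(50_000_000, 2_000_000, 20, 0.02)
print(f"${cost:.2f}/month")  # $1.80/month
```

Note that with a working embedding cache, the document term covers only new or changed text, which is usually what keeps this number small.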


Related Skills

- **advanced-rag** -- Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.
- **chunking-strategies** -- Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.
- **rag-evaluation** -- Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.
- **rag-fundamentals** -- Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.
- **rag-production** -- Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.
- **rag-with-langchain** -- Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.