
retrieval-strategies

Covers retrieval strategies for RAG pipelines: dense retrieval, sparse retrieval (BM25), hybrid search, re-ranking with cross-encoders and Cohere Rerank, Maximal Marginal Relevance (MMR), contextual retrieval, and Hypothetical Document Embeddings (HyDE). Includes practical implementation patterns and guidance on when to use each strategy.


Retrieval Strategies

Maximize the relevance and diversity of retrieved context for LLM generation.


Dense Retrieval (Vector Search)

Encode queries and documents into dense embedding vectors, retrieve by similarity.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)

# Basic similarity search
results = vectorstore.similarity_search("How does auth work?", k=5)

# With a score threshold; returns (document, score) tuples
results = vectorstore.similarity_search_with_relevance_scores(
    "How does auth work?",
    k=5,
    score_threshold=0.7  # Only return results above this similarity
)

Strengths: Captures semantic meaning, handles paraphrases, works across languages.
Weaknesses: Misses exact keyword matches; struggles with rare terms, acronyms, and IDs.


Sparse Retrieval (BM25)

Traditional keyword-based retrieval using term frequency and inverse document frequency.

from langchain_community.retrievers import BM25Retriever

# Build BM25 index from documents
bm25_retriever = BM25Retriever.from_documents(chunks, k=5)

# Query
results = bm25_retriever.invoke("JWT token expiration")

# Custom BM25 with rank-bm25 library
from rank_bm25 import BM25Okapi

corpus = [doc.page_content for doc in chunks]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query_tokens = "JWT token expiration".lower().split()
scores = bm25.get_scores(query_tokens)

# Get top-k indices
import numpy as np
top_k_indices = np.argsort(scores)[-5:][::-1]
top_results = [(chunks[i], scores[i]) for i in top_k_indices]

Strengths: Exact keyword matching, handles rare terms/IDs well, no embedding needed, fast.
Weaknesses: No semantic understanding; misses synonyms and paraphrases.


Hybrid Search

Combine dense and sparse retrieval for the best of both worlds.

from langchain.retrievers import EnsembleRetriever

# Dense retriever
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Sparse retriever
sparse_retriever = BM25Retriever.from_documents(chunks, k=10)

# Ensemble with Reciprocal Rank Fusion
hybrid_retriever = EnsembleRetriever(
    retrievers=[sparse_retriever, dense_retriever],
    weights=[0.3, 0.7]  # Favor dense, but keep BM25 signal
)

results = hybrid_retriever.invoke("JWT token expiration policy")

Custom Reciprocal Rank Fusion (RRF)

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse multiple ranked result lists using RRF."""
    scores = {}
    for result_list in result_lists:
        for rank, doc in enumerate(result_list):
            doc_id = doc.metadata.get("id", doc.page_content[:50])
            if doc_id not in scores:
                scores[doc_id] = {"doc": doc, "score": 0}
            scores[doc_id]["score"] += 1 / (k + rank + 1)

    # Sort by fused score
    ranked = sorted(scores.values(), key=lambda x: x["score"], reverse=True)
    return [item["doc"] for item in ranked]

# Usage
dense_results = vectorstore.similarity_search(query, k=10)
sparse_results = bm25_retriever.invoke(query)
fused = reciprocal_rank_fusion([dense_results, sparse_results])[:5]

Weight Tuning Guidelines

| Query Type | Dense Weight | Sparse Weight | Why |
|---|---|---|---|
| Conceptual questions | 0.8 | 0.2 | Semantics matter more |
| Exact term lookup | 0.2 | 0.8 | Keywords matter more |
| General mixed queries | 0.6 | 0.4 | Balanced |
| Code search | 0.4 | 0.6 | Function names and variables are keywords |
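The weights in the table can also be applied directly inside a fused scoring pass, by scaling each retriever's RRF contribution. A minimal sketch (the `weighted_rrf` helper and the toy doc IDs are illustrative, not part of any library API):

```python
def weighted_rrf(result_lists, weights, k=60):
    """Fuse ranked ID lists, scaling each list's RRF contribution by its weight."""
    scores = {}
    for result_list, weight in zip(result_lists, weights):
        for rank, doc_id in enumerate(result_list):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Conceptual query: weight dense 0.8, sparse 0.2 (per the table above)
dense_ids = ["d3", "d1", "d2"]
sparse_ids = ["d2", "d4", "d1"]
fused = weighted_rrf([dense_ids, sparse_ids], weights=[0.8, 0.2])
# → ["d1", "d2", "d3", "d4"]
```

Note how "d1" wins despite topping neither list: appearing high in both lists beats topping only one, which is exactly the behavior RRF is chosen for.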

Re-Ranking

Retrieve broadly (top 20-50), then re-rank with a more powerful model to get the best top-5.

Cohere Rerank

import cohere

co = cohere.Client()

def rerank_results(query, documents, top_n=5):
    """Re-rank documents using Cohere Rerank."""
    texts = [doc.page_content for doc in documents]
    response = co.rerank(
        query=query,
        documents=texts,
        model="rerank-english-v3.0",
        top_n=top_n,
    )
    reranked = []
    for result in response.results:
        doc = documents[result.index]
        doc.metadata["rerank_score"] = result.relevance_score
        reranked.append(doc)
    return reranked

# Pipeline: retrieve 20, rerank to 5
candidates = hybrid_retriever.invoke(query)[:20]
final_results = rerank_results(query, candidates, top_n=5)

Cross-Encoder Re-Ranking (Open Source)

from sentence_transformers import CrossEncoder

# Load cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def cross_encoder_rerank(query, documents, top_n=5):
    """Re-rank using a cross-encoder model."""
    pairs = [(query, doc.page_content) for doc in documents]
    scores = reranker.predict(pairs)

    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return [doc for doc, score in scored_docs[:top_n]]

# Model options (quality/speed trade-off):
# cross-encoder/ms-marco-MiniLM-L-12-v2  (fast, good)
# BAAI/bge-reranker-v2-m3                (multilingual, strong)
# BAAI/bge-reranker-v2-gemma             (best quality, slow)

When to Re-Rank

  • Always in production when quality matters -- re-ranking consistently improves retrieval by 5-15%
  • Skip when latency budget is under 200ms total
  • Skip when corpus is small (< 1000 chunks) and initial retrieval is already precise

Maximal Marginal Relevance (MMR)

Reduce redundancy in results by balancing relevance with diversity.

# Built into most vector stores
results = vectorstore.max_marginal_relevance_search(
    query="authentication methods",
    k=5,           # Return 5 results
    fetch_k=20,    # Consider 20 candidates
    lambda_mult=0.7  # 0 = max diversity, 1 = max relevance
)

# Manual MMR implementation
import numpy as np

def mmr(query_embedding, candidate_embeddings, candidate_docs, k=5, lambda_mult=0.7):
    """Maximal Marginal Relevance selection.

    Assumes candidate_embeddings is a 2D numpy array of L2-normalized
    vectors, so dot products are cosine similarities."""
    query_sim = np.dot(candidate_embeddings, query_embedding)
    selected = []
    remaining = list(range(len(candidate_docs)))

    for _ in range(min(k, len(candidate_docs))):
        best_score = -np.inf
        best_idx = None

        for idx in remaining:
            relevance = query_sim[idx]
            if selected:
                selected_embeddings = candidate_embeddings[selected]
                redundancy = max(np.dot(selected_embeddings, candidate_embeddings[idx]))
            else:
                redundancy = 0

            mmr_score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if mmr_score > best_score:
                best_score = mmr_score
                best_idx = idx

        selected.append(best_idx)
        remaining.remove(best_idx)

    return [candidate_docs[i] for i in selected]

Use when: Retrieved chunks are repetitive, documents have overlapping content, you need diverse perspectives.


Contextual Retrieval

Prepend document-level context to each chunk before embedding to reduce ambiguity.

def add_contextual_prefix(chunk_text, document_title, section_header):
    """Prepend context to reduce chunk ambiguity."""
    prefix = f"From document: {document_title}"
    if section_header:
        prefix += f", section: {section_header}"
    return f"{prefix}\n\n{chunk_text}"

# Example: a chunk saying "It supports OAuth2" is ambiguous
# After: "From document: API Gateway Guide, section: Authentication\n\nIt supports OAuth2"

# Anthropic's Contextual Retrieval approach: use LLM to generate context
def generate_chunk_context(chunk_text, full_document, llm):
    """Use LLM to generate situating context for a chunk."""
    prompt = f"""Here is the full document:
<document>
{full_document[:5000]}
</document>

Here is a chunk from that document:
<chunk>
{chunk_text}
</chunk>

Give a short (1-2 sentence) context that situates this chunk within the document.
Focus on what the chunk is about and what document it comes from."""

    return llm.invoke(prompt).content

# Embed the contextualized chunk
contextualized = generate_chunk_context(chunk, document, llm)
text_to_embed = f"{contextualized}\n\n{chunk}"

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer to the query, embed that instead of the raw query.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def hyde_retrieval(query, vectorstore, k=5):
    """Generate hypothetical document, then search with its embedding."""
    # Step 1: Generate hypothetical answer
    hypothetical = llm.invoke(
        f"Write a short paragraph that would answer this question: {query}"
    ).content

    # Step 2: Embed the hypothetical answer (not the query)
    hyde_embedding = embeddings.embed_query(hypothetical)

    # Step 3: Search with hypothetical embedding
    results = vectorstore.similarity_search_by_vector(hyde_embedding, k=k)
    return results

# HyDE works because the hypothetical answer is closer in embedding space
# to the actual documents than the original question is

When HyDE helps: Vague queries, questions with different vocabulary than the documents.
When HyDE hurts: Factual lookups, queries with specific terms that must match exactly.
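Whether to route a query through HyDE can be decided with a cheap heuristic before spending the extra LLM call. A rough sketch (the `should_use_hyde` helper and its regex signals are illustrative assumptions, not a published recipe):

```python
import re

def should_use_hyde(query: str) -> bool:
    """Skip HyDE when the query carries exact-match signals
    (IDs, error codes, hex values, quoted strings, or very short lookups)."""
    has_identifier = bool(re.search(r"[A-Z]{2,}-?\d+|0x[0-9a-fA-F]+|\berror\s+\d+", query))
    has_quoted = '"' in query
    is_short_lookup = len(query.split()) <= 4
    return not (has_identifier or has_quoted or is_short_lookup)

should_use_hyde("How should we think about structuring authentication flows?")  # → True
should_use_hyde("Fix error 502 in nginx")        # → False (exact error code)
should_use_hyde('Where is "max_retries" set?')   # → False (quoted term)
```

The exact signals matter less than the principle: only pay the HyDE latency when the query is vague enough that a hypothetical answer will land closer to the documents than the query itself.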


Putting It All Together: Production Retrieval Pipeline

class ProductionRetriever:
    def __init__(self, vectorstore, chunks, reranker, llm):
        self.dense = vectorstore.as_retriever(search_kwargs={"k": 20})
        self.sparse = BM25Retriever.from_documents(chunks, k=20)
        self.reranker = reranker
        self.llm = llm

    def retrieve(self, query, top_k=5, use_hyde=False):
        # Optional: HyDE for vague queries
        search_query = query
        if use_hyde:
            hypothetical = self.llm.invoke(
                f"Write a paragraph answering: {query}"
            ).content
            search_query = hypothetical

        # Hybrid retrieval
        dense_results = self.dense.invoke(search_query)
        sparse_results = self.sparse.invoke(query)  # Always use original for BM25
        candidates = reciprocal_rank_fusion([dense_results, sparse_results])[:20]

        # Re-rank with the cross-encoder stored on the instance
        pairs = [(query, doc.page_content) for doc in candidates]
        scores = self.reranker.predict(pairs)
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]

Anti-Patterns

  1. Dense-only retrieval -- Misses exact keyword matches for IDs, error codes, and product names. Always consider hybrid search.

  2. Re-ranking the final k results -- Re-rank a larger candidate set (20-50), then take top-k. Re-ranking 5 results barely helps.

  3. Same k for all queries -- Simple factual queries need k=2-3; complex multi-part queries need k=8-10. Consider dynamic k based on query complexity.

  4. Ignoring score thresholds -- Returning low-similarity results pollutes the LLM context. Set a minimum similarity threshold and return fewer results when nothing relevant exists.

  5. Using HyDE for everything -- HyDE adds latency (extra LLM call) and can mislead when queries are specific. Use it selectively.
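The dynamic-k idea in anti-pattern 3 can start as a simple heuristic and be tuned against evaluation data later. A sketch (the signal set and thresholds in `choose_k` are illustrative assumptions, not benchmarked values):

```python
def choose_k(query: str, k_min: int = 2, k_max: int = 10) -> int:
    """Pick k from query-complexity signals: extra question marks,
    conjunctions/comparisons, and overall length."""
    words = query.lower().split()
    score = 0
    score += max(0, query.count("?") - 1)  # multi-part questions
    score += sum(w in {"and", "or", "versus", "vs", "compare"} for w in words)
    score += len(words) // 8               # long queries tend to be complex
    return max(k_min, min(k_max, k_min + 2 * score))

choose_k("What is a JWT?")  # → 2 (simple factual lookup)
choose_k("Compare OAuth2 and SAML and explain when to use each?")  # → 10
```

A heuristic like this is a placeholder; a query classifier or per-query retrieval evaluation will pick better values once you have labeled data.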


Related Skills

advanced-rag

Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.


chunking-strategies

Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.


embedding-models

Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.


rag-evaluation

Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.


rag-fundamentals

Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.


rag-production

Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.
