retrieval-strategies
Covers retrieval strategies for RAG pipelines: dense retrieval, sparse retrieval (BM25), hybrid search, re-ranking with cross-encoders and Cohere Rerank, Maximal Marginal Relevance (MMR), contextual retrieval, and Hypothetical Document Embeddings (HyDE). Includes practical implementation patterns and guidance on when to use each strategy.
Retrieval Strategies
Maximize the relevance and diversity of retrieved context for LLM generation.
Dense Retrieval (Vector Search)
Encode queries and documents into dense embedding vectors, retrieve by similarity.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)
# Basic similarity search
results = vectorstore.similarity_search("How does auth work?", k=5)
# With score threshold
results = vectorstore.similarity_search_with_relevance_scores(
    "How does auth work?",
    k=5,
    score_threshold=0.7,  # Only return results above this similarity
)
Strengths: Captures semantic meaning, handles paraphrases, works across languages. Weaknesses: Misses exact keyword matches, struggles with rare terms, acronyms, IDs.
Sparse Retrieval (BM25)
Traditional keyword-based retrieval using term frequency and inverse document frequency.
from langchain.retrievers import BM25Retriever
# Build BM25 index from documents
bm25_retriever = BM25Retriever.from_documents(chunks, k=5)
# Query
results = bm25_retriever.invoke("JWT token expiration")
# Custom BM25 with rank-bm25 library
from rank_bm25 import BM25Okapi
corpus = [doc.page_content for doc in chunks]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
query_tokens = "JWT token expiration".lower().split()
scores = bm25.get_scores(query_tokens)
# Get top-k indices
import numpy as np
top_k_indices = np.argsort(scores)[-5:][::-1]
top_results = [(chunks[i], scores[i]) for i in top_k_indices]
Strengths: Exact keyword matching, handles rare terms/IDs well, no embedding needed, fast. Weaknesses: No semantic understanding, misses synonyms and paraphrases.
Hybrid Search
Combine dense and sparse retrieval to capture both semantic matches and exact keyword matches.
from langchain.retrievers import EnsembleRetriever
# Dense retriever
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Sparse retriever
sparse_retriever = BM25Retriever.from_documents(chunks, k=10)
# Ensemble with Reciprocal Rank Fusion
hybrid_retriever = EnsembleRetriever(
    retrievers=[sparse_retriever, dense_retriever],
    weights=[0.3, 0.7],  # Favor dense, but keep BM25 signal
)
results = hybrid_retriever.invoke("JWT token expiration policy")
Custom Reciprocal Rank Fusion (RRF)
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse multiple ranked result lists using RRF."""
    scores = {}
    for result_list in result_lists:
        for rank, doc in enumerate(result_list):
            doc_id = doc.metadata.get("id", doc.page_content[:50])
            if doc_id not in scores:
                scores[doc_id] = {"doc": doc, "score": 0}
            scores[doc_id]["score"] += 1 / (k + rank + 1)
    # Sort by fused score
    ranked = sorted(scores.values(), key=lambda x: x["score"], reverse=True)
    return [item["doc"] for item in ranked]
# Usage
dense_results = vectorstore.similarity_search(query, k=10)
sparse_results = bm25_retriever.invoke(query)
fused = reciprocal_rank_fusion([dense_results, sparse_results])[:5]
Weight Tuning Guidelines
| Query Type | Dense Weight | Sparse Weight | Why |
|---|---|---|---|
| Conceptual questions | 0.8 | 0.2 | Semantics matter more |
| Exact term lookup | 0.2 | 0.8 | Keywords matter more |
| General mixed queries | 0.6 | 0.4 | Balanced |
| Code search | 0.4 | 0.6 | Function names, variables are keywords |
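The table above can be turned into a simple query router. This is a minimal sketch under stated assumptions: the regex heuristics for spotting IDs, code identifiers, and question-style queries are illustrative guesses, not part of any library, and real systems often use a classifier instead.

```python
import re

def choose_weights(query: str) -> tuple:
    """Return (dense_weight, sparse_weight) from simple query heuristics."""
    # Identifiers, error codes, hex values: favor sparse (exact term lookup)
    if re.search(r"\b[A-Z]{2,}[-_]?\d+\b|0x[0-9a-fA-F]+", query):
        return (0.2, 0.8)
    # snake_case or camelCase tokens suggest code search
    if re.search(r"\b\w+_\w+\b|\b[a-z]+[A-Z]\w+\b", query):
        return (0.4, 0.6)
    # Question-style queries lean on semantics
    if query.strip().endswith("?") or query.lower().startswith(("how", "why", "what")):
        return (0.8, 0.2)
    # General mixed queries: balanced
    return (0.6, 0.4)
```

The returned pair can then be passed as `weights=[sparse_w, dense_w]` when constructing the `EnsembleRetriever` shown above (note the retriever order).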
Re-Ranking
Retrieve broadly (top 20-50 candidates), then re-rank with a more powerful model to select the best top-5.
Cohere Rerank
import cohere
co = cohere.Client()
def rerank_results(query, documents, top_n=5):
    """Re-rank documents using Cohere Rerank."""
    texts = [doc.page_content for doc in documents]
    response = co.rerank(
        query=query,
        documents=texts,
        model="rerank-english-v3.0",
        top_n=top_n,
    )
    reranked = []
    for result in response.results:
        doc = documents[result.index]
        doc.metadata["rerank_score"] = result.relevance_score
        reranked.append(doc)
    return reranked
# Pipeline: retrieve 20, rerank to 5
candidates = hybrid_retriever.invoke(query)[:20]
final_results = rerank_results(query, candidates, top_n=5)
Cross-Encoder Re-Ranking (Open Source)
from sentence_transformers import CrossEncoder
# Load cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
def cross_encoder_rerank(query, documents, top_n=5):
    """Re-rank using a cross-encoder model."""
    pairs = [(query, doc.page_content) for doc in documents]
    scores = reranker.predict(pairs)
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, score in scored_docs[:top_n]]
# More powerful models (slower but better):
# cross-encoder/ms-marco-MiniLM-L-12-v2 (fast, good)
# BAAI/bge-reranker-v2-m3 (multilingual, strong)
# BAAI/bge-reranker-v2-gemma (best quality, slow)
When to Re-Rank
- Always in production when quality matters -- re-ranking consistently improves retrieval by 5-15%
- Skip when latency budget is under 200ms total
- Skip when corpus is small (< 1000 chunks) and initial retrieval is already precise
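The bullets above amount to a small decision rule; a sketch, with the 200 ms and 1000-chunk thresholds taken directly from the guidance (tune both for your own stack):

```python
def should_rerank(latency_budget_ms: float, corpus_size: int) -> bool:
    """Decide whether to add a re-ranking stage to the pipeline."""
    if latency_budget_ms < 200:
        return False  # No room in the budget for an extra model pass
    if corpus_size < 1000:
        return False  # Initial retrieval is likely precise enough
    return True       # Default to re-ranking in production
```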
Maximal Marginal Relevance (MMR)
Reduce redundancy in results by balancing relevance with diversity.
# Built into most vector stores
results = vectorstore.max_marginal_relevance_search(
    query="authentication methods",
    k=5,              # Return 5 results
    fetch_k=20,       # Consider 20 candidates
    lambda_mult=0.7,  # 0 = max diversity, 1 = max relevance
)
# Manual MMR implementation
import numpy as np
def mmr(query_embedding, candidate_embeddings, candidate_docs, k=5, lambda_mult=0.7):
    """Maximal Marginal Relevance selection."""
    # Assumes embeddings are L2-normalized, so dot product == cosine similarity
    query_sim = np.dot(candidate_embeddings, query_embedding)
    selected = []
    remaining = list(range(len(candidate_docs)))
    for _ in range(k):
        best_score = -np.inf
        best_idx = None
        for idx in remaining:
            relevance = query_sim[idx]
            if selected:
                selected_embeddings = candidate_embeddings[selected]
                redundancy = max(np.dot(selected_embeddings, candidate_embeddings[idx]))
            else:
                redundancy = 0
            mmr_score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if mmr_score > best_score:
                best_score = mmr_score
                best_idx = idx
        selected.append(best_idx)
        remaining.remove(best_idx)
    return [candidate_docs[i] for i in selected]
Use when: Retrieved chunks are repetitive, documents have overlapping content, you need diverse perspectives.
Contextual Retrieval
Prepend document-level context to each chunk before embedding to reduce ambiguity.
def add_contextual_prefix(chunk_text, document_title, section_header):
    """Prepend context to reduce chunk ambiguity."""
    prefix = f"From document: {document_title}"
    if section_header:
        prefix += f", section: {section_header}"
    return f"{prefix}\n\n{chunk_text}"
# Example: a chunk saying "It supports OAuth2" is ambiguous
# After: "From document: API Gateway Guide, section: Authentication\n\nIt supports OAuth2"
# Anthropic's Contextual Retrieval approach: use LLM to generate context
def generate_chunk_context(chunk_text, full_document, llm):
    """Use LLM to generate situating context for a chunk."""
    prompt = f"""Here is the full document:
<document>
{full_document[:5000]}
</document>

Here is a chunk from that document:
<chunk>
{chunk_text}
</chunk>

Give a short (1-2 sentence) context that situates this chunk within the document.
Focus on what the chunk is about and what document it comes from."""
    return llm.invoke(prompt).content
# Embed the contextualized chunk
contextualized = generate_chunk_context(chunk, document, llm)
text_to_embed = f"{contextualized}\n\n{chunk}"
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer to the query, embed that instead of the raw query.
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
def hyde_retrieval(query, vectorstore, k=5):
    """Generate hypothetical document, then search with its embedding."""
    # Step 1: Generate hypothetical answer
    hypothetical = llm.invoke(
        f"Write a short paragraph that would answer this question: {query}"
    ).content
    # Step 2: Embed the hypothetical answer (not the query)
    hyde_embedding = embeddings.embed_query(hypothetical)
    # Step 3: Search with hypothetical embedding
    results = vectorstore.similarity_search_by_vector(hyde_embedding, k=k)
    return results
# HyDE works because the hypothetical answer is closer in embedding space
# to the actual documents than the original question is
When HyDE helps: Vague queries, questions with different vocabulary than the documents. When HyDE hurts: Factual lookups, queries with specific terms that must match exactly.
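That guidance can be encoded as a routing gate that only triggers HyDE for vague, natural-language queries. A minimal sketch, assuming some illustrative specificity heuristics (quoted phrases, digits, acronyms, identifiers, very short keyword queries); these cutoffs are assumptions, not established thresholds:

```python
import re

def should_use_hyde(query: str) -> bool:
    """Use HyDE only for vague, natural-language queries."""
    # Specific lookups should match documents directly --
    # HyDE may paraphrase exact terms away.
    if '"' in query or re.search(r"\d", query):
        return False
    if re.search(r"\b[A-Z]{2,}\b|\b\w+_\w+\b", query):  # acronyms, identifiers
        return False
    # Short keyword queries are usually specific too
    if len(query.split()) < 4:
        return False
    return True
```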
Putting It All Together: Production Retrieval Pipeline
class ProductionRetriever:
    def __init__(self, vectorstore, chunks, reranker, llm):
        self.dense = vectorstore.as_retriever(search_kwargs={"k": 20})
        self.sparse = BM25Retriever.from_documents(chunks, k=20)
        self.reranker = reranker
        self.llm = llm

    def retrieve(self, query, top_k=5, use_hyde=False):
        # Optional: HyDE for vague queries
        search_query = query
        if use_hyde:
            hypothetical = self.llm.invoke(
                f"Write a paragraph answering: {query}"
            ).content
            search_query = hypothetical
        # Hybrid retrieval
        dense_results = self.dense.invoke(search_query)
        sparse_results = self.sparse.invoke(query)  # Always use original for BM25
        candidates = reciprocal_rank_fusion([dense_results, sparse_results])[:20]
        # Re-rank
        final = cross_encoder_rerank(query, candidates, top_n=top_k)
        return final
Anti-Patterns
- Dense-only retrieval -- Misses exact keyword matches for IDs, error codes, and product names. Always consider hybrid search.
- Re-ranking the final k results -- Re-rank a larger candidate set (20-50), then take top-k. Re-ranking 5 results barely helps.
- Same k for all queries -- Simple factual queries need k=2-3; complex multi-part queries need k=8-10. Consider dynamic k based on query complexity.
- Ignoring score thresholds -- Returning low-similarity results pollutes the LLM context. Set a minimum similarity threshold and return fewer results when nothing relevant exists.
- Using HyDE for everything -- HyDE adds latency (an extra LLM call) and can mislead when queries are specific. Use it selectively.
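Two of these anti-patterns (same k for all queries, ignoring score thresholds) have mechanical fixes. A hedged sketch: the complexity heuristic and the 0.7 floor are illustrative assumptions to be tuned against your own evaluation set.

```python
def dynamic_k(query: str) -> int:
    """Pick k from rough query complexity: more sub-parts -> more chunks."""
    words = query.lower().split()
    complexity = sum(w in ("and", "or", "versus", "compare", "both") for w in words)
    if query.count("?") > 1:
        complexity += query.count("?") - 1  # Multiple explicit questions
    if complexity == 0 and len(words) <= 6:
        return 3    # Simple factual query
    if complexity >= 2:
        return 10   # Complex multi-part query
    return 5        # Default

def filter_by_threshold(scored_results, min_score=0.7):
    """Keep only (doc, score) pairs above the floor -- returning fewer is fine."""
    return [(doc, s) for doc, s in scored_results if s >= min_score]
```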
Install this skill directly: skilldb add rag-pipeline-skills
Related Skills
advanced-rag
Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.
chunking-strategies
Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.
embedding-models
Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.
rag-evaluation
Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.
rag-fundamentals
Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.
rag-production
Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.