rag-evaluation
Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.
# RAG Evaluation
Measure and improve every stage of your RAG pipeline with systematic evaluation.
## Why RAG Evaluation Is Hard

RAG has two components that can fail independently:

- Retrieval can return irrelevant or incomplete context
- Generation can hallucinate, ignore context, or give irrelevant answers

You need separate metrics for each stage plus end-to-end metrics.
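As an illustrative heuristic (the function name and threshold here are ours, not from any framework), the two stage-level scores defined later in this skill can be combined into a quick triage: low context recall points at retrieval, good recall with low faithfulness points at generation.

```python
def diagnose_bottleneck(context_recall_score, faithfulness_score, threshold=0.7):
    """Rough triage from two stage-level scores (each 0.0-1.0).

    Low recall means the needed information never reached the generator;
    good recall but low faithfulness means the generator ignored or
    contradicted the context it was given.
    """
    if context_recall_score < threshold:
        return "retrieval"   # context is missing -> fix chunking/embeddings/search
    if faithfulness_score < threshold:
        return "generation"  # context was there -> fix prompt/model
    return "ok"
```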
## Building an Evaluation Dataset

Start here. Without a good eval set, you cannot measure anything.

```python
# Minimum viable eval set: 50-100 examples
eval_dataset = [
    {
        "question": "How do I reset my password?",
        "ground_truth": "Go to Settings > Security > Reset Password. You'll receive an email link.",
        "expected_sources": ["user-guide.md"],  # For retrieval eval
    },
    {
        "question": "What authentication methods are supported?",
        "ground_truth": "OAuth2, API keys, and SAML SSO are supported.",
        "expected_sources": ["auth-docs.md", "sso-guide.md"],
    },
]
```
```python
import random

# Generate eval questions from your corpus using an LLM
def generate_eval_questions(chunks, llm, n_questions=50):
    """Generate question-answer pairs from document chunks."""
    qa_pairs = []
    sampled = random.sample(chunks, min(len(chunks), n_questions * 2))
    for chunk in sampled:
        if len(qa_pairs) >= n_questions:
            break
        response = llm.invoke(
            f"""Generate a question-answer pair from this text.
The question should be something a user would naturally ask.
The answer should be directly supported by the text.

Text: {chunk.page_content}

Format:
Question: <question>
Answer: <answer>"""
        ).content
        q = a = ""
        for line in response.strip().split("\n"):
            if line.startswith("Question:"):
                q = line.replace("Question:", "").strip()
            elif line.startswith("Answer:"):
                a = line.replace("Answer:", "").strip()
        if q and a:
            qa_pairs.append({
                "question": q,
                "ground_truth": a,
                "expected_sources": [chunk.metadata.get("source", "unknown")],
            })
    return qa_pairs
```
## Retrieval Metrics

### Context Precision

Are the top-k retrieved chunks actually relevant?

```python
def context_precision(retrieved_docs, expected_sources, k=5):
    """What fraction of the top-k retrieved docs come from expected sources?"""
    relevant = 0
    for doc in retrieved_docs[:k]:
        source = doc.metadata.get("source", "")
        if source in expected_sources:
            relevant += 1
    return relevant / k

# Example: 3 out of 5 retrieved from expected sources = 0.6 precision
```
### Context Recall

Did we retrieve all the information needed to answer?

```python
def context_recall(retrieved_docs, ground_truth, llm):
    """Can the ground-truth answer be derived from the retrieved context?"""
    context = "\n".join(d.page_content for d in retrieved_docs)
    response = llm.invoke(
        f"""Given this context, determine what fraction of the ground truth
answer is supported by the context.

Context: {context}

Ground truth answer: {ground_truth}

Score from 0.0 to 1.0 where 1.0 means fully supported:"""
    ).content
    try:
        return float(response.strip())
    except ValueError:
        return 0.0
```
### Mean Reciprocal Rank (MRR)

How high does the first relevant result rank?

```python
def mean_reciprocal_rank(queries_results):
    """Calculate MRR across multiple queries.

    queries_results: list of (retrieved_docs, expected_sources) tuples
    """
    reciprocal_ranks = []
    for retrieved_docs, expected_sources in queries_results:
        for rank, doc in enumerate(retrieved_docs, 1):
            if doc.metadata.get("source", "") in expected_sources:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            # No relevant doc retrieved for this query
            reciprocal_ranks.append(0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```
### Normalized Discounted Cumulative Gain (nDCG)

```python
import numpy as np

def ndcg_at_k(retrieved_docs, expected_sources, k=5):
    """Calculate nDCG@k for a single query."""
    relevance = []
    for doc in retrieved_docs[:k]:
        if doc.metadata.get("source", "") in expected_sources:
            relevance.append(1)
        else:
            relevance.append(0)
    # DCG: relevance discounted by log2 of rank
    dcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(relevance))
    # Ideal DCG (all relevant docs at top)
    ideal_relevance = sorted(relevance, reverse=True)
    idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal_relevance))
    return dcg / idcg if idcg > 0 else 0.0
```
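To report these retrieval metrics over a whole eval set rather than a single query, a small driver helps. The sketch below is self-contained and assumes the retrieval results have already been reduced to ordered lists of source names (the function names `precision_at_k` and `aggregate_retrieval_metrics` are illustrative, not from any library):

```python
def precision_at_k(retrieved_sources, expected_sources, k=5):
    """Fraction of the top-k retrieved source names that are expected."""
    return sum(1 for s in retrieved_sources[:k] if s in expected_sources) / k

def aggregate_retrieval_metrics(results, k=5):
    """results: list of (retrieved_sources, expected_sources) pairs,
    one per eval question. Returns mean precision@k and MRR."""
    precisions, rranks = [], []
    for retrieved, expected in results:
        precisions.append(precision_at_k(retrieved, expected, k))
        for rank, src in enumerate(retrieved, 1):
            if src in expected:
                rranks.append(1 / rank)
                break
        else:
            rranks.append(0.0)
    n = len(results)
    return {"precision@k": sum(precisions) / n, "mrr": sum(rranks) / n}
```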
## Generation Metrics

### Faithfulness

Does the answer only use information from the retrieved context?

```python
def evaluate_faithfulness(answer, context, llm):
    """Score how faithful the answer is to the provided context."""
    prompt = f"""Evaluate whether the answer is faithful to the context.
A faithful answer only contains information that can be derived from the context.

Context: {context}

Answer: {answer}

For each sentence in the answer, determine if it is:
- SUPPORTED: directly supported by context
- NOT_SUPPORTED: not found in or contradicts context

Then give an overall faithfulness score from 0.0 to 1.0.

Evaluation:"""
    response = llm.invoke(prompt).content
    # Parse the last number in the response as the score
    for line in reversed(response.strip().split("\n")):
        try:
            score = float(line.strip().split()[-1])
            if 0 <= score <= 1:
                return score
        except (ValueError, IndexError):
            continue
    return 0.5  # Fallback when no score can be parsed
```
### Answer Relevance

Does the answer actually address the question?

```python
def evaluate_answer_relevance(question, answer, llm):
    """Score how relevant the answer is to the question."""
    prompt = f"""Rate how well the answer addresses the question.
Score 1.0 if perfectly relevant, 0.0 if completely irrelevant.

Question: {question}

Answer: {answer}

Relevance score (0.0 to 1.0):"""
    response = llm.invoke(prompt).content
    try:
        return float(response.strip())
    except ValueError:
        return 0.5
```
### Hallucination Detection

```python
def detect_hallucinations(answer, context, llm):
    """Identify specific hallucinated claims in the answer."""
    prompt = f"""Analyze the answer for hallucinations (claims not supported by context).

Context:
{context}

Answer:
{answer}

List each claim in the answer and mark it as:
- GROUNDED: supported by the context
- HALLUCINATED: not in the context

Claims:"""
    response = llm.invoke(prompt).content
    hallucinated = []
    grounded = []
    for line in response.strip().split("\n"):
        if "HALLUCINATED" in line.upper():
            hallucinated.append(line.split(":")[0].strip() if ":" in line else line)
        elif "GROUNDED" in line.upper():
            grounded.append(line.split(":")[0].strip() if ":" in line else line)
    total = len(hallucinated) + len(grounded)
    hallucination_rate = len(hallucinated) / total if total > 0 else 0
    return {
        "hallucination_rate": hallucination_rate,
        "hallucinated_claims": hallucinated,
        "grounded_claims": grounded,
    }
```
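The judge functions above can be run together across an eval set. The sketch below is framework-agnostic: `metric_fns` (our name) maps a metric name to any callable taking `(question, answer, context)` and returning a float, so the LLM-backed judges can be slotted in via small wrappers once an LLM client is available.

```python
def evaluate_generation(items, metric_fns):
    """items: dicts with 'question', 'answer', 'context'.
    metric_fns: {name: fn(question, answer, context) -> float in [0, 1]}.
    Returns per-item scores plus the mean of each metric."""
    per_item = []
    for item in items:
        scores = {
            name: fn(item["question"], item["answer"], item["context"])
            for name, fn in metric_fns.items()
        }
        per_item.append({"question": item["question"], **scores})
    means = {
        name: sum(row[name] for row in per_item) / len(per_item)
        for name in metric_fns
    }
    return {"per_item": per_item, "means": means}
```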
## RAGAS Framework

The standard framework for RAG evaluation.

```python
# pip install ragas
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)
from datasets import Dataset

# Prepare the dataset in RAGAS format
eval_data = {
    "question": [
        "How do I reset my password?",
        "What auth methods are supported?",
    ],
    "answer": [
        "Go to Settings > Security > Reset Password.",  # Generated answer
        "We support OAuth2 and API keys.",  # Generated answer
    ],
    "contexts": [
        ["To reset your password, navigate to Settings > Security > Reset Password. A link will be emailed."],
        ["Authentication methods include OAuth2, API keys, and SAML SSO."],
    ],
    "ground_truth": [
        "Go to Settings > Security > Reset Password. You'll receive an email link.",
        "OAuth2, API keys, and SAML SSO are supported.",
    ],
}
dataset = Dataset.from_dict(eval_data)

# Run the evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
    ],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 'context_precision': 0.85,
#  'context_recall': 0.78, 'answer_correctness': 0.83}

# Per-question results
df = results.to_pandas()
print(df)
```
### RAGAS Metrics Explained
| Metric | Measures | Needs | Range |
|---|---|---|---|
| faithfulness | Answer grounded in context? | answer, contexts | 0-1 |
| answer_relevancy | Answer addresses question? | question, answer | 0-1 |
| context_precision | Retrieved docs relevant? | question, contexts, ground_truth | 0-1 |
| context_recall | All needed info retrieved? | contexts, ground_truth | 0-1 |
| answer_correctness | Answer matches ground truth? | answer, ground_truth | 0-1 |
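To populate these columns from a live pipeline, each eval question is run through retrieval and generation and the retrieved texts are recorded as `contexts`. This is a sketch: the `retriever`/`rag_chain` interfaces are assumed (matching the LangChain-style `.invoke` used elsewhere in this skill), and `build_ragas_rows` is our own helper name.

```python
def build_ragas_rows(eval_dataset, retriever, rag_chain):
    """Run each eval question through retrieval + generation and collect
    the four columns RAGAS needs alongside the ground truth."""
    rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for item in eval_dataset:
        docs = retriever.invoke(item["question"])
        rows["question"].append(item["question"])
        rows["answer"].append(rag_chain.invoke(item["question"]))
        rows["contexts"].append([d.page_content for d in docs])
        rows["ground_truth"].append(item["ground_truth"])
    return rows
```

The resulting dict can be passed straight to `Dataset.from_dict` as in the example above.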
## A/B Testing Retrieval Strategies

```python
from collections import defaultdict

class RetrievalABTest:
    """A/B test different retrieval strategies."""

    def __init__(self, strategies: dict, eval_questions: list):
        self.strategies = strategies  # {"name": retriever}
        self.eval_questions = eval_questions
        self.results = defaultdict(list)

    def run(self):
        for item in self.eval_questions:
            question = item["question"]
            expected = item["expected_sources"]
            for name, retriever in self.strategies.items():
                docs = retriever.invoke(question)
                precision = context_precision(docs, expected)
                self.results[name].append({
                    "question": question,
                    "precision": precision,
                    "num_results": len(docs),
                })

    def summary(self):
        for name, metrics in self.results.items():
            avg_precision = sum(m["precision"] for m in metrics) / len(metrics)
            print(f"{name}: avg precision = {avg_precision:.3f}")

# Usage (assumes the retrievers are defined elsewhere)
test = RetrievalABTest(
    strategies={
        "dense_only": dense_retriever,
        "bm25_only": bm25_retriever,
        "hybrid_30_70": hybrid_retriever,
        "hybrid_50_50": hybrid_50_50_retriever,
    },
    eval_questions=eval_dataset,
)
test.run()
test.summary()
```
## Human Evaluation Protocol

Automated metrics cannot catch everything. Build a human eval loop.

```python
import random

# Generate evaluation samples
def create_human_eval_batch(questions, rag_chain, n=20):
    """Create a batch for human evaluation."""
    sampled = random.sample(questions, min(n, len(questions)))
    batch = []
    for q in sampled:
        result = rag_chain.invoke(q["question"])
        if isinstance(result, dict):
            answer = result["result"]
            sources = [d.metadata.get("source") for d in result.get("source_documents", [])]
        else:
            answer = result
            sources = []
        batch.append({
            "id": len(batch),
            "question": q["question"],
            "generated_answer": answer,
            "retrieved_sources": sources,
            # Human annotators fill these in:
            "correctness": None,        # 1-5 scale
            "completeness": None,       # 1-5 scale
            "harmfulness": None,        # yes/no
            "citation_accuracy": None,  # 1-5 scale
            "notes": "",
        })
    return batch

# Scoring rubric:
# Correctness (1-5): 1=wrong, 3=partially correct, 5=fully correct
# Completeness (1-5): 1=missing key info, 3=adequate, 5=comprehensive
# Citation accuracy (1-5): 1=wrong citations, 3=some correct, 5=all correct
```
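One simple way to hand such a batch to annotators is a spreadsheet. A minimal CSV export might look like this (a sketch; `batch_to_csv` is our own helper name, and the columns mirror the batch fields above):

```python
import csv
import io

def batch_to_csv(batch, fieldnames=("id", "question", "generated_answer",
                                    "retrieved_sources", "correctness",
                                    "completeness", "harmfulness",
                                    "citation_accuracy", "notes")):
    """Serialize a human-eval batch to CSV text, leaving score columns blank."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fieldnames), extrasaction="ignore")
    writer.writeheader()
    for row in batch:
        # Blank out None so annotators see empty cells to fill in
        writer.writerow({k: ("" if row.get(k) is None else row.get(k)) for k in fieldnames})
    return buf.getvalue()
```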
### Inter-Annotator Agreement

```python
from sklearn.metrics import cohen_kappa_score

def inter_annotator_agreement(annotator1_scores, annotator2_scores):
    """Calculate Cohen's Kappa between two annotators."""
    kappa = cohen_kappa_score(annotator1_scores, annotator2_scores)
    print(f"Cohen's Kappa: {kappa:.3f}")
    # < 0.20 = poor, 0.21-0.40 = fair, 0.41-0.60 = moderate,
    # 0.61-0.80 = substantial, 0.81-1.00 = almost perfect
    return kappa
```
## Continuous Evaluation in Production

```python
from datetime import datetime

class RAGMonitor:
    """Monitor RAG quality in production."""

    def __init__(self, db_connection):
        self.db = db_connection

    def log_interaction(self, question, answer, retrieved_docs, latency_ms, user_feedback=None):
        """Log every RAG interaction for later analysis."""
        self.db.insert("rag_interactions", {
            "timestamp": datetime.utcnow().isoformat(),
            "question": question,
            "answer": answer,
            "num_docs_retrieved": len(retrieved_docs),
            "top_similarity_score": retrieved_docs[0].metadata.get("score") if retrieved_docs else None,
            "sources": [d.metadata.get("source") for d in retrieved_docs],
            "latency_ms": latency_ms,
            "user_feedback": user_feedback,  # thumbs up/down
        })

    def get_quality_report(self, days=7):
        """Generate a quality report for the last N days."""
        interactions = self.db.query(
            f"SELECT * FROM rag_interactions WHERE timestamp > NOW() - INTERVAL '{days} days'"
        )
        total = len(interactions)
        thumbs_up = sum(1 for i in interactions if i["user_feedback"] == "positive")
        thumbs_down = sum(1 for i in interactions if i["user_feedback"] == "negative")
        avg_latency = sum(i["latency_ms"] for i in interactions) / total
        latencies = sorted(i["latency_ms"] for i in interactions)
        return {
            "total_queries": total,
            "satisfaction_rate": thumbs_up / (thumbs_up + thumbs_down) if (thumbs_up + thumbs_down) > 0 else None,
            "feedback_rate": (thumbs_up + thumbs_down) / total,
            "avg_latency_ms": avg_latency,
            "p95_latency_ms": latencies[int(total * 0.95)],
        }
```
## Anti-Patterns

- **Evaluating only end-to-end** -- If answers are bad, you need to know if retrieval or generation is the bottleneck. Always measure both separately.
- **Using the same LLM as judge and generator** -- The evaluator LLM may have the same blind spots. Use a different model or human eval for critical assessments.
- **Eval set too small** -- 10 questions is not an eval set. Aim for 50 minimum, 200+ for statistical significance across categories.
- **Not stratifying eval questions** -- Group questions by type (factual, comparative, procedural, etc.) and by difficulty. Overall averages hide category-specific failures.
- **One-time evaluation** -- RAG quality drifts as documents change. Run evaluations weekly or on every index update.
- **Ignoring low-confidence retrievals** -- Monitor the distribution of similarity scores. A spike in low scores means your corpus may be missing information or your queries are shifting.
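For the last anti-pattern, a simple distribution check over logged top-similarity scores can serve as an alert. This is a sketch with arbitrary thresholds (both numbers should be tuned to your corpus and embedding model); the function name is ours.

```python
def low_confidence_alert(top_scores, score_threshold=0.5, rate_threshold=0.2):
    """Flag when too many queries retrieve nothing similar.

    top_scores: the top-1 similarity score for each logged query.
    Returns (low_rate, alert): alert is True when the fraction of queries
    scoring below score_threshold exceeds rate_threshold.
    """
    if not top_scores:
        return 0.0, False
    low_rate = sum(1 for s in top_scores if s < score_threshold) / len(top_scores)
    return low_rate, low_rate > rate_threshold
```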
Install this skill directly: `skilldb add rag-pipeline-skills`
## Related Skills

- **advanced-rag** -- Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.
- **chunking-strategies** -- Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.
- **embedding-models** -- Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.
- **rag-fundamentals** -- Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.
- **rag-production** -- Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.
- **rag-with-langchain** -- Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.