
rag-evaluation

Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.


RAG Evaluation

Measure and improve every stage of your RAG pipeline with systematic evaluation.


Why RAG Evaluation Is Hard

RAG has two components that can fail independently:

  1. Retrieval can return irrelevant or incomplete context
  2. Generation can hallucinate, ignore context, or give irrelevant answers

You need separate metrics for each stage plus end-to-end metrics.
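
To make that concrete, here is a minimal sketch of a two-stage harness. The names retriever, rag_chain, and judge_llm are assumptions; it reuses the context_precision and evaluate_faithfulness helpers defined later in this skill, over an eval set in the format shown in the next section:

def evaluate_stages(eval_dataset, retriever, rag_chain, judge_llm):
    """Score retrieval and generation separately so you can see which stage fails."""
    retrieval_scores, generation_scores = [], []
    for item in eval_dataset:
        # Stage 1: retrieval quality against the expected sources
        docs = retriever.invoke(item["question"])
        retrieval_scores.append(context_precision(docs, item["expected_sources"]))

        # Stage 2: generation quality against the retrieved context
        answer = rag_chain.invoke(item["question"])
        context = "\n".join(d.page_content for d in docs)
        generation_scores.append(evaluate_faithfulness(answer, context, judge_llm))

    return {
        "avg_context_precision": sum(retrieval_scores) / len(retrieval_scores),
        "avg_faithfulness": sum(generation_scores) / len(generation_scores),
    }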


Building an Evaluation Dataset

Start here. Without a good eval set, you cannot measure anything.

# Minimum viable eval set: 50-100 examples
eval_dataset = [
    {
        "question": "How do I reset my password?",
        "ground_truth": "Go to Settings > Security > Reset Password. You'll receive an email link.",
        "expected_sources": ["user-guide.md"],  # For retrieval eval
    },
    {
        "question": "What authentication methods are supported?",
        "ground_truth": "OAuth2, API keys, and SAML SSO are supported.",
        "expected_sources": ["auth-docs.md", "sso-guide.md"],
    },
]

# Generate eval questions from your corpus using an LLM
import random

def generate_eval_questions(chunks, llm, n_questions=50):
    """Generate question-answer pairs from document chunks."""
    qa_pairs = []
    sampled = random.sample(chunks, min(len(chunks), n_questions * 2))

    for chunk in sampled:
        if len(qa_pairs) >= n_questions:
            break
        response = llm.invoke(
            f"""Generate a question-answer pair from this text.
The question should be something a user would naturally ask.
The answer should be directly supported by the text.

Text: {chunk.page_content}

Format:
Question: <question>
Answer: <answer>"""
        ).content

        lines = response.strip().split("\n")
        q = a = ""
        for line in lines:
            if line.startswith("Question:"):
                q = line.replace("Question:", "").strip()
            elif line.startswith("Answer:"):
                a = line.replace("Answer:", "").strip()
        if q and a:
            qa_pairs.append({
                "question": q,
                "ground_truth": a,
                "expected_sources": [chunk.metadata.get("source", "unknown")],
            })

    return qa_pairs
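
LLM-generated pairs are noisy, so review them manually before trusting them. A minimal sanity filter is sketched below; the word-count thresholds are arbitrary assumptions:

def filter_generated_pairs(qa_pairs, min_question_words=4, min_answer_words=3):
    """Drop obviously low-quality generated pairs before manual review."""
    kept = []
    for pair in qa_pairs:
        if len(pair["question"].split()) < min_question_words:
            continue  # too short to be a realistic user question
        if len(pair["ground_truth"].split()) < min_answer_words:
            continue  # answer too thin to evaluate against
        kept.append(pair)
    return kept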

Retrieval Metrics

Context Precision

Are the top-k retrieved chunks actually relevant?

def context_precision(retrieved_docs, expected_sources, k=5):
    """What fraction of retrieved docs are from expected sources?"""
    relevant = 0
    for doc in retrieved_docs[:k]:
        source = doc.metadata.get("source", "")
        if source in expected_sources:
            relevant += 1
    return relevant / k

# Example: 3 out of 5 retrieved from expected sources = 0.6 precision

Context Recall

Did we retrieve all the information needed to answer?

def context_recall(retrieved_docs, ground_truth, llm):
    """Can the ground truth answer be derived from retrieved context?"""
    context = "\n".join(d.page_content for d in retrieved_docs)
    response = llm.invoke(
        f"""Given this context, determine what fraction of the ground truth
answer is supported by the context.

Context: {context}

Ground truth answer: {ground_truth}

Score from 0.0 to 1.0 where 1.0 means fully supported:"""
    ).content
    try:
        return float(response.strip())
    except ValueError:
        return 0.0

Mean Reciprocal Rank (MRR)

How high does the first relevant result rank?

def mean_reciprocal_rank(queries_results):
    """Calculate MRR across multiple queries.
    queries_results: list of (retrieved_docs, expected_sources) tuples
    """
    reciprocal_ranks = []
    for retrieved_docs, expected_sources in queries_results:
        for rank, doc in enumerate(retrieved_docs, 1):
            if doc.metadata.get("source", "") in expected_sources:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            reciprocal_ranks.append(0)

    return sum(reciprocal_ranks) / len(reciprocal_ranks)
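
A worked example with hypothetical ranks:

# Worked example (hypothetical): first relevant result at rank 1, rank 3, and never found
# Reciprocal ranks: 1/1 = 1.0, 1/3 = 0.333, 0
# MRR = (1.0 + 0.333 + 0) / 3 = 0.444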

Normalized Discounted Cumulative Gain (nDCG)

Do relevant results appear near the top? Gains are discounted logarithmically by rank.

import numpy as np

def ndcg_at_k(retrieved_docs, expected_sources, k=5):
    """Calculate nDCG@k for a single query."""
    relevance = []
    for doc in retrieved_docs[:k]:
        if doc.metadata.get("source", "") in expected_sources:
            relevance.append(1)
        else:
            relevance.append(0)

    # DCG
    dcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(relevance))

    # Ideal DCG (all relevant docs at top)
    ideal_relevance = sorted(relevance, reverse=True)
    idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal_relevance))

    return dcg / idcg if idcg > 0 else 0
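
A worked example with a hypothetical relevance pattern:

# Worked example (hypothetical): relevance of the top 5 results = [0, 1, 1, 0, 0]
# DCG  = 1/log2(3) + 1/log2(4)                        = 0.631 + 0.500 = 1.131
# IDCG = 1/log2(2) + 1/log2(3)  (ideal order [1, 1, 0, 0, 0]) = 1.631
# nDCG = 1.131 / 1.631 ≈ 0.69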

Generation Metrics

Faithfulness

Does the answer only use information from the retrieved context?

def evaluate_faithfulness(answer, context, llm):
    """Score how faithful the answer is to the provided context."""
    prompt = f"""Evaluate whether the answer is faithful to the context.
A faithful answer only contains information that can be derived from the context.

Context: {context}

Answer: {answer}

For each sentence in the answer, determine if it is:
- SUPPORTED: directly supported by context
- NOT_SUPPORTED: not found in or contradicts context

Then give an overall faithfulness score from 0.0 to 1.0.

Evaluation:"""
    response = llm.invoke(prompt).content
    # Parse score from response
    lines = response.strip().split("\n")
    for line in reversed(lines):
        try:
            score = float(line.strip().split()[-1])
            if 0 <= score <= 1:
                return score
        except (ValueError, IndexError):
            continue
    return 0.5  # Fallback

Answer Relevance

Does the answer actually address the question?

def evaluate_answer_relevance(question, answer, llm):
    """Score how relevant the answer is to the question."""
    prompt = f"""Rate how well the answer addresses the question.
Score 1.0 if perfectly relevant, 0.0 if completely irrelevant.

Question: {question}
Answer: {answer}

Relevance score (0.0 to 1.0):"""
    response = llm.invoke(prompt).content
    try:
        return float(response.strip())
    except ValueError:
        return 0.5

Hallucination Detection

def detect_hallucinations(answer, context, llm):
    """Identify specific hallucinated claims in the answer."""
    prompt = f"""Analyze the answer for hallucinations (claims not supported by context).

Context:
{context}

Answer:
{answer}

List each claim in the answer and mark it as:
- GROUNDED: supported by the context
- HALLUCINATED: not in the context

Claims:"""
    response = llm.invoke(prompt).content

    hallucinated = []
    grounded = []
    for line in response.strip().split("\n"):
        if "HALLUCINATED" in line.upper():
            hallucinated.append(line.split(":")[0].strip() if ":" in line else line)
        elif "GROUNDED" in line.upper():
            grounded.append(line.split(":")[0].strip() if ":" in line else line)

    total = len(hallucinated) + len(grounded)
    hallucination_rate = len(hallucinated) / total if total > 0 else 0

    return {
        "hallucination_rate": hallucination_rate,
        "hallucinated_claims": hallucinated,
        "grounded_claims": grounded,
    }

RAGAS Framework

The standard framework for RAG evaluation.

# pip install ragas

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)
from datasets import Dataset

# Prepare dataset in RAGAS format
eval_data = {
    "question": [
        "How do I reset my password?",
        "What auth methods are supported?",
    ],
    "answer": [
        "Go to Settings > Security > Reset Password.",       # Generated answer
        "We support OAuth2 and API keys.",                     # Generated answer
    ],
    "contexts": [
        ["To reset your password, navigate to Settings > Security > Reset Password. A link will be emailed."],
        ["Authentication methods include OAuth2, API keys, and SAML SSO."],
    ],
    "ground_truth": [
        "Go to Settings > Security > Reset Password. You'll receive an email link.",
        "OAuth2, API keys, and SAML SSO are supported.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
    ],
)

print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 'context_precision': 0.85,
#  'context_recall': 0.78, 'answer_correctness': 0.83}

# Per-question results
df = results.to_pandas()
print(df)

RAGAS Metrics Explained

| Metric | Measures | Needs | Range |
|---|---|---|---|
| faithfulness | Answer grounded in context? | answer, contexts | 0-1 |
| answer_relevancy | Answer addresses question? | question, answer | 0-1 |
| context_precision | Retrieved docs relevant? | question, contexts, ground_truth | 0-1 |
| context_recall | All needed info retrieved? | contexts, ground_truth | 0-1 |
| answer_correctness | Answer matches ground truth? | answer, ground_truth | 0-1 |

A/B Testing Retrieval Strategies

import random
from collections import defaultdict

class RetrievalABTest:
    """A/B test different retrieval strategies."""

    def __init__(self, strategies: dict, eval_questions: list):
        self.strategies = strategies  # {"name": retriever}
        self.eval_questions = eval_questions
        self.results = defaultdict(list)

    def run(self):
        for item in self.eval_questions:
            question = item["question"]
            expected = item["expected_sources"]
            ground_truth = item["ground_truth"]

            for name, retriever in self.strategies.items():
                docs = retriever.invoke(question)
                precision = context_precision(docs, expected)
                self.results[name].append({
                    "question": question,
                    "precision": precision,
                    "num_results": len(docs),
                })

    def summary(self):
        for name, metrics in self.results.items():
            avg_precision = sum(m["precision"] for m in metrics) / len(metrics)
            print(f"{name}: avg precision = {avg_precision:.3f}")

# Usage
test = RetrievalABTest(
    strategies={
        "dense_only": dense_retriever,
        "bm25_only": bm25_retriever,
        "hybrid_30_70": hybrid_retriever,
        "hybrid_50_50": hybrid_50_50_retriever,
    },
    eval_questions=eval_dataset,
)
test.run()
test.summary()
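
With a small eval set, differences in average precision can be noise. A minimal sketch of a paired significance check on per-question precision, assuming scipy is installed and every strategy was run on the same questions:

from scipy import stats

def compare_strategies(results, name_a, name_b):
    """Paired t-test on per-question precision for two strategies."""
    scores_a = [m["precision"] for m in results[name_a]]
    scores_b = [m["precision"] for m in results[name_b]]
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    print(f"{name_a} vs {name_b}: t={t_stat:.2f}, p={p_value:.4f}")
    return p_value

# Usage (after test.run()):
# compare_strategies(test.results, "dense_only", "hybrid_30_70")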

Human Evaluation Protocol

Automated metrics cannot catch everything. Build a human eval loop.

# Generate evaluation samples
def create_human_eval_batch(questions, rag_chain, n=20):
    """Create a batch for human evaluation."""
    import random
    sampled = random.sample(questions, min(n, len(questions)))
    batch = []

    for q in sampled:
        result = rag_chain.invoke(q["question"])
        is_dict = isinstance(result, dict)
        source_docs = result.get("source_documents", []) if is_dict else []
        batch.append({
            "id": len(batch),
            "question": q["question"],
            "generated_answer": result["result"] if is_dict else result,
            "retrieved_sources": [d.metadata.get("source") for d in source_docs],
            # Human fills these in:
            "correctness": None,       # 1-5 scale
            "completeness": None,      # 1-5 scale
            "harmfulness": None,       # yes/no
            "citation_accuracy": None, # 1-5 scale
            "notes": "",
        })

    return batch

# Scoring rubric:
# Correctness (1-5): 1=wrong, 3=partially correct, 5=fully correct
# Completeness (1-5): 1=missing key info, 3=adequate, 5=comprehensive
# Citation accuracy (1-5): 1=wrong citations, 3=some correct, 5=all correct
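
Once annotators return a completed batch, aggregate the scores. A small sketch assuming the field names from create_human_eval_batch above:

def summarize_human_eval(completed_batch):
    """Average each 1-5 dimension across a completed annotation batch."""
    summary = {}
    for dim in ["correctness", "completeness", "citation_accuracy"]:
        scores = [item[dim] for item in completed_batch if item[dim] is not None]
        summary[dim] = sum(scores) / len(scores) if scores else None
    summary["harmful_count"] = sum(1 for item in completed_batch if item["harmfulness"] == "yes")
    return summary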

Inter-Annotator Agreement

from sklearn.metrics import cohen_kappa_score

def inter_annotator_agreement(annotator1_scores, annotator2_scores):
    """Calculate Cohen's Kappa between two annotators."""
    kappa = cohen_kappa_score(annotator1_scores, annotator2_scores)
    print(f"Cohen's Kappa: {kappa:.3f}")
    # < 0.20 = poor, 0.21-0.40 = fair, 0.41-0.60 = moderate,
    # 0.61-0.80 = substantial, 0.81-1.00 = almost perfect
    return kappa
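
Usage sketch with hypothetical scores from two annotators rating the same eight answers on the 1-5 correctness scale:

# Hypothetical example: two annotators, same 8 answers, 1-5 correctness scores
annotator1 = [5, 4, 3, 5, 2, 4, 5, 3]
annotator2 = [5, 4, 2, 5, 2, 3, 5, 3]
inter_annotator_agreement(annotator1, annotator2)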

Continuous Evaluation in Production

from datetime import datetime

class RAGMonitor:
    """Monitor RAG quality in production."""

    def __init__(self, db_connection):
        self.db = db_connection

    def log_interaction(self, question, answer, retrieved_docs, latency_ms, user_feedback=None):
        """Log every RAG interaction for later analysis."""
        self.db.insert("rag_interactions", {
            "timestamp": datetime.utcnow().isoformat(),
            "question": question,
            "answer": answer,
            "num_docs_retrieved": len(retrieved_docs),
            "top_similarity_score": retrieved_docs[0].metadata.get("score") if retrieved_docs else None,
            "sources": [d.metadata.get("source") for d in retrieved_docs],
            "latency_ms": latency_ms,
            "user_feedback": user_feedback,  # thumbs up/down
        })

    def get_quality_report(self, days=7):
        """Generate quality report for the last N days."""
        interactions = self.db.query(f"SELECT * FROM rag_interactions WHERE timestamp > NOW() - INTERVAL '{days} days'")
        total = len(interactions)
        if total == 0:
            return {"total_queries": 0}
        thumbs_up = sum(1 for i in interactions if i["user_feedback"] == "positive")
        thumbs_down = sum(1 for i in interactions if i["user_feedback"] == "negative")
        avg_latency = sum(i["latency_ms"] for i in interactions) / total

        return {
            "total_queries": total,
            "satisfaction_rate": thumbs_up / (thumbs_up + thumbs_down) if (thumbs_up + thumbs_down) > 0 else None,
            "feedback_rate": (thumbs_up + thumbs_down) / total,
            "avg_latency_ms": avg_latency,
            "p95_latency_ms": sorted([i["latency_ms"] for i in interactions])[int(total * 0.95)],
        }
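
To catch drift, re-run your eval set on a schedule and compare against a stored baseline. A minimal sketch; run_eval_suite, load_baseline, and alert_team are hypothetical hooks, and the regression threshold is arbitrary:

def check_for_regression(current_metrics, baseline_metrics, threshold=0.05):
    """Flag metrics that dropped more than `threshold` versus the stored baseline."""
    regressions = {}
    for name, baseline_value in baseline_metrics.items():
        current_value = current_metrics.get(name)
        if current_value is not None and baseline_value - current_value > threshold:
            regressions[name] = {"baseline": baseline_value, "current": current_value}
    return regressions

# Run from a weekly job or after every index update, e.g.:
# regressions = check_for_regression(run_eval_suite(), load_baseline())
# if regressions: alert_team(regressions)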

Anti-Patterns

  1. Evaluating only end-to-end -- If answers are bad, you need to know if retrieval or generation is the bottleneck. Always measure both separately.

  2. Using the same LLM as judge and generator -- The evaluator LLM may have the same blind spots. Use a different model or human eval for critical assessments.

  3. Eval set too small -- 10 questions is not an eval set. Aim for 50 minimum, 200+ for statistical significance across categories.

  4. Not stratifying eval questions -- Group questions by type (factual, comparative, procedural, etc.) and by difficulty. Overall averages hide category-specific failures.

  5. One-time evaluation -- RAG quality drifts as documents change. Run evaluations weekly or on every index update.

  6. Ignoring low-confidence retrievals -- Monitor the distribution of similarity scores. A spike in low scores means your corpus may be missing information or your queries are shifting.
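
For the last point, a minimal sketch that flags low-confidence retrievals, assuming the rag_interactions rows logged by RAGMonitor above and an arbitrary score threshold:

def low_confidence_report(interactions, score_threshold=0.5):
    """Share of queries whose best retrieved chunk scored below the threshold."""
    scored = [i for i in interactions if i["top_similarity_score"] is not None]
    low = [i for i in scored if i["top_similarity_score"] < score_threshold]
    return {
        "low_confidence_rate": len(low) / len(scored) if scored else None,
        "sample_questions": [i["question"] for i in low[:10]],  # inspect for corpus gaps
    }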


Related Skills

advanced-rag

Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.

chunking-strategies

Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.

embedding-models

Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.

rag-fundamentals

Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.

rag-production

Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.

rag-with-langchain

Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.
