llm-eval-fundamentals

Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".


LLM Evaluation Fundamentals

Build rigorous evaluation pipelines for LLM-powered applications. This skill covers metric selection, dataset construction, baseline tracking, and regression detection.


Why Evaluation Matters

Vibes-based testing ("it looks good") fails at scale. Without structured evals you cannot:

  • Detect regressions when you change prompts, models, or retrieval logic
  • Compare model providers or versions objectively
  • Gate deployments on quality thresholds
  • Justify AI spending to stakeholders

Evaluation is not optional — it is the test suite for AI applications.


Metric Taxonomy

1. Exact Match Metrics

Use when the expected output is deterministic.

def exact_match(predicted: str, expected: str) -> float:
    return 1.0 if predicted.strip() == expected.strip() else 0.0

def normalized_exact_match(predicted: str, expected: str) -> float:
    """Case-insensitive, whitespace-normalized exact match."""
    normalize = lambda s: " ".join(s.lower().split())
    return 1.0 if normalize(predicted) == normalize(expected) else 0.0

Best for: classification, entity extraction, code generation with known outputs.

2. Token-Overlap Metrics

from collections import Counter

def f1_score(predicted: str, expected: str) -> float:
    pred_tokens = predicted.lower().split()
    exp_tokens = expected.lower().split()
    common = Counter(pred_tokens) & Counter(exp_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)

Best for: QA tasks and rough-pass summarization scoring.
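Token-overlap scores improve noticeably when both strings are normalized first. QA benchmarks typically lowercase and strip punctuation and articles before computing F1 (this is how the SQuAD evaluation script works); a sketch along those lines:

```python
import re
import string

def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation and articles."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)  # drop English articles
    return " ".join(s.split())
```

Feed both strings through normalize_answer before calling f1_score so that "The answer is 42." and "answer: 42" compare fairly.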

3. Semantic Similarity

import numpy as np
from openai import OpenAI

client = OpenAI()

def semantic_similarity(text_a: str, text_b: str, model="text-embedding-3-small") -> float:
    """Cosine similarity between embeddings."""
    resp = client.embeddings.create(input=[text_a, text_b], model=model)
    a = np.array(resp.data[0].embedding)
    b = np.array(resp.data[1].embedding)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Usage
score = semantic_similarity(
    "The cat sat on the mat",
    "A feline rested on a rug"
)
# Paraphrases score well above unrelated pairs, but absolute values vary
# by embedding model; calibrate the pass threshold empirically

Best for: open-ended generation, paraphrasing, translation.

4. LLM-as-Judge

JUDGE_PROMPT = """You are an expert evaluator. Score the following response on a scale of 1-5.

Criteria:
- Accuracy: Does the response contain correct information?
- Completeness: Does it address all parts of the question?
- Clarity: Is it well-organized and easy to understand?

Question: {question}
Expected answer: {expected}
Actual response: {response}

Respond with JSON: {{"accuracy": int, "completeness": int, "clarity": int, "reasoning": str}}"""

import json

from openai import AsyncOpenAI

async_client = AsyncOpenAI()  # the judge call is awaited, so it needs the async client

async def llm_judge(question: str, expected: str, response: str) -> dict:
    result = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, expected=expected, response=response
        )}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)

Best for: subjective quality, style, reasoning depth. See llm-as-judge.md for full coverage.
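Judge responses are still model output and can drift out of range or omit fields. A small validation sketch (the field names mirror JUDGE_PROMPT above; the clamp-to-rubric policy is an assumption, not part of the prompt):

```python
def validate_judge_scores(raw: dict) -> dict:
    """Clamp judge scores to the 1-5 rubric and fail loudly on missing fields."""
    scores = {}
    for field in ("accuracy", "completeness", "clarity"):
        if field not in raw:
            raise ValueError(f"judge response missing field: {field}")
        # Clamp out-of-range values rather than silently trusting the judge
        scores[field] = max(1, min(5, int(raw[field])))
    scores["reasoning"] = str(raw.get("reasoning", ""))
    return scores
```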


Building Eval Datasets

Dataset Structure

from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalCase:
    id: str
    input: str                    # The prompt or question
    expected: str                 # Gold-standard answer
    metadata: dict                # Tags: difficulty, category, source
    context: Optional[str] = None # Retrieved docs (for RAG evals)

# Store as JSONL for streaming large datasets
import jsonlines

def save_dataset(cases: list[EvalCase], path: str):
    with jsonlines.open(path, mode="w") as writer:
        for case in cases:
            writer.write(vars(case))

Dataset Sources

Source              Method                                       Size
Production logs     Sample real queries, manually label          50-200
Domain experts      Hand-craft edge cases                        20-50
Synthetic           Use a stronger model to generate Q&A pairs   100-500
Public benchmarks   MMLU, HumanEval, etc.                        1000+
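For the production-log route, a stratified sample keeps rare categories represented in the labeling queue. A sketch (assumes each log record carries a category field; the field name is an assumption):

```python
import random
from collections import defaultdict

def stratified_sample(records: list[dict], per_category: int, seed: int = 42) -> list[dict]:
    """Sample up to `per_category` records from each category for manual labeling."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_cat = defaultdict(list)
    for r in records:
        by_cat[r.get("category", "uncategorized")].append(r)
    sample = []
    for cat, items in sorted(by_cat.items()):
        rng.shuffle(items)
        sample.extend(items[:per_category])
    return sample
```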

Golden Test Sets

Curate a small (30-100 cases) high-quality "golden set" that:

  • Covers all major use-case categories
  • Includes known edge cases
  • Has expert-verified expected outputs
  • Runs fast enough to execute on every PR
# golden_set.jsonl — keep in version control
{"id": "g-001", "input": "Summarize this contract clause...", "expected": "The clause limits liability to...", "metadata": {"category": "legal", "difficulty": "medium"}}
{"id": "g-002", "input": "Extract entities from...", "expected": "[{\"name\": \"Acme\", \"type\": \"ORG\"}]", "metadata": {"category": "extraction", "difficulty": "easy"}}

Establishing Baselines

import json
from pathlib import Path
from datetime import datetime, timezone

class BaselineTracker:
    def __init__(self, path: str = "eval_baselines.json"):
        self.path = Path(path)
        self.baselines = json.loads(self.path.read_text()) if self.path.exists() else {}

    def record(self, run_name: str, metrics: dict):
        self.baselines[run_name] = {
            "metrics": metrics,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        self.path.write_text(json.dumps(self.baselines, indent=2))

    def check_regression(self, current: dict, baseline_name: str, threshold: float = 0.02) -> list[str]:
        """Return list of metrics that regressed beyond threshold."""
        baseline = self.baselines[baseline_name]["metrics"]
        regressions = []
        for key, baseline_val in baseline.items():
            current_val = current.get(key, 0)
            if baseline_val - current_val > threshold:
                regressions.append(
                    f"{key}: {baseline_val:.3f} -> {current_val:.3f} "
                    f"(delta: {current_val - baseline_val:+.3f})"
                )
        return regressions

# Usage
tracker = BaselineTracker()
tracker.record("v1.0-gpt4o", {"accuracy": 0.92, "f1": 0.88, "latency_p50": 1.2})

regressions = tracker.check_regression(
    current={"accuracy": 0.89, "f1": 0.87, "latency_p50": 1.5},
    baseline_name="v1.0-gpt4o",
    threshold=0.02
)
if regressions:
    print("REGRESSION DETECTED:")
    for r in regressions:
        print(f"  - {r}")

Regression Detection in CI

# .github/workflows/eval.yml
name: LLM Eval
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - name: Run golden set evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python run_evals.py --dataset golden_set.jsonl --output results.json
      - name: Check for regressions
        run: python check_regression.py --results results.json --baseline baselines/main.json --threshold 0.02

# check_regression.py
import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--results", required=True)
parser.add_argument("--baseline", required=True)
parser.add_argument("--threshold", type=float, default=0.02)
args = parser.parse_args()

results = json.load(open(args.results))
baseline = json.load(open(args.baseline))
threshold = args.threshold

failures = []
for metric, value in results["metrics"].items():
    base_val = baseline["metrics"].get(metric, 0)
    if base_val - value > threshold:
        failures.append(f"{metric}: {base_val:.3f} -> {value:.3f}")

if failures:
    print("EVAL REGRESSION DETECTED — blocking merge.")
    for f in failures:
        print(f"  FAIL: {f}")
    sys.exit(1)
else:
    print("All metrics within threshold. PASS.")

Eval Pipeline Design

┌──────────────┐     ┌───────────────┐     ┌───────────────┐
│ Eval Dataset │────>│ Run Inference │────>│ Score Outputs │
│   (JSONL)    │     │   (batched)   │     │   (metrics)   │
└──────────────┘     └───────────────┘     └───────┬───────┘
                                                   │
                                      ┌────────────┴────────────┐
                                      │                         │
                               ┌──────▼──────┐           ┌──────▼──────┐
                               │ Regression  │           │  Dashboard  │
                               │ Check (CI)  │           │ (tracking)  │
                               └─────────────┘           └─────────────┘

Key Principles

  1. Determinism first: Set temperature=0 and seed for reproducible runs.
  2. Parallelize inference: Use asyncio.gather to run eval cases concurrently.
  3. Cache responses: Hash (model + prompt + params) to avoid re-running unchanged cases.
  4. Version everything: Dataset, prompts, model config, and baselines belong in git.
  5. Fail fast: Run the golden set (small, fast) on every PR; run the full suite nightly.
The EvalRunner below implements the caching and concurrency principles:

import asyncio
import hashlib
import json
from pathlib import Path

class EvalRunner:
    def __init__(self, client, model: str, cache_path: str = ".eval_cache"):
        self.client = client
        self.model = model
        self.cache = Path(cache_path)
        self.cache.mkdir(exist_ok=True)

    def _cache_key(self, prompt: str, params: dict) -> str:
        blob = json.dumps({"model": self.model, "prompt": prompt, **params}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    async def run_case(self, case: EvalCase, params: dict) -> str:
        key = self._cache_key(case.input, params)
        cache_file = self.cache / f"{key}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text())["response"]

        resp = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": case.input}],
            **params,
        )
        output = resp.choices[0].message.content
        cache_file.write_text(json.dumps({"response": output}))
        return output

    async def run_all(self, cases: list[EvalCase], params: dict, concurrency: int = 10):
        semaphore = asyncio.Semaphore(concurrency)
        async def bounded(case):
            async with semaphore:
                return await self.run_case(case, params)
        return await asyncio.gather(*[bounded(c) for c in cases])

Choosing the Right Metrics

Task              Primary Metric              Secondary
Classification    Exact match, F1             Confusion matrix
Extraction        Field-level exact match     Partial credit
Summarization     LLM-as-judge, ROUGE         Semantic similarity
Code generation   Pass@k (execution)          Exact match
Open QA           LLM-as-judge                F1, semantic sim
RAG               Faithfulness + relevance    Retrieval recall
Chat / dialog     LLM-as-judge (multi-turn)   User preference
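For the pass@k row, the standard unbiased estimator (from the Codex paper, Chen et al. 2021) is worth having on hand: with n generated samples of which c pass their tests, pass@k = 1 - C(n-c, k) / C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    sampled from n generated (c of which are correct) passes its tests."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Generating n > k samples and estimating this way has much lower variance than literally sampling k completions per problem.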

Common Pitfalls

  1. Testing on training data: Ensure eval cases were not in the model's training set.
  2. Single-metric blindness: Always track at least 2 complementary metrics.
  3. Ignoring latency/cost: A 5% accuracy gain that triples cost may not be worth it.
  4. Overfitting to evals: Rotate and expand eval sets regularly.
  5. Non-deterministic baselines: Always pin temperature=0 and seed for baseline runs.


Related Skills

agent-trajectory-testing

Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling. Triggers: "test my AI agent", "agent trajectory evaluation", "tool call testing", "multi-step agent testing", "agent stuck detection", "agent cost regression", "validate agent behavior".


ci-cd-for-ai

Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".


eval-frameworks

Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case. Triggers: "eval framework", "Braintrust setup", "Promptfoo config", "RAGAS evaluation", "DeepEval", "LangSmith evals", "custom eval harness", "which eval tool should I use".


llm-as-judge

Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies. Triggers: "LLM as judge", "use GPT to evaluate outputs", "AI grading AI", "rubric for LLM evaluation", "pairwise comparison", "LLM evaluator", "auto-grade LLM responses".


prompt-testing

Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets. Triggers: "test my prompt", "prompt regression", "A/B test prompts", "prompt versioning", "temperature sensitivity", "golden test set for prompts", "prompt quality assurance".


red-teaming-ai

Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines. Triggers: "red team my AI", "adversarial testing for LLMs", "jailbreak testing", "PII leakage test", "hallucination detection", "AI bias testing", "safety benchmark", "AI security testing".
