llm-eval-fundamentals
Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".
# LLM Evaluation Fundamentals
Build rigorous evaluation pipelines for LLM-powered applications. This skill covers metric selection, dataset construction, baseline tracking, and regression detection.
## Why Evaluation Matters
Vibes-based testing ("it looks good") fails at scale. Without structured evals you cannot:
- Detect regressions when you change prompts, models, or retrieval logic
- Compare model providers or versions objectively
- Gate deployments on quality thresholds
- Justify AI spending to stakeholders
Evaluation is not optional — it is the test suite for AI applications.
## Metric Taxonomy

### 1. Exact Match Metrics

Use when the expected output is deterministic.

```python
def exact_match(predicted: str, expected: str) -> float:
    return 1.0 if predicted.strip() == expected.strip() else 0.0

def normalized_exact_match(predicted: str, expected: str) -> float:
    """Case-insensitive, whitespace-normalized exact match."""
    normalize = lambda s: " ".join(s.lower().split())
    return 1.0 if normalize(predicted) == normalize(expected) else 0.0
```

**Best for:** classification, entity extraction, code generation with known outputs.
### 2. Token-Overlap Metrics

```python
from collections import Counter

def f1_score(predicted: str, expected: str) -> float:
    pred_tokens = predicted.lower().split()
    exp_tokens = expected.lower().split()
    common = Counter(pred_tokens) & Counter(exp_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)
```

**Best for:** QA tasks, rough-pass summarization checks.
### 3. Semantic Similarity

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def semantic_similarity(text_a: str, text_b: str, model="text-embedding-3-small") -> float:
    """Cosine similarity between embeddings."""
    resp = client.embeddings.create(input=[text_a, text_b], model=model)
    a = np.array(resp.data[0].embedding)
    b = np.array(resp.data[1].embedding)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Usage
score = semantic_similarity(
    "The cat sat on the mat",
    "A feline rested on a rug"
)
# Typically 0.85+ for semantically equivalent text
```

**Best for:** open-ended generation, paraphrasing, translation.
### 4. LLM-as-Judge

```python
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()  # llm_judge is awaited, so use the async client

JUDGE_PROMPT = """You are an expert evaluator. Score the following response on a scale of 1-5.

Criteria:
- Accuracy: Does the response contain correct information?
- Completeness: Does it address all parts of the question?
- Clarity: Is it well-organized and easy to understand?

Question: {question}
Expected answer: {expected}
Actual response: {response}

Respond with JSON: {{"accuracy": int, "completeness": int, "clarity": int, "reasoning": str}}"""

async def llm_judge(question: str, expected: str, response: str) -> dict:
    result = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, expected=expected, response=response
        )}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)
```

**Best for:** subjective quality, style, reasoning depth. See llm-as-judge.md for full coverage.
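Per-case judge scores are usually rolled up before they are compared against a baseline. A minimal sketch (the helper name and the flat score format are assumptions, matching the JSON shape the judge prompt above requests):

```python
def aggregate_judge_scores(results: list[dict]) -> dict:
    """Average each 1-5 criterion across eval cases.

    `results` holds dicts shaped like the judge output above,
    e.g. {"accuracy": 5, "completeness": 4, "clarity": 5, "reasoning": "..."}.
    """
    criteria = ("accuracy", "completeness", "clarity")
    return {c: sum(r[c] for r in results) / len(results) for c in criteria}

scores = aggregate_judge_scores([
    {"accuracy": 5, "completeness": 4, "clarity": 5, "reasoning": "..."},
    {"accuracy": 3, "completeness": 4, "clarity": 4, "reasoning": "..."},
])
# scores == {"accuracy": 4.0, "completeness": 4.0, "clarity": 4.5}
```

Averaged criteria slot directly into the baseline-tracking workflow described later.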
## Building Eval Datasets

### Dataset Structure
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalCase:
    id: str                        # Unique case identifier
    input: str                     # The prompt or question
    expected: str                  # Gold-standard answer
    metadata: dict                 # Tags: difficulty, category, source
    context: Optional[str] = None  # Retrieved docs (for RAG evals)

# Store as JSONL for streaming large datasets
import jsonlines

def save_dataset(cases: list[EvalCase], path: str):
    with jsonlines.open(path, mode="w") as writer:
        for case in cases:
            writer.write(vars(case))
```
### Dataset Sources
| Source | Method | Typical size (cases) |
|---|---|---|
| Production logs | Sample real queries, manually label | 50-200 |
| Domain experts | Hand-craft edge cases | 20-50 |
| Synthetic | Use a stronger model to generate Q&A pairs | 100-500 |
| Public benchmarks | MMLU, HumanEval, etc. | 1000+ |
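Sampling production logs works best with stratification, so rare categories are not drowned out by the dominant ones. A sketch, assuming each log record carries a `category` tag (the field name and helper are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(logs: list[dict], per_category: int, seed: int = 0) -> list[dict]:
    """Sample up to `per_category` queries from each category for manual labeling."""
    by_cat = defaultdict(list)
    for rec in logs:
        by_cat[rec["category"]].append(rec)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for cat, recs in sorted(by_cat.items()):
        sample.extend(rng.sample(recs, min(per_category, len(recs))))
    return sample

logs = (
    [{"input": f"q{i}", "category": "legal"} for i in range(100)]
    + [{"input": f"q{i}", "category": "extraction"} for i in range(5)]
)
sample = stratified_sample(logs, per_category=10)
# 10 legal cases plus all 5 extraction cases
```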
### Golden Test Sets
Curate a small (30-100 cases) high-quality "golden set" that:
- Covers all major use-case categories
- Includes known edge cases
- Has expert-verified expected outputs
- Runs fast enough to execute on every PR
```jsonl
# golden_set.jsonl — keep in version control
{"id": "g-001", "input": "Summarize this contract clause...", "expected": "The clause limits liability to...", "metadata": {"category": "legal", "difficulty": "medium"}}
{"id": "g-002", "input": "Extract entities from...", "expected": "[{\"name\": \"Acme\", \"type\": \"ORG\"}]", "metadata": {"category": "extraction", "difficulty": "easy"}}
```
## Establishing Baselines
```python
import json
from pathlib import Path
from datetime import datetime, timezone

class BaselineTracker:
    def __init__(self, path: str = "eval_baselines.json"):
        self.path = Path(path)
        self.baselines = json.loads(self.path.read_text()) if self.path.exists() else {}

    def record(self, run_name: str, metrics: dict):
        self.baselines[run_name] = {
            "metrics": metrics,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        self.path.write_text(json.dumps(self.baselines, indent=2))

    def check_regression(self, current: dict, baseline_name: str, threshold: float = 0.02) -> list[str]:
        """Return metrics that dropped more than `threshold` below baseline.

        Assumes higher is better; invert lower-is-better metrics
        (e.g. latency) before tracking them here.
        """
        baseline = self.baselines[baseline_name]["metrics"]
        regressions = []
        for key, baseline_val in baseline.items():
            current_val = current.get(key, 0)
            if baseline_val - current_val > threshold:
                regressions.append(
                    f"{key}: {baseline_val:.3f} -> {current_val:.3f} "
                    f"(delta: {current_val - baseline_val:+.3f})"
                )
        return regressions

# Usage
tracker = BaselineTracker()
tracker.record("v1.0-gpt4o", {"accuracy": 0.92, "f1": 0.88, "latency_p50": 1.2})

regressions = tracker.check_regression(
    current={"accuracy": 0.89, "f1": 0.87, "latency_p50": 1.5},
    baseline_name="v1.0-gpt4o",
    threshold=0.02
)
if regressions:
    print("REGRESSION DETECTED:")
    for r in regressions:
        print(f"  - {r}")
```
## Regression Detection in CI
```yaml
# .github/workflows/eval.yml
name: LLM Eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - name: Run golden set evals
        run: python run_evals.py --dataset golden_set.jsonl --output results.json
      - name: Check for regressions
        run: python check_regression.py --results results.json --baseline baselines/main.json --threshold 0.02
```
```python
# check_regression.py
# Positional argv parsing assumes the exact flag order used in the
# workflow above: --results <path> --baseline <path> --threshold <val>
import json, sys

results = json.load(open(sys.argv[2]))
baseline = json.load(open(sys.argv[4]))
threshold = float(sys.argv[6])

failures = []
for metric, value in results["metrics"].items():
    base_val = baseline["metrics"].get(metric, 0)
    if base_val - value > threshold:
        failures.append(f"{metric}: {base_val:.3f} -> {value:.3f}")

if failures:
    print("EVAL REGRESSION DETECTED — blocking merge.")
    for f in failures:
        print(f"  FAIL: {f}")
    sys.exit(1)
else:
    print("All metrics within threshold. PASS.")
```
## Eval Pipeline Design

```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Eval Dataset  │────>│ Run Inference │────>│ Score Outputs │
│    (JSONL)    │     │   (batched)   │     │   (metrics)   │
└───────────────┘     └───────────────┘     └───────┬───────┘
                                                    │
                                       ┌────────────┴────────────┐
                                       │                         │
                                ┌──────▼──────┐           ┌──────▼──────┐
                                │ Regression  │           │  Dashboard  │
                                │ Check (CI)  │           │ (tracking)  │
                                └─────────────┘           └─────────────┘
```
### Key Principles

- **Determinism first**: Set `temperature=0` and a fixed `seed` for reproducible runs.
- **Parallelize inference**: Use `asyncio.gather` to run eval cases concurrently.
- **Cache responses**: Hash (model + prompt + params) to avoid re-running unchanged cases.
- **Version everything**: Dataset, prompts, model config, and baselines belong in git.
- **Fail fast**: Run the golden set (small, fast) on every PR; run the full suite nightly.
```python
import asyncio
import hashlib
import json
from pathlib import Path

class EvalRunner:
    def __init__(self, client, model: str, cache_path: str = ".eval_cache"):
        self.client = client
        self.model = model
        self.cache = Path(cache_path)
        self.cache.mkdir(exist_ok=True)

    def _cache_key(self, prompt: str, params: dict) -> str:
        # Hash model + prompt + params so any change invalidates the cache entry
        blob = json.dumps({"model": self.model, "prompt": prompt, **params}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    async def run_case(self, case: EvalCase, params: dict) -> str:
        key = self._cache_key(case.input, params)
        cache_file = self.cache / f"{key}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text())["response"]
        resp = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": case.input}],
            **params,
        )
        output = resp.choices[0].message.content
        cache_file.write_text(json.dumps({"response": output}))
        return output

    async def run_all(self, cases: list[EvalCase], params: dict, concurrency: int = 10):
        semaphore = asyncio.Semaphore(concurrency)

        async def bounded(case):
            async with semaphore:
                return await self.run_case(case, params)

        return await asyncio.gather(*[bounded(c) for c in cases])
```
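The runner returns raw model outputs; a small scoring step turns them into the metric dict that `BaselineTracker.record` expects. A sketch (the helper name is an assumption; any of the metric functions defined earlier can be plugged in):

```python
def score_outputs(outputs: list[str], expected: list[str], metrics: dict) -> dict:
    """Apply each metric to every (output, expected) pair and average.

    `metrics` maps a name to a fn(predicted, expected) -> float,
    e.g. {"exact": normalized_exact_match, "f1": f1_score}.
    """
    n = len(outputs)
    return {
        name: sum(fn(o, e) for o, e in zip(outputs, expected)) / n
        for name, fn in metrics.items()
    }

# Usage with a trivial strict-match metric for illustration
strict = lambda p, e: 1.0 if p == e else 0.0
report = score_outputs(["a", "b", "c"], ["a", "x", "c"], {"exact": strict})
# report["exact"] is about 0.667; feed the dict into BaselineTracker.record(...)
```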
## Choosing the Right Metrics
| Task | Primary Metric | Secondary |
|---|---|---|
| Classification | Exact match, F1 | Confusion matrix |
| Extraction | Field-level exact match | Partial credit |
| Summarization | LLM-as-judge, ROUGE | Semantic similarity |
| Code generation | Pass@k (execution) | Exact match |
| Open QA | LLM-as-judge | F1, semantic sim |
| RAG | Faithfulness + relevance | Retrieval recall |
| Chat / dialog | LLM-as-judge (multi-turn) | User preference |
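The Pass@k entry in the table is typically computed with the unbiased estimator popularized by the HumanEval paper: generate n samples per problem, count the c that pass execution, and estimate the probability that at least one of k draws passes. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: samples generated per problem, c: samples that passed, k: draw budget.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k(10, 3, 1)   # about 0.3: one draw, 3 of 10 samples pass
pass_at_k(10, 10, 5)  # 1.0: every sample passed
```

Averaging `pass_at_k` across problems gives the benchmark-level score.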
## Common Pitfalls

- **Testing on training data**: Ensure eval cases were not in the model's training set.
- **Single-metric blindness**: Always track at least 2 complementary metrics.
- **Ignoring latency/cost**: A 5% accuracy gain that triples cost may not be worth it.
- **Overfitting to evals**: Rotate and expand eval sets regularly.
- **Non-deterministic baselines**: Always pin `temperature=0` and a fixed `seed` for baseline runs.
Install this skill directly: `skilldb add ai-testing-evals-skills`
## Related Skills

- **agent-trajectory-testing**: Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling.
- **ci-cd-for-ai**: Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features.
- **eval-frameworks**: Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case.
- **llm-as-judge**: Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies.
- **prompt-testing**: Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets.
- **red-teaming-ai**: Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines.