
eval-driven-loop

An eval → improve-one-thing → re-eval hill-climbing loop for developing an LLM feature


Eval-Driven Loop

A loop that turns "this prompt feels better" into "this prompt scores 3 points higher and regressed nothing." You freeze a labeled eval set, run the current system over it, score each component, propose exactly ONE change, re-run the SAME evals, and keep the change only if it hill-climbs — score holds or rises, and no category drops past its threshold. If it doesn't, you revert. Repeat until the dev score plateaus while the held-out score tracks it.

The loop was built and run for 14 iterations against a support-ticket classifier+responder (180 dev examples, 60 held-out). The blended dev score moved 71 → 89; held-out moved 70 → 86 (it tracked, so we weren't overfitting). 5 of the 14 proposed changes were reverted by the gate: they raised one category and quietly tanked another.


1. The philosophy

  • The eval set is the source of truth — freeze it. A change is "good" iff the frozen set says so. The instant you tweak the prompt and the eval set in the same iteration, your delta is meaningless. Version the eval set in git; bump its version only between iterations, never during one.
  • ONE change per iteration. A prompt edit, a few-shot example, a tool, a guardrail — one. Batch two and you can't tell which one helped and which one secretly hurt. The whole point is an attributable delta. Two changes = no signal.
  • Score COMPONENTS, never one number. accuracy, format-compliance, safety, latency. A single blended score hides regressions — "format compliance jumped" can mask "the actual answers got worse." You optimize what you can see.
  • Hold out a TEST split you NEVER optimize against. You will overfit the dev set — it's not a moral failing, it's gradient following. The held-out split is the only thing that tells you whether you improved the system or just the eval. If dev climbs and held-out doesn't, you're memorizing.
  • The LLM-as-judge needs its OWN validation. A judge model is a measuring instrument with drift. Spot-check ~20 of its verdicts against human labels every few iterations. If the judge silently raised its bar, your "regression" is the judge, not the system.
  • Beware Goodhart. The metric is a proxy for the goal, not the goal. Keep 3–5 qualitative spot-checks per iteration — if the number went up but the outputs read worse, the metric is being gamed and you've optimized the proxy off a cliff.
  • A hard regression gate, every iteration. Aggregate must not drop AND no category may fall past its threshold. A change that lowers the score didn't happen, however nice the diff looks.
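
One way to keep the dev/held-out boundary frozen is to derive each example's split from a hash of its id, so that adding examples in a later eval version never reshuffles the old ones. The following is a minimal sketch under that assumption; `fnv1a` and `splitFor` are hypothetical helpers that do not appear in the scripts below, and the 1-in-4 ratio simply mirrors the 180/60 split of this run.

```javascript
// FNV-1a 32-bit hash: small, dependency-free, stable across runs and machines.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Roughly 1 in 4 ids land in held-out; the assignment depends only on the id.
function splitFor(id) {
  return fnv1a(String(id)) % 4 === 0 ? "heldout" : "dev";
}

// Hypothetical ids, only to show the call.
const ids = Array.from({ length: 240 }, (_, i) => `ticket-${i}`);
const heldout = ids.filter((id) => splitFor(id) === "heldout");
console.log(`dev=${ids.length - heldout.length} heldout=${heldout.length}`);
```

Because the split is a pure function of the id, an example can never drift from held-out into dev between versions, which is the property the held-out check relies on.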

2. The loop (one iteration)

┌─────────────────────────────────────────────────────────────────┐
│  0. FREEZE    eval set vN (dev split + held-out split), in git   │
│  1. BASELINE  run system over DEV set → component scores         │
│  2. PROPOSE   agent reads failures → ONE change (prompt/shot/    │
│               tool/guardrail), nothing else                      │
│  3. RE-EVAL   run the SAME dev set → new component scores        │
│  4. GATE      aggregate ≥ baseline  AND  no category < threshold │
│       PASS → keep the change, set baseline = new                 │
│       FAIL → REVERT the change (git checkout), record why        │
│  5. HELD-OUT  every K accepts: run held-out → must track dev     │
│  6. JUDGE-CK  every K iters: 20 judge verdicts vs human labels   │
│  7. COMMIT    one commit per accepted change, scores in the body │
└─────────────────────────────────────────────────────────────────┘
                          ▲                              │
                          └──────── run again ───────────┘

3. Component A — the eval runner (frozen set → component scores)

Run the system over a frozen JSONL dataset and return per-category scores, not a single number. Each example carries a gold label and a category tag. The runner pins temperature: 0 so a re-run with the same prompt is as reproducible as the model allows; without that, you can't tell a 2-point move from sampling noise. Whatever run-to-run variation remains is what the noise band in §8 measures.

scripts/eval-run.mjs:

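import fs from "node:fs";
import { runSystem } from "../src/system.mjs";   // the prompt/pipeline under test
import { judge } from "./judge.mjs";             // LLM-as-judge, see §6
// Assumed helper module (not shown in this skill): deterministic checks + percentile.
import { hasRequiredKeys, containsPII, isUnsafe, pct } from "./checks.mjs";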

const SET = process.argv[2] ?? "dev";            // "dev" | "heldout"
const data = fs.readFileSync(`evals/v3/${SET}.jsonl`, "utf8")
  .trim().split("\n").map(JSON.parse);           // {id, input, gold, category}

const COMPONENTS = ["accuracy", "format", "safety"];   // + latency, measured below
const THRESH = { accuracy: 0.80, format: 0.95, safety: 0.99, p95_latency_ms: 4000 };

const rows = [];
for (const ex of data) {
  const t0 = Date.now();
  const out = await runSystem(ex.input);                // the thing we're hill-climbing
  const latency = Date.now() - t0;
  // format + safety are CHEAP deterministic checks — never pay a judge for what a regex
  // can decide. Only the fuzzy "is the answer right" goes to the LLM judge.
  const format = /^\{[\s\S]*\}$/.test(out.raw) && hasRequiredKeys(out) ? 1 : 0;
  const safety = containsPII(out.text) || isUnsafe(out.text) ? 0 : 1;
  const accuracy = await judge(ex.input, out.text, ex.gold);   // 0..1, LLM-as-judge
  rows.push({ id: ex.id, category: ex.category, accuracy, format, safety, latency });
}

// Aggregate per component AND per category — a regression hides in a category.
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
const overall = Object.fromEntries(COMPONENTS.map(c => [c, mean(rows.map(r => r[c]))]));
overall.p95_latency_ms = pct(rows.map(r => r.latency), 95);

const byCategory = {};
for (const cat of new Set(rows.map(r => r.category)))
  byCategory[cat] = mean(rows.filter(r => r.category === cat).map(r => r.accuracy));

// Aggregate = weighted blend, BUT the gate also checks every component & category raw.
const aggregate = 0.6 * overall.accuracy + 0.25 * overall.format + 0.15 * overall.safety;

fs.writeFileSync(`evals/v3/runs/${SET}-${Date.now()}.json`,
  JSON.stringify({ aggregate, overall, byCategory, n: rows.length, rows }, null, 2));
// Print machine-readable JSON (rows included): hill-climb.mjs JSON.parses this stdout.
console.log(JSON.stringify({ aggregate, overall, byCategory, n: rows.length, rows }));

The blended aggregate exists for the headline; the overall components and byCategory map exist so the gate can catch the regression the headline hides. Always persist the per-row rows — when a category drops you need to read the exact examples that flipped, not guess.
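
The runner above calls `pct` and `hasRequiredKeys` without showing them. A minimal sketch of both follows, assuming a nearest-rank percentile and an output contract in which `out.raw` is a JSON object string. The key names in `REQUIRED_KEYS` are hypothetical placeholders, not the schema used in the original run.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function pct(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Assumed output contract: out.raw is a JSON object string carrying these keys.
const REQUIRED_KEYS = ["category", "reply"]; // hypothetical; use your own schema

function hasRequiredKeys(out, keys = REQUIRED_KEYS) {
  try {
    const parsed = JSON.parse(out.raw);
    return parsed !== null && typeof parsed === "object" &&
      keys.every((k) => Object.hasOwn(parsed, k));
  } catch {
    return false; // unparseable output fails the format check
  }
}

console.log(pct([120, 900, 300, 4500, 250], 95));                              // 4500
console.log(hasRequiredKeys({ raw: '{"category":"refund","reply":"..."}' }));  // true
```

With only a handful of samples the p95 is simply the slowest run, which is one more reason to keep the dev set large.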


4. Component B — the hill-climb loop (propose one → re-eval → accept or revert)

An agent reads the failing rows from the baseline run, proposes exactly ONE change, the loop re-runs the same dev set, and the gate decides keep vs revert. The agent never sees the held-out set — that split is sacred.

scripts/hill-climb.mjs:

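import { run, gitSnapshot, gitRevert } from "./util.mjs";
import { agent } from "./orchestrator.mjs";
import { passesGate } from "./gate.mjs";   // §5; the module path is an assumption
// PROPOSE_GUIDE (the contract below), applyDiff, argminCategory, gateReason and
// checkHeldout are assumed to be defined alongside this script; they are not shown here.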

const N_ITERS = 14, K_HELDOUT = 3;   // 14 matches the run reported in this document
let baseline = JSON.parse(run("node scripts/eval-run.mjs dev"));   // §3 output
let consecutiveNoGain = 0, sinceHeldout = 0;

for (let i = 1; i <= N_ITERS; i++) {
  // 1) Snapshot so a rejected change is a clean git revert, not a manual undo.
  const snap = gitSnapshot();

  // 2) Agent proposes ONE change, grounded in the WORST rows (default to attacking the
  //    lowest category, not the lowest single example — fix classes of failure).
  const worst = baseline.rows.filter(r => r.accuracy < 1)
    .sort((a, b) => a.accuracy - b.accuracy).slice(0, 12);
  const change = await agent(PROPOSE_GUIDE, {
    schema: { kind: "prompt|fewshot|tool|guardrail", target: "file:line", diff: "string", rationale: "string" },
    context: { worstRows: worst, lowestCategory: argminCategory(baseline.byCategory) },
    label: `propose:iter${i}`,
  });
  applyDiff(change.diff);   // mutate exactly one artifact (the prompt / a fewshot file / …)

  // 3) Re-run the SAME frozen dev set.
  const next = JSON.parse(run("node scripts/eval-run.mjs dev"));

  // 4) THE GATE (see §5). Keep only if it hill-climbs and regresses nothing.
  if (passesGate(baseline, next)) {
    run(`git add -A && git commit -m "eval: iter${i} ${change.kind} ${change.target} ` +
        `(${baseline.aggregate.toFixed(3)}→${next.aggregate.toFixed(3)})"`);
    const gain = next.aggregate - baseline.aggregate;
    consecutiveNoGain = gain > 0.002 ? 0 : consecutiveNoGain + 1;   // noise band, see §8
    baseline = next;
    if (++sinceHeldout >= K_HELDOUT) { checkHeldout(baseline); sinceHeldout = 0; }
  } else {
    gitRevert(snap);   // a rejected change leaves ZERO trace in the tree
    console.log(`iter${i} REVERTED:`, change.kind, change.target, "—", gateReason(baseline, next));
    consecutiveNoGain++;
  }

  // 5) Plateau → stop (or expand the set, §8). No silent caps: we log every revert/no-gain.
  if (consecutiveNoGain >= 4) { console.log("PLATEAU: stop, or expand the eval set (§8)"); break; }
}
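
The loop passes `argminCategory(baseline.byCategory)` to the agent but never defines it. A minimal sketch is below, together with a hypothetical `failureCounts` helper that groups failing rows by category so the proposer sees classes of failure instead of single examples. The category names and scores are made up for illustration.

```javascript
// Return the category name with the lowest score, or null for an empty map.
function argminCategory(byCategory) {
  let worst = null;
  for (const [cat, score] of Object.entries(byCategory)) {
    if (worst === null || score < byCategory[worst]) worst = cat;
  }
  return worst;
}

// Count failing rows per category (accuracy < 1 is the same filter the loop uses).
function failureCounts(rows) {
  const counts = {};
  for (const r of rows) {
    if (r.accuracy < 1) counts[r.category] = (counts[r.category] ?? 0) + 1;
  }
  return counts;
}

const byCategory = { refund: 0.62, shipping: 0.81, billing: 0.74 }; // made-up scores
console.log(argminCategory(byCategory)); // "refund"
```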

The propose-guide (the contract)

You are hill-climbing a quality score. Propose EXACTLY ONE change.

INPUT:  the 12 worst-scoring eval rows + the lowest-scoring category.
OUTPUT: one change of kind {prompt edit | few-shot example | tool | guardrail}.

HARD RULES:
  - ONE change. Not "tighten the prompt AND add an example." One. A batched change is
    un-attributable and will be rejected by review even if the score goes up.
  - Target a CLASS of failure visible across multiple worst rows, not one example
    (fixing one example = overfitting the eval, which the held-out split will punish).
  - A new few-shot example MUST come from a REAL failure, paraphrased — never copy a
    held-out OR a dev example verbatim into the prompt (that's leaking the test into the
    system; the score becomes a lookup, not a capability).
  - Do NOT touch the eval set, the gold labels, the judge, or the thresholds. Those are
    frozen. You change the SYSTEM, the measurement stays still.
  - State, in `rationale`, which failure class this attacks and why it shouldn't regress
    the others (the gate will check the claim).
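
Part of this contract can be checked mechanically before `applyDiff` runs. The sketch below assumes `change.diff` is a unified diff (the schema only says "string") and that the frozen assets live under `evals/` and in `scripts/judge.mjs`, as the paths in this document suggest. `validateChange` and `touchedFiles` are hypothetical helpers, and "one touched file" is only a proxy for "one change".

```javascript
const KINDS = new Set(["prompt", "fewshot", "tool", "guardrail"]);
const FROZEN = ["evals/", "scripts/judge.mjs"]; // assumed locations of frozen assets

// List the files a unified diff modifies, read from its "+++ b/<path>" headers.
// Heuristic: an added content line that itself begins with "++ " would be miscounted.
function touchedFiles(diff) {
  return diff.split("\n")
    .filter((line) => line.startsWith("+++ "))
    .map((line) => line.slice(4).replace(/^b\//, "").trim());
}

// Returns a list of rule violations; an empty list means the proposal may be applied.
function validateChange(change) {
  const problems = [];
  if (!KINDS.has(change.kind)) problems.push(`unknown kind: ${change.kind}`);
  if (!change.rationale || !change.rationale.trim()) problems.push("missing rationale");
  const files = touchedFiles(change.diff ?? "");
  if (files.length !== 1) problems.push(`expected 1 touched file, got ${files.length}`);
  for (const f of files) {
    if (FROZEN.some((p) => f.startsWith(p))) problems.push(`touches frozen asset: ${f}`);
  }
  return problems;
}

// Hypothetical proposal editing a prompt file.
const proposal = {
  kind: "prompt",
  rationale: "refund tickets are misrouted to billing",
  diff: "--- a/src/prompt.txt\n+++ b/src/prompt.txt\n@@ -1 +1 @@\n-old line\n+new line\n",
};
console.log(validateChange(proposal)); // []
```

A check like this does not replace the gate; it only rejects malformed or out-of-bounds proposals before an eval run is spent on them.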

5. The gate (non-negotiable)

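A change ships iff all of the following hold on the frozen dev set: the aggregate does not drop by more than the noise band, no category falls by more than its allowed drop, and the format, safety, and latency floors hold.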

export function passesGate(base, next, opts = {}) {
  const NOISE = opts.noise ?? 0.002;          // smaller than this = noise, not a win (§8)
  const CAT_DROP = opts.catDrop ?? 0.03;      // any category may not fall more than this
  const HARD = { format: 0.95, safety: 0.99, p95_latency_ms: 4000 };

  if (next.aggregate < base.aggregate - NOISE) return false;          // overall must not drop
  for (const c of Object.keys(next.byCategory))
    if (next.byCategory[c] < (base.byCategory[c] ?? 0) - CAT_DROP) return false;  // no category regression
  if (next.overall.format   < HARD.format)   return false;            // absolute floors
  if (next.overall.safety   < HARD.safety)   return false;            // safety NEVER negotiable
  if (next.overall.p95_latency_ms > HARD.p95_latency_ms) return false;// a "smarter" prompt that doubles latency loses
  return true;
}

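The gate is what makes autonomous proposing safe. Never merge a change that lowers the aggregate or regresses a category, however nice it looks. The classic trap is a change that lifts format from 0.92 → 0.98 while accuracy in one category slides 0.84 → 0.80. With the §3 weights the format gain adds 0.015 to the blend, so the headline still rises unless overall accuracy falls by more than 0.025, and the product gets worse either way. The per-category check kills it. Safety has a hard floor, not a "don't regress": a change that pushes safety below 0.99 is rejected even if it fixes a hundred other rows. On 180 dev examples that floor still admits a single unsafe output (179/180 ≈ 0.994), so raise it to 1.0 if even one must reject the change.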
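
The hill-climb loop logs `gateReason(baseline, next)` on a revert, but that helper is not shown. The sketch below assumes it mirrors the checks in `passesGate` and returns the first failing reason. It then works the blend trap through with made-up numbers: format improves, one category's accuracy slips, and the headline still goes up.

```javascript
const WEIGHTS = { accuracy: 0.6, format: 0.25, safety: 0.15 };       // blend from §3
const HARD = { format: 0.95, safety: 0.99, p95_latency_ms: 4000 };   // floors from the gate

const blend = (o) =>
  WEIGHTS.accuracy * o.accuracy + WEIGHTS.format * o.format + WEIGHTS.safety * o.safety;

// Same checks as passesGate, but returns the first failing reason (null = pass).
function gateReason(base, next, noise = 0.002, catDrop = 0.03) {
  if (next.aggregate < base.aggregate - noise) return "aggregate dropped";
  for (const [cat, score] of Object.entries(next.byCategory)) {
    const before = base.byCategory[cat] ?? 0;
    if (score < before - catDrop) {
      return `category ${cat} fell ${before.toFixed(2)} -> ${score.toFixed(2)}`;
    }
  }
  if (next.overall.format < HARD.format) return "format below floor";
  if (next.overall.safety < HARD.safety) return "safety below floor";
  if (next.overall.p95_latency_ms > HARD.p95_latency_ms) return "p95 latency above ceiling";
  return null;
}

// Made-up runs: two equally sized categories, so overall accuracy is their mean.
const mk = (overall, byCategory) => ({ overall, byCategory, aggregate: blend(overall) });
const base = mk({ accuracy: 0.84, format: 0.95, safety: 1, p95_latency_ms: 2100 },
                { refund: 0.84, shipping: 0.84 });
const next = mk({ accuracy: 0.83, format: 1.0, safety: 1, p95_latency_ms: 2100 },
                { refund: 0.80, shipping: 0.86 });

console.log(next.aggregate > base.aggregate); // true: the blend went up
console.log(gateReason(base, next));          // "category refund fell 0.84 -> 0.80"
```

Returning a reason string instead of a boolean keeps the revert log useful: the next proposal can be steered away from the category that just regressed.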

Hard-won gotchas (each one cost a real iteration)

| Symptom | Cause | Fix |
|---|---|---|
| Dev score climbs every iter, held-out flat/sinking | Overfitting: the agent is fixing eval-specific quirks (or a fewshot leaked from the set) | Run held-out every K accepts; if it diverges, revert back to the last iter where they tracked and widen the dev set |
| A "regression" appears system-wide with no code change | Judge drift: the LLM grader's bar moved (model update, or its own temperature) | Pin the judge model+version; temperature:0; re-score baseline AND next in the SAME run, never compare across days |
| A 2-point move flips between two re-runs of the same prompt | Eval set too small: noise > signal (20 examples ⇒ ±5pts is air) | ≥150 dev examples; treat moves < the noise band (0.002 here, measured by re-running the baseline 3×) as zero |
| Number goes up, outputs read worse to a human | Goodhart: optimized format/length the judge rewards, not answer quality | 3–5 qualitative spot-checks per iter; if they read worse, the metric is gamed, so fix the judge rubric, then re-baseline |
| accuracy up, but safety quietly at 0.97 | Blended aggregate absorbed a safety regression | Safety is a hard FLOOR in the gate, never a "don't drop by X": falling below 0.99 rejects the change outright |
| Rejected changes pile up in the tree | applyDiff mutated files and revert was manual | Snapshot with git BEFORE applying; a failed gate is git revert, leaving zero trace |

6. The judge — and validating it

accuracy is fuzzy, so an LLM grades it against the gold answer. But the judge is an instrument, and instruments drift. Pin it, constrain it, and spot-check it against humans.

// judge.mjs — pinned model, temperature 0, a RUBRIC not a vibe, structured 0/1 verdict.
// `llm` is assumed to be a thin client wrapper that resolves to the model's text reply.
export async function judge(input, candidate, gold) {
  const r = await llm({
    model: "claude-3-5-sonnet-20241022",   // PINNED — never "latest"; a model swap is judge drift
    temperature: 0,
    system: `Grade if CANDIDATE answers INPUT as well as GOLD. Score 1 only if it is
             factually correct AND addresses the ask. Length/politeness do NOT count.
             Return JSON {"score":0|1,"reason":"..."} only.`,
    messages: [{ role: "user", content: `INPUT:\n${input}\n\nGOLD:\n${gold}\n\nCANDIDATE:\n${candidate}` }],
  });
  return JSON.parse(r).score;
}

Validate the judge every ~4 iterations: pull 20 graded rows, have a human label the same 20 blind, and compute agreement. If judge↔human agreement drops below ~0.85, the judge — not the system — moved; fix the rubric and re-baseline everything, because every prior score was measured with a different ruler.
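
The agreement check can be as simple as the fraction of audited rows where the judge's 0/1 verdict matches the human's blind label. The following is a minimal sketch; `agreement` and `disagreementBreakdown` are hypothetical helpers (the `judge-audit.mjs` script referenced in §9 is not shown in this skill), and the audit data is made up.

```javascript
// Fraction of rows where the judge's 0/1 verdict equals the human's blind label.
function agreement(pairs) {
  if (pairs.length === 0) throw new Error("no audited rows");
  const matches = pairs.filter((p) => p.judge === p.human).length;
  return matches / pairs.length;
}

// Split disagreements by direction: a lenient judge inflates the score,
// a strict one shows up as a phantom regression.
function disagreementBreakdown(pairs) {
  return {
    judgeLenient: pairs.filter((p) => p.judge === 1 && p.human === 0).length,
    judgeStrict: pairs.filter((p) => p.judge === 0 && p.human === 1).length,
  };
}

const MIN_AGREEMENT = 0.85; // threshold quoted in the text

// Made-up audit of 20 rows: 17 agree, 2 lenient, 1 strict.
const audit = [
  ...Array.from({ length: 12 }, () => ({ judge: 1, human: 1 })),
  ...Array.from({ length: 5 }, () => ({ judge: 0, human: 0 })),
  ...Array.from({ length: 2 }, () => ({ judge: 1, human: 0 })),
  { judge: 0, human: 1 },
];
const rate = agreement(audit);
console.log(rate, rate >= MIN_AGREEMENT ? "judge OK" : "re-baseline");
console.log(disagreementBreakdown(audit)); // { judgeLenient: 2, judgeStrict: 1 }
```

With only 20 rows each disagreement moves the rate by 0.05, so a single audit sitting at the threshold is weak evidence; the direction of the disagreements is often more informative than the rate.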


7. Commit & cadence

  • One commit per ACCEPTED change, with the before→after aggregate and the changed category in the body. Each commit is a verified hill-climb step — bisectable, revertible, and a complete audit trail of which change bought which point.
  • Reverted changes leave no commit — only a log line (iterN REVERTED: <kind> <target> — <reason>). The git history is the list of things that actually worked.
  • Re-baseline on any frozen-asset change (judge model, threshold, eval set version). You cannot compare scores measured with two different rulers — bump evals/vN and start the rounds table fresh.

8. When to stop (convergence)

Two signals, read together: the dev score plateaus AND the held-out score tracks it (if held-out diverges below dev, you're overfitting and should stop earlier than the plateau). From this run:

| Iter | Change | Dev aggregate | Held-out aggregate | Note |
|---|---|---|---|---|
| 0 | (baseline) | 0.712 | 0.701 | |
| 2 | +fewshot (refund category) | 0.758 | 0.749 | tracks |
| 4 | guardrail: PII redact | 0.770 | 0.762 | safety floor held |
| 5 | prompt: tighten classifier | reverted | | accuracy +, format −0.05 → gate kill |
| 7 | tool: order-status lookup | 0.831 | 0.820 | tracks |
| 9 | +fewshot (verbatim dev row) | 0.858 | 0.819 | held-out STALLED → overfit, reverted |
| 11 | prompt: structured output | 0.881 | 0.861 | tracks |
| 13 | +fewshot (edge case) | 0.886 | 0.864 | +0.005, near noise |
| 14 | prompt: tone | 0.889 | 0.863 | < noise band → no-gain #4 |

Iter 9 is the lesson: dev jumped, held-out didn't move. A few-shot lifted verbatim from a dev row turned the score into a lookup, which is exactly what the propose-guide forbids. The held-out split caught it; we reverted.
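
"Tracks" can be made concrete by comparing gains between two held-out checkpoints. The sketch below is one possible rule, not the `checkHeldout` used in the run (which is not shown): held-out must capture at least half of any dev gain larger than the noise band. The `minShare` value of 0.5 is an assumption; the checkpoint numbers are copied from the table above.

```javascript
// Compare two checkpoints, each { dev, heldout } aggregates.
// "Tracks" here means held-out captured at least `minShare` of a real dev gain.
function tracks(prev, curr, { noise = 0.002, minShare = 0.5 } = {}) {
  const devGain = curr.dev - prev.dev;
  const heldoutGain = curr.heldout - prev.heldout;
  if (devGain <= noise) return true;             // no real dev gain to explain
  return heldoutGain >= minShare * devGain;
}

// Checkpoints from the rounds table.
const iter7 = { dev: 0.831, heldout: 0.820 };
const iter9 = { dev: 0.858, heldout: 0.819 };
const iter11 = { dev: 0.881, heldout: 0.861 };

console.log(tracks(iter7, iter9));   // false: dev +0.027, held-out -0.001
console.log(tracks(iter7, iter11));  // true:  dev +0.050, held-out +0.041
```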

After 4 consecutive no-gain iterations (moves inside the 0.002 noise band, measured by re-running the baseline 3× and taking the spread), the broad hill-climb is tapped out. Then don't keep grinding — EXPAND the eval set. Mine production failures the current set doesn't cover, add 30–50 examples, bump to evals/v(N+1), re-baseline, and resume. A plateau on a stale eval set means you've climbed the hill you can see — a bigger set surfaces a new one. Plateau ≠ done; plateau = the eval set has stopped teaching you.
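
The noise band and the plateau rule described above can be sketched as two small functions. This is an illustration under stated assumptions: the band is taken as max minus min over repeated baseline runs, reverted iterations are recorded as a gain of 0, and all numbers below are made up.

```javascript
// Noise band = spread (max - min) of the aggregate across repeated baseline runs.
function noiseBand(aggregates) {
  if (aggregates.length < 2) throw new Error("need at least two baseline runs");
  return Math.max(...aggregates) - Math.min(...aggregates);
}

// Count trailing iterations whose gain stayed inside the band (reverts count as gain 0).
function trailingNoGain(gains, band) {
  let n = 0;
  for (let i = gains.length - 1; i >= 0 && gains[i] <= band; i--) n++;
  return n;
}

const band = noiseBand([0.7120, 0.7131, 0.7112]);    // made-up repeat runs of one baseline
const gains = [0.046, 0.012, 0, 0.001, 0, 0.0015];   // made-up per-iteration gains
console.log(band.toFixed(4), trailingNoGain(gains, band) >= 4 ? "plateau" : "keep going");
```

Measuring the band from repeat runs, instead of hard-coding it, keeps the plateau rule honest when the eval set or the judge changes and the amount of run-to-run variation changes with it.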


9. How to re-run it

# 1. Baseline the current system on the FROZEN dev set.
node scripts/eval-run.mjs dev           # → evals/v3/runs/dev-*.json  (component scores)

# 2. Run the hill-climb loop (propose ONE change → re-eval → gate → keep/revert).
node scripts/hill-climb.mjs             # commits accepts, logs+reverts rejects

# 3. Check the held-out split tracks (run it yourself; the loop never optimizes it).
node scripts/eval-run.mjs heldout       # must track dev, or you've overfit

# 4. Validate the judge against humans (every ~4 iters).
node scripts/judge-audit.mjs --n 20     # judge↔human agreement ≥ 0.85, else re-baseline

# 5. Plateaued? Grow the set and bump the version, don't grind a dead eval.
#    (eval-run.mjs reads evals/v3, so point it at evals/v4 and re-baseline afterwards.)
node scripts/mine-failures.mjs --from prod-logs --n 40 --out evals/v4/dev.jsonl

10. Why it works

  • The frozen eval makes "better" a measurement, not a vibe — every change is judged by the same ruler, so the delta means something instead of being post-hoc rationalization.
  • One change per iteration makes the delta attributable — when the score moves, you know exactly which change moved it, so you keep what works and revert what doesn't with certainty, not hunches.
  • Component scores make regressions visible — the per-category gate catches the change that lifts the headline while quietly degrading the product; a single number would have shipped it.
  • The held-out split makes overfitting detectable — it's the only signal that separates "improved the system" from "memorized the eval," and it's the thing that told us iter 9 was a fraud.
  • The gate makes autonomous proposing safe — an agent can propose freely because nothing that lowers the score or breaks a category floor can survive to commit. A change that doesn't pass the gate didn't happen.

Generated from a support-ticket classifier+responder eval loop (14 iterations, 180 dev / 60 held-out, dev 0.71 → 0.89 tracked by held-out 0.70 → 0.86, 5 changes gate-reverted). Reusable for any prompt / pipeline / agent where quality must be measured, not vibed.
