eval-driven-loop

An eval → improve-one-thing → re-eval hill-climbing loop for developing an LLM feature

Eval-Driven Loop

A loop that turns "this prompt feels better" into "this prompt scores 3 points higher and regressed nothing." You freeze a labeled eval set, run the current system over it, score each component, propose exactly ONE change, re-run the SAME evals, and keep the change only if it hill-climbs — score holds or rises, and no category drops past its threshold. If it doesn't, you revert. Repeat until the dev score plateaus while the held-out score tracks it.

It was built and run for 14 iterations against a support-ticket classifier+responder (180 dev examples, 60 held-out). The blended dev score moved 71 → 89; held-out moved 70 → 86 (it tracked, so we weren't overfitting). 5 of the 14 proposed changes were reverted by the gate: they raised one category and quietly tanked another.

1. The philosophy

- **The eval set is the source of truth: freeze it.** A change is "good" iff the frozen set says so. The instant you tweak the prompt and the eval set in the same iteration, your delta is meaningless. Version the eval set in git; bump its version only between iterations, never during one.
- **ONE change per iteration.** A prompt edit, a few-shot example, a tool, a guardrail: one. Batch two and you can't tell which one helped and which one secretly hurt. The whole point is an attributable delta. Two changes = no signal.
- **Score COMPONENTS, never one number.** Accuracy, format-compliance, safety, latency. A single blended score hides regressions: "format compliance jumped" can mask "the actual answers got worse." You optimize what you can see.
- **Hold out a TEST split you NEVER optimize against.** You will overfit the dev set; it's not a moral failing, it's gradient following. The held-out split is the only thing that tells you whether you improved the system or just the eval. If dev climbs and held-out doesn't, you're memorizing.
- **The LLM-as-judge needs its OWN validation.** A judge model is a measuring instrument with drift. Spot-check ~20 of its verdicts against human labels every few iterations. If the judge silently raised its bar, your "regression" is the judge, not the system.
- **Beware Goodhart.** The metric is a proxy for the goal, not the goal. Keep 3–5 qualitative spot-checks per iteration; if the number went up but the outputs read worse, the metric is being gamed and you've optimized the proxy off a cliff.
- **A hard regression gate, every iteration.** Aggregate must not drop AND no category may fall past its threshold. A change that lowers the score didn't happen, however nice the diff looks.

2. The loop (one iteration)

┌─────────────────────────────────────────────────────────────────┐
│  0. FREEZE    eval set vN (dev split + held-out split), in git   │
│  1. BASELINE  run system over DEV set → component scores         │
│  2. PROPOSE   agent reads failures → ONE change (prompt/shot/    │
│               tool/guardrail), nothing else                      │
│  3. RE-EVAL   run the SAME dev set → new component scores        │
│  4. GATE      aggregate ≥ baseline  AND  no category < threshold │
│       PASS → keep the change, set baseline = new                 │
│       FAIL → REVERT the change (git checkout), record why        │
│  5. HELD-OUT  every K accepts: run held-out → must track dev     │
│  6. JUDGE-CK  every K iters: 20 judge verdicts vs human labels   │
│  7. COMMIT    one commit per accepted change, scores in the body │
└─────────────────────────────────────────────────────────────────┘
                          ▲                              │
                          └──────── run again ───────────┘

3. Component A — the eval runner (frozen set → component scores)

Run the system over a frozen JSONL dataset and return per-category scores, not a single number. Each example carries a gold label and a category tag. The runner is deterministic on the dataset (temperature: 0) so a re-run with the same prompt is reproducible — without that, you can't tell a 2-point move from sampling noise.

scripts/eval-run.mjs:

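import fs from "node:fs";
import { runSystem } from "../src/system.mjs";   // the prompt/pipeline under test
import { judge } from "./judge.mjs";             // LLM-as-judge, see §6
// Assumed helper module (not shown in this doc): deterministic checks + percentile.
import { hasRequiredKeys, containsPII, isUnsafe, pct } from "./checks.mjs";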

const SET = process.argv[2] ?? "dev";            // "dev" | "heldout"
const data = fs.readFileSync(`evals/v3/${SET}.jsonl`, "utf8")
  .trim().split("\n").map(JSON.parse);           // {id, input, gold, category}

const COMPONENTS = ["accuracy", "format", "safety"];   // + latency, measured below
const THRESH = { accuracy: 0.80, format: 0.95, safety: 0.99, p95_latency_ms: 4000 };

const rows = [];
for (const ex of data) {
  const t0 = Date.now();
  const out = await runSystem(ex.input);                // the thing we're hill-climbing
  const latency = Date.now() - t0;
  // format + safety are CHEAP deterministic checks — never pay a judge for what a regex
  // can decide. Only the fuzzy "is the answer right" goes to the LLM judge.
  const format = /^\{[\s\S]*\}$/.test(out.raw) && hasRequiredKeys(out) ? 1 : 0;
  const safety = containsPII(out.text) || isUnsafe(out.text) ? 0 : 1;
  const accuracy = await judge(ex.input, out.text, ex.gold);   // 0..1, LLM-as-judge
  rows.push({ id: ex.id, category: ex.category, accuracy, format, safety, latency });
}

// Aggregate per component AND per category — a regression hides in a category.
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
const overall = Object.fromEntries(COMPONENTS.map(c => [c, mean(rows.map(r => r[c]))]));
overall.p95_latency_ms = pct(rows.map(r => r.latency), 95);

const byCategory = {};
for (const cat of new Set(rows.map(r => r.category)))
  byCategory[cat] = mean(rows.filter(r => r.category === cat).map(r => r.accuracy));

// Aggregate = weighted blend, BUT the gate also checks every component & category raw.
const aggregate = 0.6 * overall.accuracy + 0.25 * overall.format + 0.15 * overall.safety;

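const result = { aggregate, overall, byCategory, n: rows.length, rows };
fs.mkdirSync("evals/v3/runs", { recursive: true });     // first run: the directory may not exist
fs.writeFileSync(`evals/v3/runs/${SET}-${Date.now()}.json`, JSON.stringify(result, null, 2));
// stdout must be valid JSON with numeric scores and the rows: hill-climb.mjs (§4)
// feeds it to JSON.parse and reads .rows and .aggregate.toFixed().
console.log(JSON.stringify(result));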

The blended aggregate exists for the headline; the overall components and byCategory map exist so the gate can catch the regression the headline hides. Always persist the per-row rows — when a category drops you need to read the exact examples that flipped, not guess.
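The runner calls `hasRequiredKeys`, `containsPII`, and `pct` without defining them. Below is a minimal sketch of what such helpers might look like. Everything in it is an assumption: the required keys (`category`, `reply`), the two PII patterns, and the nearest-rank percentile definition are illustrative choices, not the original implementation, and `isUnsafe` is left out because its policy is not described.

```javascript
// Hypothetical helpers for eval-run.mjs. The names match the calls in the
// runner; the bodies are assumptions made for illustration.

// Assumed output contract: a classifier+responder plausibly returns these keys.
const REQUIRED_KEYS = ["category", "reply"];

// True if out.raw parses as a JSON object that carries every required key.
function hasRequiredKeys(out, required = REQUIRED_KEYS) {
  let parsed;
  try {
    parsed = JSON.parse(out.raw);
  } catch {
    return false; // not valid JSON at all
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) return false;
  return required.every((k) => Object.prototype.hasOwnProperty.call(parsed, k));
}

// Deliberately crude patterns (email, 555-123-4567 style phone); a real
// deployment would need a proper PII detector.
const PII_PATTERNS = [
  /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/,
];

function containsPII(text) {
  return PII_PATTERNS.some((re) => re.test(text));
}

// Nearest-rank percentile: the smallest value with at least p% of the data
// at or below it. Returns NaN for an empty input.
function pct(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}
```

Nearest-rank is used here because it always returns an observed latency; with only a few rows the p95 is simply the slowest one, which is worth remembering when reading the latency floor on a small eval set.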

4. Component B — the hill-climb loop (propose one → re-eval → accept or revert)

An agent reads the failing rows from the baseline run, proposes exactly ONE change, the loop re-runs the same dev set, and the gate decides keep vs revert. The agent never sees the held-out set — that split is sacred.

scripts/hill-climb.mjs:

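import { run, gitSnapshot, gitRevert } from "./util.mjs";
import { agent } from "./orchestrator.mjs";
// Assumed locations for names the loop uses but this doc does not place in a file:
import { passesGate, gateReason } from "./gate.mjs";                    // §5
import { applyDiff, argminCategory, checkHeldout } from "./util.mjs";
import { PROPOSE_GUIDE } from "./propose-guide.mjs";                    // the contract below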

const N_ITERS = 14, K_HELDOUT = 3;   // 14 matches the run reported in §8
let baseline = JSON.parse(run("node scripts/eval-run.mjs dev"));   // §3 output
let consecutiveNoGain = 0, sinceHeldout = 0;

for (let i = 1; i <= N_ITERS; i++) {
  // 1) Snapshot so a rejected change is a clean git revert, not a manual undo.
  const snap = gitSnapshot();

  // 2) Agent proposes ONE change, grounded in the WORST rows (default to attacking the
  //    lowest category, not the lowest single example — fix classes of failure).
  const worst = baseline.rows.filter(r => r.accuracy < 1)
    .sort((a, b) => a.accuracy - b.accuracy).slice(0, 12);
  const change = await agent(PROPOSE_GUIDE, {
    schema: { kind: "prompt|fewshot|tool|guardrail", target: "file:line", diff: "string", rationale: "string" },
    context: { worstRows: worst, lowestCategory: argminCategory(baseline.byCategory) },
    label: `propose:iter${i}`,
  });
  applyDiff(change.diff);   // mutate exactly one artifact (the prompt / a fewshot file / …)

  // 3) Re-run the SAME frozen dev set.
  const next = JSON.parse(run("node scripts/eval-run.mjs dev"));

  // 4) THE GATE (see §5). Keep only if it hill-climbs and regresses nothing.
  if (passesGate(baseline, next)) {
    run(`git add -A && git commit -m "eval: iter${i} ${change.kind} ${change.target} ` +
        `(${baseline.aggregate.toFixed(3)}→${next.aggregate.toFixed(3)})"`);
    const gain = next.aggregate - baseline.aggregate;
    consecutiveNoGain = gain > 0.002 ? 0 : consecutiveNoGain + 1;   // noise band, see §8
    baseline = next;
    if (++sinceHeldout >= K_HELDOUT) { checkHeldout(baseline); sinceHeldout = 0; }
  } else {
    gitRevert(snap);   // a rejected change leaves ZERO trace in the tree
    console.log(`iter${i} REVERTED:`, change.kind, change.target, "—", gateReason(baseline, next));
    consecutiveNoGain++;
  }

  // 5) Plateau → stop (or expand the set, §8). No silent caps: we log every revert/no-gain.
  if (consecutiveNoGain >= 4) { console.log("PLATEAU — stopping or expand eval set"); break; }
}
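The loop calls `argminCategory` without defining it, and its comment says to attack the lowest category while `worst` is drawn from all rows. A minimal sketch of both pieces follows; `worstRowsIn` is a hypothetical name introduced here, and the row and `byCategory` shapes are the ones the eval runner writes.

```javascript
// Hypothetical helper: name of the category with the lowest mean accuracy.
// Returns null when there are no categories.
function argminCategory(byCategory) {
  const entries = Object.entries(byCategory);
  if (entries.length === 0) return null;
  return entries.reduce((lo, cur) => (cur[1] < lo[1] ? cur : lo))[0];
}

// Hypothetical helper: the worst rows inside ONE category, so the agent is
// shown a class of failure rather than unrelated one-off misses.
function worstRowsIn(rows, category, limit = 12) {
  return rows
    .filter((r) => r.category === category && r.accuracy < 1)
    .sort((a, b) => a.accuracy - b.accuracy)
    .slice(0, limit);
}
```

Passing `worstRowsIn(baseline.rows, argminCategory(baseline.byCategory))` as `worstRows` would be one way to make the context match the comment; whether that beats the global worst-12 is a judgment call this sketch does not settle.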

The propose-guide (the contract)

You are hill-climbing a quality score. Propose EXACTLY ONE change.

INPUT:  the 12 worst-scoring eval rows + the lowest-scoring category.
OUTPUT: one change of kind {prompt edit | few-shot example | tool | guardrail}.

HARD RULES:
  - ONE change. Not "tighten the prompt AND add an example." One. A batched change is
    un-attributable and will be rejected by review even if the score goes up.
  - Target a CLASS of failure visible across multiple worst rows, not one example
    (fixing one example = overfitting the eval, which the held-out split will punish).
  - A new few-shot example MUST come from a REAL failure, paraphrased — never copy a
    held-out OR a dev example verbatim into the prompt (that's leaking the test into the
    system; the score becomes a lookup, not a capability).
  - Do NOT touch the eval set, the gold labels, the judge, or the thresholds. Those are
    frozen. You change the SYSTEM, the measurement stays still.
  - State, in `rationale`, which failure class this attacks and why it shouldn't regress
    the others (the gate will check the claim).
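The hard rules above can be partly enforced in code before the diff is ever applied. The sketch below is one possible pre-check, under several assumptions: `change.diff` is a git-style unified diff (files appear on `+++ b/<path>` lines), frozen assets live under `evals/` and in `scripts/judge.mjs`, and `validateChange` itself is a hypothetical name. It cannot prove a diff contains only one logical change; it only rejects the obvious violations.

```javascript
// Allowed change kinds, taken from the schema in the loop.
const KINDS = new Set(["prompt", "fewshot", "tool", "guardrail"]);

// Assumed repo layout: paths the agent must never modify.
const FROZEN_PREFIXES = ["evals/", "scripts/judge.mjs"];

// Returns a list of problems; an empty list means the proposal may be applied.
function validateChange(change) {
  const problems = [];
  if (!KINDS.has(change.kind)) problems.push(`unknown kind: ${change.kind}`);
  if (!change.rationale || change.rationale.trim() === "") problems.push("missing rationale");

  // Collect the files a unified diff touches from its "+++ b/<path>" lines.
  const files = [...String(change.diff ?? "").matchAll(/^\+\+\+ b\/(.+)$/gm)].map((m) => m[1]);
  if (files.length !== 1) problems.push(`expected exactly 1 file in diff, got ${files.length}`);

  for (const f of files) {
    if (FROZEN_PREFIXES.some((p) => f.startsWith(p))) problems.push(`touches frozen asset: ${f}`);
  }
  return problems;
}
```

A proposal with a non-empty problem list could be logged and counted as a no-gain iteration without spending an eval run on it.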

5. The gate (non-negotiable)

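A change ships iff all of these hold, on the frozen dev set: the aggregate does not drop, no category regresses past its allowance, and the format, safety, and latency floors are met.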

export function passesGate(base, next, opts = {}) {
  const NOISE = opts.noise ?? 0.002;          // smaller than this = noise, not a win (§8)
  const CAT_DROP = opts.catDrop ?? 0.03;      // any category may not fall more than this
  const HARD = { format: 0.95, safety: 0.99, p95_latency_ms: 4000 };

  if (next.aggregate < base.aggregate - NOISE) return false;          // overall must not drop
  for (const c of Object.keys(next.byCategory))
    if (next.byCategory[c] < (base.byCategory[c] ?? 0) - CAT_DROP) return false;  // no category regression
  if (next.overall.format   < HARD.format)   return false;            // absolute floors
  if (next.overall.safety   < HARD.safety)   return false;            // safety NEVER negotiable
  if (next.overall.p95_latency_ms > HARD.p95_latency_ms) return false;// a "smarter" prompt that doubles latency loses
  return true;
}

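The gate is what makes autonomous proposing safe. Never merge a change that lowers the aggregate or regresses a category, however nice it looks. The classic trap is a change that lifts format from 0.92 → 0.98 while accuracy slides 0.84 → 0.82: with the 0.6/0.25/0.15 weights from §3 the blend rises by about 0.003 (+0.015 from format, −0.012 from accuracy) while the product gets worse. The per-category check kills it as soon as that accuracy loss lands on one category as a drop of more than 0.03. Safety has a hard floor, not a "don't regress". Note what the 0.99 floor means at n = 180: one unsafe output scores 179/180 ≈ 0.994 and still clears it, so set the floor to 1.0 if a single unsafe output must reject a change even when it fixes a hundred others.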
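The loop logs `gateReason(baseline, next)` on a revert, but that function is not shown. A minimal sketch, assuming it mirrors the checks and default thresholds of `passesGate` and reports every failed condition rather than stopping at the first; the message format and the sample numbers are illustrative, not taken from the run.

```javascript
// Hypothetical companion to passesGate: same conditions, same defaults,
// but it collects a human-readable reason for each one that fails.
function gateReason(base, next, opts = {}) {
  const NOISE = opts.noise ?? 0.002;
  const CAT_DROP = opts.catDrop ?? 0.03;
  const HARD = { format: 0.95, safety: 0.99, p95_latency_ms: 4000 };
  const reasons = [];

  if (next.aggregate < base.aggregate - NOISE)
    reasons.push(`aggregate ${base.aggregate.toFixed(3)} -> ${next.aggregate.toFixed(3)}`);

  for (const [cat, score] of Object.entries(next.byCategory)) {
    const before = base.byCategory[cat] ?? 0;
    if (score < before - CAT_DROP)
      reasons.push(`category ${cat} ${before.toFixed(2)} -> ${score.toFixed(2)}`);
  }

  if (next.overall.format < HARD.format)
    reasons.push(`format ${next.overall.format} below floor ${HARD.format}`);
  if (next.overall.safety < HARD.safety)
    reasons.push(`safety ${next.overall.safety} below floor ${HARD.safety}`);
  if (next.overall.p95_latency_ms > HARD.p95_latency_ms)
    reasons.push(`p95 latency ${next.overall.p95_latency_ms}ms above ${HARD.p95_latency_ms}ms`);

  return reasons.length > 0 ? reasons.join("; ") : "passed";
}

// Illustrative scores: the blend creeps up while one category collapses.
const base = {
  aggregate: 0.830,
  byCategory: { refund: 0.84, billing: 0.84 },
  overall: { format: 0.96, safety: 1, p95_latency_ms: 3000 },
};
const next = {
  aggregate: 0.833,
  byCategory: { refund: 0.76, billing: 0.88 },
  overall: { format: 0.98, safety: 1, p95_latency_ms: 3000 },
};
const reason = gateReason(base, next);
console.log(reason);
```

Duplicating the thresholds in two functions invites drift; in a real repo the two would more likely share one constants object.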

Hard-won gotchas (each one cost a real iteration)

Symptom	Cause	Fix
Dev score climbs every iter, held-out flat/sinking	Overfitting — agent is fixing eval-specific quirks (or a fewshot leaked from the set)	Run held-out every K accepts; if it diverges, revert back to the last iter where they tracked and widen the dev set
A "regression" appears system-wide with no code change	Judge drift — the LLM grader's bar moved (model update, or its own temperature)	Pin the judge model+version; `temperature:0`; re-score baseline AND next in the SAME run, never compare across days
A 2-point move flips between two re-runs of the same prompt	Eval set too small — noise > signal (20 examples ⇒ ±5pts is air)	≥150 dev examples; treat moves < the noise band (`0.002` here, measured by re-running the baseline 3×) as zero
Number goes up, outputs read worse to a human	Goodhart — optimized format/length the judge rewards, not answer quality	3–5 qualitative spot-checks per iter; if they read worse, the metric is gamed — fix the judge rubric, then re-baseline
`accuracy` up, but `safety` quietly at 0.97	Blended aggregate absorbed a safety regression	Safety is a hard FLOOR in the gate, never a "don't drop by X"; at 0.99 over 180 examples the second unsafe output rejects the change (set the floor to 1.0 to reject on the first)
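Rejected changes pile up in the tree	`applyDiff` mutated files and revert was manual	Snapshot with git BEFORE applying; a failed gate restores the snapshot (`git checkout` or `git reset --hard` on the uncommitted change, not `git revert`, which adds a commit), leaving zero trace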
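The noise-band row above describes measuring the band by re-running the baseline three times. A minimal sketch of that measurement, assuming the band is taken as the max-minus-min spread of the aggregate across the repeated runs; `noiseBand` and `isRealGain` are hypothetical names and the numbers in the usage note are made up for illustration.

```javascript
// Spread of the aggregate across repeated runs of the SAME system on the
// SAME frozen set. Anything inside this spread is indistinguishable from noise.
function noiseBand(aggregates) {
  if (aggregates.length < 2) throw new Error("need at least 2 repeated runs");
  return Math.max(...aggregates) - Math.min(...aggregates);
}

// A move counts as a gain only when it clears the measured band.
function isRealGain(baseAggregate, nextAggregate, band) {
  return nextAggregate - baseAggregate > band;
}
```

For example, three baseline runs at 0.712, 0.713, and 0.711 give a band of about 0.002, so a later move from 0.800 to 0.801 would be treated as zero while 0.800 to 0.810 would count.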

6. The judge — and validating it

`accuracy` is fuzzy, so an LLM grades it against the gold answer. But the judge is an instrument, and instruments drift. Pin it, constrain it, and spot-check it against humans.

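// judge.mjs: pinned model, temperature 0, a RUBRIC not a vibe, structured 0/1 verdict.
import { llm } from "./llm.mjs";   // assumed thin client wrapper that resolves to the model's text reply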
export async function judge(input, candidate, gold) {
  const r = await llm({
    model: "claude-3-5-sonnet-20241022",   // PINNED — never "latest"; a model swap is judge drift
    temperature: 0,
    system: `Grade if CANDIDATE answers INPUT as well as GOLD. Score 1 only if it is
             factually correct AND addresses the ask. Length/politeness do NOT count.
             Return JSON {"score":0|1,"reason":"..."} only.`,
    messages: [{ role: "user", content: `INPUT:\n${input}\n\nGOLD:\n${gold}\n\nCANDIDATE:\n${candidate}` }],
  });
  return JSON.parse(r).score;
}

Validate the judge every ~4 iterations: pull 20 graded rows, have a human label the same 20 blind, and compute agreement. If judge↔human agreement drops below ~0.85, the judge — not the system — moved; fix the rubric and re-baseline everything, because every prior score was measured with a different ruler.
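The agreement figure in the audit above is simple enough to sketch. The version below assumes each audited row is reduced to a pair of binary verdicts, `{ judge, human }`, and uses raw agreement (matching verdicts over total); both function names are hypothetical, and the 0.85 default is the threshold quoted in the text.

```javascript
// Fraction of audited rows where the judge and the human gave the same verdict.
function judgeAgreement(pairs) {
  if (pairs.length === 0) throw new Error("no verdicts to compare");
  const matches = pairs.filter((p) => p.judge === p.human).length;
  return matches / pairs.length;
}

// True while the judge still agrees with humans often enough to be trusted.
function judgeHealthy(pairs, minAgreement = 0.85) {
  return judgeAgreement(pairs) >= minAgreement;
}
```

With 20 audited rows each disagreement moves the figure by 0.05, so 17 of 20 matching sits exactly on the threshold and 16 of 20 falls below it. Raw agreement also ignores chance agreement when nearly all verdicts are 1, which is one reason to treat the number as a tripwire rather than a precise measurement.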

7. Commit & cadence

One commit per ACCEPTED change, with the before→after aggregate and the changed category in the body. Each commit is a verified hill-climb step — bisectable, revertible, and a complete audit trail of which change bought which point.
Reverted changes leave no commit — only a log line (iterN REVERTED: <kind> <target> — <reason>). The git history is the list of things that actually worked.
Re-baseline on any frozen-asset change (judge model, threshold, eval set version). You cannot compare scores measured with two different rulers — bump evals/vN and start the rounds table fresh.
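One way to get the before→after aggregate and the changed categories into the commit body is to build the message as a string first. This is a sketch, not the loop's actual code: `commitMessage` is a hypothetical helper, the 0.005 reporting cutoff is an arbitrary assumption, and the score objects are assumed to have the shape the eval runner writes.

```javascript
// Builds a commit message: subject line with the aggregate move, then a body
// listing every category whose accuracy moved by at least `minDelta`.
function commitMessage(iter, change, base, next, minDelta = 0.005) {
  const subject =
    `eval: iter${iter} ${change.kind} ${change.target} ` +
    `(${base.aggregate.toFixed(3)} -> ${next.aggregate.toFixed(3)})`;

  const moved = Object.keys(next.byCategory)
    .map((c) => ({ c, delta: next.byCategory[c] - (base.byCategory[c] ?? 0) }))
    .filter(({ delta }) => Math.abs(delta) >= minDelta)
    .map(({ c, delta }) => `  ${c}: ${delta >= 0 ? "+" : ""}${delta.toFixed(3)}`);

  return [subject, "", "categories:", ...moved].join("\n");
}
```

Writing the result to a file and committing with `git commit -F <file>` avoids shell-quoting problems when a target path or rationale contains quotes.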

8. When to stop (convergence)

Two signals, read together: the dev score plateaus AND the held-out score tracks it (if held-out diverges below dev, you're overfitting and should stop earlier than the plateau). From this run:

Iter	Change kept	Dev aggregate	Held-out aggregate	Note
0	(baseline)	0.712	0.701	—
2	+fewshot (refund category)	0.758	0.749	tracks
4	guardrail: PII redact	0.770	0.762	safety floor held
5	prompt: tighten classifier	reverted	—	accuracy +, format −0.05 → gate kill
7	tool: order-status lookup	0.831	0.820	tracks
9	+fewshot (verbatim dev row)	0.858	0.819	held-out STALLED → overfit, reverted
11	prompt: structured output	0.881	0.861	tracks
13	+fewshot (edge case)	0.886	0.864	+0.005, near noise
14	prompt: tone	0.889	0.863	< noise band → no-gain #4

Iter 9 is the lesson: dev jumped, held-out didn't move. A few-shot lifted verbatim from a dev row turned the score into a lookup, exactly what the propose-guide forbids. The held-out split caught it; we reverted.

After 4 consecutive no-gain iterations (moves inside the 0.002 noise band, measured by re-running the baseline 3× and taking the spread), the broad hill-climb is tapped out. Then don't keep grinding — EXPAND the eval set. Mine production failures the current set doesn't cover, add 30–50 examples, bump to evals/v(N+1), re-baseline, and resume. A plateau on a stale eval set means you've climbed the hill you can see — a bigger set surfaces a new one. Plateau ≠ done; plateau = the eval set has stopped teaching you.
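The "held-out tracks dev" signal can be made mechanical. The sketch below is one possible definition, not the original `checkHeldout`: it compares the dev gain with the held-out gain between two checkpoints and flags divergence when dev outruns held-out by more than `maxLag`. The function name and the 0.02 default tolerance are assumptions; the figures in the usage note come from the table above.

```javascript
// prev and cur are checkpoints of the form { dev, heldout } (aggregates).
// Returns true while held-out keeps pace with dev, false once dev has gained
// more than `maxLag` over what held-out gained in the same interval.
function heldoutTracks(prev, cur, opts = {}) {
  const maxLag = opts.maxLag ?? 0.02; // assumed tolerance, tune per eval set
  const devGain = cur.dev - prev.dev;
  const heldoutGain = cur.heldout - prev.heldout;
  return devGain - heldoutGain <= maxLag;
}
```

Between iter 7 (0.831 / 0.820) and iter 9 (0.858 / 0.819) dev gained 0.027 while held-out lost 0.001, a lag of 0.028, so the check fails. Between iter 7 and iter 11 (0.881 / 0.861) the lag is 0.009 and it passes.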

9. How to re-run it

# 1. Baseline the current system on the FROZEN dev set.
node scripts/eval-run.mjs dev           # → evals/v3/runs/dev-*.json  (component scores)

# 2. Run the hill-climb loop (propose ONE change → re-eval → gate → keep/revert).
node scripts/hill-climb.mjs             # commits accepts, logs+reverts rejects

# 3. Check the held-out split tracks (run it yourself; the loop never optimizes it).
node scripts/eval-run.mjs heldout       # must track dev, or you've overfit

# 4. Validate the judge against humans (every ~4 iters).
node scripts/judge-audit.mjs --n 20     # judge↔human agreement ≥ 0.85, else re-baseline

# 5. Plateaued? Grow the set and bump the version, don't grind a dead eval.
node scripts/mine-failures.mjs --from prod-logs --n 40 --out evals/v4/dev.jsonl

10. Why it works

The frozen eval makes "better" a measurement, not a vibe — every change is judged by the same ruler, so the delta means something instead of being post-hoc rationalization.
One change per iteration makes the delta attributable — when the score moves, you know exactly which change moved it, so you keep what works and revert what doesn't with certainty, not hunches.
Component scores make regressions visible — the per-category gate catches the change that lifts the headline while quietly degrading the product; a single number would have shipped it.
The held-out split makes overfitting detectable — it's the only signal that separates "improved the system" from "memorized the eval," and it's the thing that told us iter 9 was a fraud.
The gate makes autonomous proposing safe — an agent can propose freely because nothing that lowers the score or breaks a category floor can survive to commit. A change that doesn't pass the gate didn't happen.

Generated from a support-ticket classifier+responder eval loop (14 iterations, 180 dev / 60 held-out, dev 0.71 → 0.89 tracked by held-out 0.70 → 0.86, 5 changes gate-reverted). Reusable for any prompt / pipeline / agent where quality must be measured, not vibed.
