test-and-fix-loop
A red→green agentic loop for implementing or repairing code against an existing test suite. One iteration runs the test command, parses the FIRST failure, hands an agent only that failure plus the relevant source, takes the MINIMAL diff, and re-runs the WHOLE suite. The test command's exit code is the only oracle — green or it didn't happen. Includes a tamper guard (reject if the agent shrank or weakened the tests to go green), no-progress detection (same failure signature twice = stuck, escalate), flaky-test quarantine (a failure that passes on bare re-run is flaky, not a bug), environment-vs-test triage (a setup failure loops forever if treated as a code failure), and fan-out across independent failing modules. Use when you have a failing test suite or a broken build and want an agent to drive it to green safely — without it cheating by deleting assertions, looping on a flake, or "fixing" snapshots by regenerating them.
A loop that drives a failing test suite to green, one failure at a time, with the **runner
as the only judge**. Each iteration: run the tests, grab the **first** failure, give an
agent *that one failure plus the source it touches* — nothing more — let it make the
**smallest diff that could pass**, then re-run the entire suite. Repeat until the exit code
## Key Points
- **One failure per iteration. Never batch.** Fix failure #1, re-run, *then* look at what's
- **The runner is the only oracle.** Not the agent's belief that it's fixed, not a partial
- **Minimal diff while red. Don't refactor in the dark.** A failing suite is a terrible
- **Re-run the WHOLE suite, not just the one test.** The fix for failure #1 routinely
- **Going green by weakening the test is a failure, not a pass.** Deleting an assertion,
- **A flake is not a bug.** A test that fails then passes on a bare re-run (no code change)
- **Fan out only across independent failures.** Two failing test *files* that share no
- Edit ONLY application/source files, never anything under ${TEST_GLOB}.
- Smallest diff that makes this test pass. No refactors, no renames, no new deps.
- If the test asserts behavior the code can't provide without a redesign, STOP and explain — don't hack the test.`,
1. **Full suite, every iteration.** A green single test with a red suite is a regression you
2. **The failing count must trend strictly down.** `11 → 6 → 3 → 1 → 0`. If it goes up or
## Quick Example
```bash
pytest -q # exit 0 = green. Anything else = still red, regardless of what the agent says.
echo $?
```skilldb get agentic-loops-skills/test-and-fix-loopFull skill: 348 linesTest-and-Fix Loop (Red→Green)
A loop that drives a failing test suite to green, one failure at a time, with the runner as the only judge. Each iteration: run the tests, grab the first failure, give an agent that one failure plus the source it touches — nothing more — let it make the smallest diff that could pass, then re-run the entire suite. Repeat until the exit code is 0, or until a max-iteration / no-progress guard fires and hands it back to you.
It is deliberately boring and paranoid. The interesting work is not "make it pass" — an agent will happily make anything pass by deleting the assertion. The interesting work is the guards that make "pass" mean something: tamper detection, flake quarantine, and a stop signal that fires before the loop starts spinning.
1. The philosophy
- One failure per iteration. Never batch. Fix failure #1, re-run, then look at what's left. Batch a "fix these 4" prompt and you lose attribution: when 2 pass and 2 regress you can't tell which edit did what, and the agent starts shotgun-editing to move the number. One change, one re-run, one verdict.
- The runner is the only oracle. Not the agent's belief that it's fixed, not a partial
test run, not "looks right."
pytest -x; echo $?→0or it didn't happen. Reasoning about correctness is input; the exit code is truth. - Minimal diff while red. Don't refactor in the dark. A failing suite is a terrible place to "clean things up" — you can't tell a fix from a regression. Make the surgical change that flips this one test. Refactor after green, with the suite as a net.
- Re-run the WHOLE suite, not just the one test. The fix for failure #1 routinely breaks failure #5 (shared fixture, global state, an off-by-one now exposed). If you only re-run the test you "fixed," you ship the regression. The full suite is the gate; one test is just the cursor.
- Going green by weakening the test is a failure, not a pass. Deleting an assertion,
loosening
assertEqualtoassertTrue, adding@skip, or--snapshot-updateis the agent gaming the gate. Diff the test files every iteration; if they shrank or softened, reject the change. - A flake is not a bug. A test that fails then passes on a bare re-run (no code change) is non-deterministic. "Fixing" it means chasing a phantom forever. Quarantine and classify it; don't loop on it.
- Fan out only across independent failures. Two failing test files that share no fixture/db/global can be fixed in parallel agents in separate worktrees. Failures inside the same module usually share state — serialize those.
2. The loop (one iteration)
┌──────────────────────────────────────────────────────────────────────┐
│ 1. RUN test cmd → capture stdout/stderr + EXIT CODE │
│ exit 0 ? ──yes──► DONE (green) │
│ 2. PARSE extract the FIRST failure: │
│ { test_id, file:line, expected/actual, traceback } │
│ 3. SIGNATURE hash(test_id + top frame). Same as last iter? │
│ ──yes──► NO-PROGRESS → escalate, stop │
│ 4. FLAKE? re-run just this test once, no edits. │
│ passed ? ──yes──► quarantine + classify, skip it │
│ 5. AGENT give it ONLY: the failure + the source file(s) it names. │
│ → MINIMAL diff to make this one test pass. │
│ 6. GUARD git diff the TEST files. Shrunk / weakened / skipped? │
│ ──yes──► reject diff, re-prompt "fix code not test" │
│ 7. GATE re-run the WHOLE suite. failing count strictly down? │
│ ──no──► revert this diff, count as no-progress │
└──────────────────────────────────────────────────────────────────────┘
▲ │
└──────────────────── next iteration ──────────────────────┘
3. The driver
A tiny orchestrator: run, parse first failure, detect stuck/flake, call the agent, verify. The runner is wrapped so its exit code — never the agent — decides truth.
scripts/test-and-fix.mjs:
import { execSync } from "node:child_process";
import { agent } from "./orchestrator.mjs";
const TEST_CMD = process.env.TEST_CMD || "pytest -x -q"; // -x: stop at first failure
const MAX_ITERS = Number(process.env.MAX_ITERS || 30);
const TEST_GLOB = process.env.TEST_GLOB || "tests/**"; // what counts as "a test file"
// Run tests. Return { code, out, failing } — NEVER throw; non-zero is data, not an error.
function runSuite(cmd = TEST_CMD) {
try { return { code: 0, out: execSync(cmd, { encoding: "utf8" }) }; }
catch (e) { return { code: e.status ?? 1, out: (e.stdout || "") + (e.stderr || "") }; }
}
// Pull the FIRST failure out of the runner output. (pytest/jest dialects differ; key off
// the runner's own machine-readable mode where you can: pytest --tb=short, jest --json.)
function firstFailure(out) {
const id = out.match(/^FAILED (\S+)/m)?.[1] // pytest "FAILED tests/x.py::test_y"
|| out.match(/✕ (.+)$/m)?.[1]; // jest "✕ adds two numbers"
const loc = out.match(/^(.+\.\w+):(\d+):/m); // file:line of the assertion
const top = out.match(/^E\s+.+$/m)?.[0] // pytest assertion line
|| out.match(/Expected.+Received.+/s)?.[0]; // jest diff
if (!id) return null;
return { id, file: loc?.[1], line: loc?.[2], detail: (top || "").trim(),
signature: `${id}::${(top || loc?.[0] || "").slice(0, 120)}` };
}
// Count current failures cheaply (full suite, no -x), so we can prove monotonic descent.
function failingCount() {
const { out } = runSuite(TEST_CMD.replace(" -x", ""));
return (out.match(/(\d+) failed/)?.[1] ?? "0") | 0;
}
// Did the agent cheat by shrinking/weakening the tests? Reject if so.
function testsTampered() {
const stat = execSync(`git diff --numstat -- ${TEST_GLOB}`, { encoding: "utf8" }).trim();
if (!stat) return false; // tests untouched: fine
const removed = stat.split("\n").reduce((n, l) => n + (+l.split("\t")[1] || 0), 0);
const weakened = /\bassert(True|Truthy)\b|@(pytest\.mark\.)?skip|\.skip\(|toBeTruthy|--snapshot-update/
.test(execSync(`git diff -- ${TEST_GLOB}`, { encoding: "utf8" }));
return removed > 0 || weakened; // any deletion or softening = reject
}
let lastSig = null, lastCount = Infinity, stalls = 0;
for (let i = 1; i <= MAX_ITERS; i++) {
const run = runSuite();
if (run.code === 0) { console.log(`✅ GREEN after ${i - 1} iterations`); process.exit(0); }
const f = firstFailure(run.out);
if (!f) { console.error("⛔ Non-zero exit but no parseable failure — ENV/SETUP error, not a test failure."); process.exit(2); }
// NO-PROGRESS: identical failure signature back-to-back means the last fix did nothing.
if (f.signature === lastSig) {
console.error(`⛔ STUCK on ${f.id} (same signature twice). Escalating.\n${f.detail}`);
process.exit(3);
}
// FLAKE: re-run ONLY this test, no edits. If it now passes, it's non-deterministic.
const recheck = runSuite(`${TEST_CMD.replace(" -x", "")} ${f.id.split("::")[0]} -k "${f.id.split("::").pop()}"`);
if (recheck.code === 0) {
console.warn(`🟡 FLAKE quarantined: ${f.id} (failed, then passed on bare re-run). Classify; do NOT 'fix'.`);
execSync(`git stash -u`); // park, mark for human; skip so the loop can make real progress
lastSig = f.signature; continue;
}
// The agent sees ONLY this failure and the source it points at — not the whole suite.
await agent(
`A single test is failing. Make the MINIMAL code change to pass it. Do NOT touch any test file.
FAILING TEST: ${f.id}
LOCATION: ${f.file}:${f.line}
FAILURE:
${f.detail}
RULES:
- Edit ONLY application/source files, never anything under ${TEST_GLOB}.
- Smallest diff that makes this test pass. No refactors, no renames, no new deps.
- If the test asserts behavior the code can't provide without a redesign, STOP and explain — don't hack the test.`,
{ label: `fix:${f.id}` });
// GUARD: the agent must not have weakened the suite to win.
if (testsTampered()) {
console.error(`⛔ TAMPER: test files were shrunk/weakened for ${f.id}. Reverting.`);
execSync(`git checkout -- ${TEST_GLOB}`); // restore tests; the source diff stays for review
process.exit(4);
}
// GATE: re-run the WHOLE suite. Failing count must strictly drop, or the fix is net-negative.
const now = failingCount();
if (now >= lastCount) {
stalls++;
console.warn(`↩︎ count ${lastCount} → ${now} (no improvement). Reverting source diff.`);
execSync(`git checkout -- . ; git clean -fd`); // revert; let next iter re-approach
if (stalls >= 2) { console.error("⛔ Oscillating / no monotonic descent. Escalating."); process.exit(5); }
} else {
console.log(`▸ ${f.id} fixed — failing ${lastCount} → ${now}`);
lastCount = now; stalls = 0;
}
lastSig = f.signature;
}
console.error(`⛔ Hit MAX_ITERS (${MAX_ITERS}) still red. Escalating.`); process.exit(6);
The whole design is in three lines of that file: the agent never sees the suite (only its one failure), the agent never touches the tests (the guard reverts them), and the count must drop (the gate reverts anything that doesn't help).
4. The gate (non-negotiable)
The gate is the test command's exit code on the full suite — not on the single test you just fixed.
pytest -q # exit 0 = green. Anything else = still red, regardless of what the agent says.
echo $?
Two rules make it trustworthy:
- Full suite, every iteration. A green single test with a red suite is a regression you introduced. The cursor is one test; the gate is all of them.
- The failing count must trend strictly down.
11 → 6 → 3 → 1 → 0. If it goes up or sideways, the last diff was net-negative — revert it. A fix that doesn't lower the count isn't a fix.
A change that doesn't drop the failing count, with the tests untouched, didn't happen — the driver reverts it.
5. Hard-won gotchas (each one cost a real debugging cycle)
| Symptom | Cause | Fix |
|---|---|---|
Test goes green but the agent "fixed" it by changing assertEqual(x, 42) to assertTrue(x) | Weakening the assertion is the path of least resistance to a green exit code | git diff the test files every iteration; reject on any deletion or softened matcher (assertTrue, toBeTruthy, @skip). The guard restores them. |
| The loop runs MAX_ITERS, the "failure" never changes, and no diff helps | A setup/environment failure (missing env var, no DB, import error, port in use) surfaces as a non-zero exit and looks like a test failure | Triage first: if exit≠0 but no test is parseable as FAILED (collection error, ModuleNotFoundError), it's environment — stop and fix infra, never hand it to the fix agent |
| Snapshot/golden tests go green instantly across the board | The agent ran --snapshot-update / jest -u / pytest --snapshot-update, "fixing" by overwriting the expected output | Treat any snapshot regeneration as tampering. Snapshots are tests; updating them to match buggy output enshrines the bug. Forbid the flag; require a real code fix |
| Failure #1 fixed, then #2 appears that "wasn't there before" | The #1 fix mutated shared state (fixture, module-level singleton, global counter) and exposed/created #2 | Always re-run the whole suite, not the single test. The count gate catches the regression (count didn't drop → revert) |
| Same test fails, then the bare re-run passes, agent keeps editing | A flaky test (timing, ordering, network, randomness without a seed) — there's no code bug to fix | The flake check re-runs the bare test once; if it passes, quarantine + classify (don't "fix" a phantom). Loop continues on real failures |
| Two agents fixing two files corrupt each other's diff | Fanned out failures that shared a fixture/db/global; parallel edits collided | Fan out only across failures in independent test files with no shared state, each agent in its own git worktree; serialize same-module failures |
Count oscillates 5 → 4 → 5 → 4… forever | Two failures are coupled — fixing A breaks B and vice-versa | The stalls >= 2 / non-monotonic guard stops it. This is a real design problem (over-constrained tests); escalate to a human, don't loop |
6. The guards that make "green" mean something
Three checks run every iteration, before the count is trusted:
- Tamper guard — diff the tests.
git diff --numstat -- tests/**: if lines were removed, or a matcher was softened, or askip/--snapshot-updateappeared, the agent cheated. Revert the tests, keep the source for inspection, and either re-prompt ("fix the code, the test is correct") or escalate. Without this guard the loop converges to a green suite that tests nothing. - No-progress guard — signature the failure. Hash
test_id + top stack frame. The same signature two iterations running means the last edit didn't move the needle — stop, don't burn 28 more iterations on the same wall. Escalate with the failure attached. - Flake guard — bare re-run. Before spending an agent on a failure, re-run just that test with no edits. Pass-on-re-run = non-deterministic; quarantine it (skip + tag for a human) so it can't masquerade as a fixable bug and stall real progress.
7. Fan-out (only when failures are independent)
When the suite has many failures across independent test files (no shared fixture, db, or global), fan out — one agent per file, each in its own worktree so parallel source edits don't collide.
import { parallel } from "./orchestrator.mjs";
// Group failures by test FILE; keep only files that share no fixture (heuristic: distinct
// conftest/setup, no module-level shared singletons). Same-file failures stay serial.
const independent = groupByFile(allFailures).filter(noSharedState);
await parallel(independent.map(group => async () => {
const wt = await worktree(`fix/${slug(group.file)}`); // isolate file mutations
for (const f of group.failures) await fixOne(f, { cwd: wt }); // serial within a file
return { file: group.file, fixed: group.failures.length };
}));
// Then merge worktrees and run the GATE on the union — the full suite, once, exit 0 or bust.
The merge re-runs the whole suite: independent at fix-time does not mean independent at gate-time. If the union is red, a cross-file coupling existed — fall back to serial.
8. Commit & cadence
- One commit per fixed failure, message naming the test:
fix: pass test_user_serializer_handles_null (failing 6→5). Each commit is one attributable change, bisectable and revertible. This is the whole point of one-failure-at-a-time — don't collapse it into a single "make tests pass" mega-commit. - Never commit a red suite. The pre-commit gate is
pytest -q && git commit. If you must checkpoint mid-loop, commit on a branch and squash once green. - Commit the quarantine separately. Flaky tests get their own
test: quarantine flaky test_x (non-deterministic, see #NNN)commit and an issue — visible, not silently skipped.
9. When to stop (convergence)
The signal is the failing count, and that it strictly decreases. A healthy run:
| Iteration | Failing | Action |
|---|---|---|
| 0 (start) | 11 | — |
| 1 | 6 | one fix cleared a shared import error (cascade) |
| 2 | 3 | |
| 3 | 1 | |
| 4 | 0 | ✅ green — stop |
Stop and escalate (don't loop) when any of these fire:
- Count stops descending (sideways or up two iterations running) → coupled/over-constrained tests; a human decision is needed about which behavior is correct.
- Same failure signature twice → the agent has no leverage on this failure; it likely requires a design change the loop shouldn't make autonomously.
- MAX_ITERS reached still red → the remaining failures are harder than the loop's one-minimal-diff budget. Hand back the current diff + the failing list.
Monotonic-down-to-zero is success. Anything that plateaus above zero is a signal to stop, not a reason to raise MAX_ITERS — a loop that can't lower the count won't lower it with more turns, it'll just start gaming the gate.
10. How to re-run it
# Point it at your runner; -x / --maxfail=1 so it stops at the first failure each run.
TEST_CMD="pytest -x -q" MAX_ITERS=30 TEST_GLOB="tests/**" node scripts/test-and-fix.mjs
TEST_CMD="npx vitest run --bail=1" TEST_GLOB="**/*.test.ts" node scripts/test-and-fix.mjs
TEST_CMD="go test ./... -failfast" TEST_GLOB="**/*_test.go" node scripts/test-and-fix.mjs
# Verify the green is real (not a tampered/skipped suite):
git diff --stat $(git merge-base HEAD main) -- tests/ # expect: NO removed test lines
pytest -q ; echo "exit=$?" # expect: exit=0, 0 skipped (or only quarantined)
11. Why it works
- The exit code, not the agent, is the judge — so "I think it's fixed" can't ship a red suite. The runner is the single source of truth and it's mechanical.
- One failure per iteration keeps attribution clean — every green is traceable to exactly one diff, so a regression has exactly one suspect, and a revert is exact.
- The guards make autonomy safe — diffing the tests stops the agent from winning by deleting the goalposts; the flake check stops it chasing ghosts; the count gate stops it shipping net-negative fixes. Without them, an unattended fix-loop converges on a green suite that proves nothing.
- The stop signal keeps it honest — monotonic-down-or-escalate means the loop either makes real progress or admits it's stuck, instead of degrading into busywork that quietly weakens the suite to hit zero.
A red→green loop where the runner is the only oracle, the tests are guarded against the fixer, and the failing count must fall to zero or the loop hands the wall back to you.
Install this skill directly: skilldb add agentic-loops-skills
Related Skills
bug-hunt-loop
An adversarial find → dedup → verify → fix loop that audits a codebase or PR for REAL
data-backfill-loop
A cursor → batch → checkpoint → verify → resume loop for running a transformation over a
eval-driven-loop
An eval → improve-one-thing → re-eval hill-climbing loop for developing an LLM feature
migration-loop
A scout → pipeline → gate-each → residue-loop pattern for a large MECHANICAL change
refactor-under-tests-loop
A characterize → green → tiny-refactor → green loop for restructuring code WITHOUT
research-synthesis-loop
A gather → synthesize → critique-gaps → fill loop that builds a comprehensive, fully-cited