
refactor-under-tests-loop

Refactor-Under-Tests Loop

A repeatable loop that turns "the agent restructured the module and swears it behaves the same" into "the same test suite that passed before passes after — at every single step." It ensures a green characterization net first, then makes one tiny structure-preserving change, runs the FULL suite, and keeps the step only if it stays green — reverting instantly if it goes red — until the target structure is reached. Behavior preservation isn't a promise; it's the green suite, proven again after every edit.

The whole design fights the thing that makes refactors silently dangerous: "it still compiles" and even "the agent reviewed it" are not evidence behavior was preserved. Only a test that pinned the old behavior and still passes is. So the suite is the contract, the contract is frozen, and a change that reddens it didn't happen — it gets reverted, not debugged.


1. The philosophy

  • Characterization tests FIRST — you cannot safely refactor untested code. A refactor's whole claim is "behavior unchanged," and you can only verify unchanged against a recorded baseline. If the target module is thin on tests, the agent writes characterization tests that pin its current observable outputs before touching a line — capturing what it does, not what it should do.
  • Pin behavior including its bugs. A characterization test asserts the actual current output, even if that output is wrong (off-by-one, weird rounding, a swallowed error). The point is to freeze the contract so a refactor can't silently alter it. Fixing the bug is a separate, labeled behavior change — never smuggled into a refactor.
  • Tiny steps, each independently green and revertible. Extract one function. Rename one symbol. Inline one variable. One step = one diff that's green on its own. A refactor that sits red for 20 minutes across five entangled edits can't be bisected when something breaks — you lose the one property that makes this safe.
  • REVERT, don't debug forward. A red step means git checkout and take a smaller step — never fix-forward inside a half-refactored state. Debugging forward from red means you're now changing behavior to chase green, which is the exact failure this loop exists to prevent.
  • Behavior preservation is THE invariant — the passing suite IS the proof. Not the agent's reasoning, not the type-checker, not "looks equivalent to me." The same suite, green before and green after each step, is the only artifact that proves behavior didn't change.
  • If you need to edit a test to make it pass, STOP. Changing a characterization test is changing the behavior contract — that's a behavior change wearing a refactor's clothes. Back it out and do it as a separate, explicitly-labeled change with its own commit.
  • Never mix a refactor commit with a behavior commit. A commit is either "restructured, 0 behavior delta, suite green" or "changed behavior, tests updated to match." One or the other, attributable, so a reviewer (or a bisect) can trust the label.
  • Stop at the structural goal — don't gold-plate. The target is "this module now has the intended shape," not "infinitely cleaner." Once the structure is reached with the suite green throughout, stop. Endless polishing is busywork that only adds risk.
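To make the "pin behavior including its bugs" rule concrete, here is a minimal sketch of a characterization table. `legacyTaxCents`, `roundingTaxCents`, and `runCharacterization` are hypothetical names invented for illustration, and the checks are hand-rolled so the snippet runs without a test runner; in a vitest suite each row would presumably become an `expect(...).toBe(...)` case.

```javascript
// Hypothetical legacy function standing in for the module under refactor.
// It floors to whole cents, which looks like a bug (159.92 becomes 159, not 160).
function legacyTaxCents(subtotalCents, ratePercent) {
  return Math.floor((subtotalCents * ratePercent) / 100);
}

// A well-meaning "refactor" that quietly fixes the rounding: a behavior change.
function roundingTaxCents(subtotalCents, ratePercent) {
  return Math.round((subtotalCents * ratePercent) / 100);
}

// The net: each row pins what the legacy code ACTUALLY returns today.
const pinned = [
  { args: [1999, 8], expected: 159 },   // CHARACTERIZED: looks buggy (rounds down), preserved intentionally
  { args: [-1999, 8], expected: -160 }, // CHARACTERIZED: floor on a negative, preserved intentionally
  { args: [0, 8], expected: 0 },        // boundary: empty subtotal
  { args: [1000, 0], expected: 0 },     // boundary: zero rate
];

// Returns the rows that no longer match; an empty array means the net is green.
function runCharacterization(fn) {
  return pinned.filter(({ args, expected }) => !Object.is(fn(...args), expected));
}

console.log("legacy failures:", runCharacterization(legacyTaxCents).length);
console.log("rounding failures:", runCharacterization(roundingTaxCents).length);
```

The net goes red on the rounding variant even though that variant is arguably more correct. That is the intended outcome: the fix belongs in its own labeled behavior commit, not inside a refactor step.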

2. The loop (one step)

┌──────────────────────────────────────────────────────────────────────┐
│  PRE-FLIGHT (once):  coverage(target) ≥ floor?                        │
│     no  → agent WRITES characterization tests pinning CURRENT output  │
│           → run them → MUST be green (they describe what IS)          │
│     yes → suite already green on the target → proceed                 │
│  ── the net is green and FROZEN; do not edit it again ──             │
│                                                                       │
│  PER STEP (repeat):                                                    │
│   1. SNAPSHOT   git stash-free clean tree; note HEAD                  │
│   2. ONE CHANGE one structure-preserving edit (extract / rename /     │
│                 decompose / modernize) — agent touches NO test        │
│   3. GATE       run the FULL suite (not just the target's tests)      │
│   4a. GREEN  →  assert tests/ unchanged in the diff → commit the step │
│   4b. RED    →  `git checkout -- .`  → take a SMALLER step (go to 2)  │
└─────────────────────────────────────────────┬────────────────────────┘
                                              │
        target structure reached AND green throughout? ── no ──► next step
                          │ yes
                          ▼
                  STOP — don't gold-plate

The non-negotiable shape: the suite is green entering step N and green leaving step N. There is no point in the loop where red is acceptable — red is always "revert and shrink," never "press on and fix."


3. Component A — establish the green net (coverage + characterization)

Before the first refactor, prove the target is actually pinned. Measure coverage of the module you're about to restructure; if it's below the floor, generate characterization tests that assert its current outputs, then confirm they're green. Green characterization tests are the starting line — if they're red on the unchanged code, your net is wrong, not the code.

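import { execFileSync } from "node:child_process";

// NOTE: `agent(prompt, { label })` is not defined in this file. It is assumed to come from
// your agent harness and to resolve once the agent has finished editing the working tree.

// sh: run a command and return its stdout; throws on a non-zero exit.
const sh = (cmd, args) => execFileSync(cmd, args, { encoding: "utf8", stdio: "pipe" });
// run: same, but never throws; returns { code, out } so callers can branch on green/red.
const run = (args) => {
  try { return { code: 0, out: sh(args[0], args.slice(1)) }; }
  catch (e) { return { code: e.status ?? 1, out: (e.stdout ?? "") + (e.stderr ?? "") }; }
};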

const TARGET = "src/pricing/quote.ts";
const COVERAGE_FLOOR = 0.85;   // of the LINES you intend to move/restructure, not the whole repo

// 1) Measure coverage of THE TARGET specifically — not the global %, which hides a cold module.
function coverageOf(file) {
  run(["npx", "vitest", "run", "--coverage", "--coverage.reporter=json-summary"]);
  const summary = JSON.parse(sh("cat", ["coverage/coverage-summary.json"]));
  const key = Object.keys(summary).find(k => k.endsWith(file));
  return key ? summary[key].lines.pct / 100 : 0;
}

// 2) If thin, have the agent WRITE characterization tests pinning CURRENT behavior (bugs incl.).
async function characterize(file) {
  const source = sh("cat", [file]);
  await agent(
    `You are writing CHARACTERIZATION tests for ${file}. The goal is to PIN ITS CURRENT\n` +
    `OBSERVABLE BEHAVIOR so a later refactor can be proven behavior-preserving.\n\n` +
    `RULES:\n` +
    `- Assert what the code ACTUALLY returns/throws/logs RIGHT NOW — run it to find out.\n` +
    `  If a result looks WRONG (bad rounding, off-by-one, a swallowed error), pin the WRONG\n` +
    `  value anyway and add // CHARACTERIZED: looks buggy, preserved intentionally.\n` +
    `- Cover the branches you are about to restructure: every early-return, every error path,\n` +
    `  boundary inputs (empty, zero, negative, max), and at least one realistic happy path.\n` +
    `- Assert CONCRETE observable outputs (exact return values, thrown messages, side effects).\n` +
    `  A test that only asserts "did not throw" or "is truthy" pins NOTHING — forbidden.\n` +
    `- Do NOT change ${file}. Tests only.\n\nSOURCE:\n${source}`,
    { label: `characterize:${file}` },
  );
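  // 3) The net MUST be green on the UNCHANGED code: it describes what IS, so it passes now.
  const charFile = file.replace(/\.ts$/, ".char.test.ts");
  const g = run(["npx", "vitest", "run", charFile]);
  if (g.code !== 0) throw new Error(`characterization net is RED on unchanged code: the tests\n` +
    `describe behavior the code doesn't have. Fix the TESTS, not the code:\n${g.out.slice(-1200)}`);
  // 4) Commit the net on its own before any refactor step. The step loop refuses to start on a
  //    dirty tree, so an uncommitted net would abort the first step.
  //    Assumes the agent wrote exactly `charFile` and that coverage/ is gitignored.
  sh("git", ["add", charFile]);
  sh("git", ["commit", "-m", `test(characterize): pin current behavior of ${file}`]);
  return g;
}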

let cov = coverageOf(TARGET);
console.log(`COVERAGE(${TARGET}) = ${(cov * 100).toFixed(0)}%`);
if (cov < COVERAGE_FLOOR) {
  await characterize(TARGET);
  cov = coverageOf(TARGET);
  console.log(`COVERAGE(${TARGET}) after characterization = ${(cov * 100).toFixed(0)}%`);
}
if (cov < COVERAGE_FLOOR) throw new Error(`net still below floor (${(cov*100)|0}%) — do NOT refactor blind`);

Why coverage of the target, not the repo. A repo at 90% global coverage can have the exact module you're restructuring at 30% — the cold path nobody tested. The global number manufactures false confidence. Gate on the lines you're about to move, or you're refactoring blind under a green dashboard.

The completeness check on the net: a characterization suite is only as good as the branches it pins. Before trusting it, confirm every branch you're about to touch is exercised — a refactor that collapses an untested else is a behavior change you'll never see go red.
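One way to turn the two paragraphs above into a check is to compare the target's own line and branch percentages against the floor, ignoring the repo-wide total. The sketch below assumes the `json-summary` report shape (a `total` entry plus one entry per file, each with `lines` and `branches` percentages); `netGaps` is a hypothetical helper and the numbers are made up for illustration.

```javascript
// Shape mirrors a json-summary coverage report. All figures are invented.
const sampleSummary = {
  total: { lines: { pct: 90 }, branches: { pct: 84 } },
  "/repo/src/pricing/quote.ts": { lines: { pct: 30 }, branches: { pct: 25 } },
  "/repo/src/cart/cart.ts": { lines: { pct: 97 }, branches: { pct: 92 } },
};

// Report which metrics of ONE file sit below the floor (floor is a 0..1 fraction).
// A file missing from the report counts as fully uncovered.
function netGaps(summary, file, floor) {
  const metrics = ["lines", "branches"];
  const key = Object.keys(summary).find((k) => k !== "total" && k.endsWith(file));
  if (!key) return { file, found: false, gaps: metrics };
  const gaps = metrics.filter((metric) => summary[key][metric].pct / 100 < floor);
  return { file, found: true, gaps };
}

console.log("global lines %:", sampleSummary.total.lines.pct);
console.log(netGaps(sampleSummary, "src/pricing/quote.ts", 0.85));
```

In this made-up report the global line figure clears an 85% floor while the target fails on both lines and branches, which is the "green dashboard, cold module" situation described above. Percentages alone do not say which branches are missing; for that you would still need a per-line report.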


4. Component B — the step loop (snapshot → one change → gate → keep/revert)

Each step is a clean-tree snapshot, exactly one structure-preserving edit by the agent, the full suite as the gate, and a hard branch: green ⇒ commit; red ⇒ revert and shrink. The agent is forbidden from touching any test file — verified from the diff, not trusted.

const PLAN = [                                  // the target structure as ORDERED tiny steps
  { id: "extract-tax",     ask: "Extract the inline tax calc (lines computing `subtotal*rate`) into a private `applyTax(subtotal, rate)`; call it. No other change." },
  { id: "rename-amt",      ask: "Rename the local `amt` to `subtotal` everywhere in this module. Pure rename." },
  { id: "decompose-quote", ask: "Split `buildQuote` so line-item assembly moves into `assembleLineItems(cart)`; `buildQuote` calls it. Behavior identical." },
  { id: "modernize-loop",  ask: "Replace the index `for` loop over `items` with `.map`/`.reduce`. Same output." },
];

const fullSuite = () => run(["npx", "vitest", "run", "--reporter=dot"]);
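// Uses `git status --porcelain` rather than `git diff --name-only`: the latter omits untracked
// files, so a test file the agent ADDED would slip past the guard.
const testFilesTouched = () =>
  sh("git", ["status", "--porcelain", "--untracked-files=all"]).split("\n")
    .map(line => line.slice(3).split(" -> ").pop())   // drop the 2-char status + space; keep a rename's new path
    .filter(f => /\.(test|spec)\.[tj]sx?$/.test(f) || /\.char\./.test(f));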

async function runStep(step, attempt = 1) {
  // 1. SNAPSHOT: the tree is clean (last step committed); HEAD is the known-green baseline.
  if (sh("git", ["status", "--porcelain"]).trim()) throw new Error("dirty tree at step start — a prior step left residue");

  // 2. ONE structure-preserving change. On retry, demand a STRICTLY smaller move.
  await agent(
    `Make ONE structure-preserving change to ${TARGET}. ${step.ask}\n` +
    (attempt > 1 ? `PREVIOUS ATTEMPT WENT RED. Take a STRICTLY SMALLER step — do the safest\n` +
                   `sub-part only (e.g. introduce the new function but leave the old call in place).\n` : "") +
    `HARD RULES:\n` +
    `- Do NOT edit, add, or delete any test file. The characterization net is FROZEN.\n` +
    `- Do NOT change observable behavior: same return values, same thrown errors, same side\n` +
    `  effects, same public signatures. This is a refactor, not a fix.\n` +
    `- If you discover a BUG, do NOT fix it here. Note it and preserve current behavior.`,
    { label: `refactor:${step.id}:a${attempt}` },
  );

  // GUARD: the agent must not have touched the contract. If it did, the "refactor" is suspect.
  const tainted = testFilesTouched();
  if (tainted.length) {
    sh("git", ["checkout", "--", "."]);
    sh("git", ["clean", "-fd"]);   // checkout alone leaves files the agent ADDED; remove them too
    throw new Error(`step ${step.id} edited tests ${tainted.join(",")} — that's a behavior-contract\n` +
      `change masquerading as a refactor. Reverted. Split it out as a labeled behavior commit.`);
  }

  // 3. GATE: the FULL suite, not just the target's tests — a refactor can break a distant caller.
  const g = fullSuite();

  if (g.code === 0) {
    // 4a. GREEN → commit THIS step alone. One attributable, revertible structural change.
    sh("git", ["add", "-A"]);
    sh("git", ["commit", "-m", `refactor(${step.id}): structure-only, suite green (0 behavior delta)`]);
    return { id: step.id, ok: true, attempt };
  }

  // 4b. RED → REVERT immediately. Never debug forward inside a half-refactored state.
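  sh("git", ["checkout", "--", "."]);
  sh("git", ["clean", "-fd"]);   // also drop untracked files from the failed attempt, or the retry starts dirty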
  if (attempt >= 4) throw new Error(`step ${step.id} can't go green even at smallest size — the\n` +
    `target structure may itself change behavior, or the net is too tight. STOP and surface:\n${g.out.slice(-800)}`);
  return runStep(step, attempt + 1);   // smaller step, same gate
}

const log = [];
for (const step of PLAN) log.push(await runStep(step));

Two lines carry the whole design:

  • git checkout -- . on red (revert, don't debug forward). A red step is returned to the last green commit on disk, then retried smaller. You never accumulate a half-refactored state, so every commit in the history is independently green and bisectable.
  • the testFilesTouched() guard. If the agent edited the net to chase green, that's the forbidden move — a behavior change disguised as a refactor. Caught from the diff, reverted, surfaced. The contract is frozen by enforcement, not by hope.
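The guard's core is a pure mapping from Git's status output to the test files a step touched, which makes it easy to check in isolation. The sketch below assumes the input comes from `git status --porcelain --untracked-files=all` (so newly added test files are listed as well as edited ones) and that paths contain no characters Git would quote; `touchedTestFiles` is a hypothetical name.

```javascript
// Same patterns the loop treats as "the net": *.test.*, *.spec.*, and *.char.* files.
const TEST_FILE = /\.(test|spec)\.[tj]sx?$|\.char\./;

// Each porcelain (v1) line is "XY <path>", or "XY <old> -> <new>" for a rename.
function touchedTestFiles(porcelain) {
  return porcelain
    .split("\n")
    .filter((line) => line.length > 3)                 // skip blank trailing lines
    .map((line) => line.slice(3).split(" -> ").pop())  // strip status columns; keep the new name of a rename
    .filter((path) => TEST_FILE.test(path));
}

const sampleStatus = [
  " M src/pricing/quote.ts",
  " M src/pricing/quote.char.test.ts",
  "?? src/pricing/extra.spec.ts",
  "R  src/old.test.ts -> src/new.test.ts",
  "",
].join("\n");

console.log(touchedTestFiles(sampleStatus));
```

Note that the raw output must not be trimmed before parsing: an unstaged modification starts with a space in the first status column, and trimming would shift the path of the first line.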

5. The gate (non-negotiable)

A step counts only if BOTH hold, every step, no exceptions:

# (1) The FULL suite is green — same tests that were green before this step.
npx vitest run            # or: pytest -q  /  go test ./...  /  cargo test
# Run the WHOLE suite, not just the target file's tests: a refactor can break a caller
# three modules away (an extracted symbol that was implicitly imported, a changed eval order).

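# (2) No test file appears in the step's changes (tracked edits AND newly added files;
#     plain `git diff --name-only` would miss untracked ones).
git status --porcelain --untracked-files=all | grep -E '\.(test|spec|char)\.' && echo "CONTRACT EDITED: revert" || true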

The characterization net is frozen: editing it to make the suite pass changes the behavior contract, which is precisely not a refactor. Optionally, add a third check for the strictest case — no public-API/signature diff — so the loop guarantees callers don't even see a shape change:

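# Optional: prove the public surface is byte-identical (extract internals freely, but the
# exported signatures must not move). Emit declarations once at the green baseline, again
# after the step, and compare the two trees recursively.
# --emitDeclarationOnly is rejected unless --declaration (or composite) is enabled.
npx tsc --declaration --emitDeclarationOnly --outDir /tmp/dts.before   # run at the green baseline, before the step
npx tsc --declaration --emitDeclarationOnly --outDir /tmp/dts.after && diff -ru /tmp/dts.before /tmp/dts.after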

The rule: a step that reddens the frozen suite didn't refactor — it changed behavior, and it gets reverted, not patched. The passing suite, unchanged, is the entire proof that behavior was preserved. No green, no step.
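The two mandatory conditions reduce to a small decision function. This is a sketch only: `gateStep` and its field names are hypothetical, and it takes the suite's exit code and the list of touched test files as plain inputs so it can be reasoned about without running Git or the test runner.

```javascript
// Decide what happens to one step given the two gate inputs.
// A contract edit is checked first: a green suite means nothing if the net was changed.
function gateStep({ suiteExitCode, touchedTests }) {
  if (touchedTests.length > 0) {
    return { keep: false, action: "revert", reason: `contract edited: ${touchedTests.join(", ")}` };
  }
  if (suiteExitCode !== 0) {
    return { keep: false, action: "revert-and-shrink", reason: "full suite red" };
  }
  return { keep: true, action: "commit", reason: "full suite green, net untouched" };
}

console.log(gateStep({ suiteExitCode: 0, touchedTests: [] }));
console.log(gateStep({ suiteExitCode: 1, touchedTests: [] }));
console.log(gateStep({ suiteExitCode: 0, touchedTests: ["quote.char.test.ts"] }));
```

Ordering the contract check before the suite check is a design choice in this sketch: it keeps "green because the test was edited" from ever being reported as a kept step.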


6. When to stop (convergence)

The signal is steps remaining in the plan → 0, with the suite green at every step. Stop the moment the target structure is reached — do not gold-plate. From a real run decomposing a 180-line buildQuote god-function (coverage 41% → characterized to 88% first):

| Step | Change | Files touched | Attempts | Suite |
|------|--------|---------------|----------|-------|
| 0 | characterization net (12 tests pinning current output, incl. 1 buggy round-down) | quote.char.test.ts | 1 | green |
| 1 | extract applyTax | quote.ts | 1 | green |
| 2 | rename amt → subtotal | quote.ts | 1 | green |
| 3 | decompose assembleLineItems | quote.ts | 2 (1st collapsed an empty-cart branch → red → reverted → smaller) | green |
| 4 | modernize loop → reduce | quote.ts | 1 | green |
|   | target shape reached |  |  | stop (don't gold-plate) |

Read it: Attempts is the honesty column. Step 3 went red on the first try because the agent "helpfully" simplified an empty-cart branch — a behavior change. The gate caught it, the loop reverted, the retry took a smaller step, and it landed green. The buggy round-down pinned in step 0 rode through all four refactors unchanged — which is correct: preserving the bug is how you know you preserved behavior. Fixing it is a separate labeled commit, after.
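The Attempts column can be derived directly from the loop's log. The sketch below assumes log entries shaped like the objects `runStep` returns (`{ id, ok, attempt }`); `summarizeRun` is a hypothetical helper, and the sample log simply restates the four refactor steps from the table.

```javascript
// One entry per completed step, as pushed onto `log` by the step loop.
const sampleLog = [
  { id: "extract-tax", ok: true, attempt: 1 },
  { id: "rename-amt", ok: true, attempt: 1 },
  { id: "decompose-quote", ok: true, attempt: 2 },
  { id: "modernize-loop", ok: true, attempt: 1 },
];

function summarizeRun(log) {
  return {
    steps: log.length,
    // Every attempt beyond the first was a red run that got reverted.
    revertedAttempts: log.reduce((n, step) => n + (step.attempt - 1), 0),
    retried: log.filter((step) => step.attempt > 1).map((step) => step.id),
  };
}

console.log(summarizeRun(sampleLog));
```

A non-empty `retried` list is not a failure of the run; it records where the gate did its job.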

If a step can't go green even at its smallest size (attempt 4), the loop stops and surfaces: either the "refactor" genuinely changes behavior (so it's not a refactor — do it as a behavior change with the net updated and labeled), or the net is over-specified (asserting an implementation detail, not observable behavior). Both are real answers; neither is "force it."


7. How to re-run it

# 1. Pick the target module and measure ITS coverage (not the repo's).
npx vitest run --coverage --coverage.reporter=json-summary
node -e "const s=require('./coverage/coverage-summary.json');const k=Object.keys(s).find(x=>x.endsWith('src/pricing/quote.ts'));console.log(s[k].lines.pct)"

# 2. If thin, generate the characterization net and confirm it's GREEN on the unchanged code.
node refactor-loop.mjs --characterize   # writes *.char.test.ts pinning CURRENT behavior, runs it
npx vitest run quote.char.test.ts        # MUST be green — it describes what IS

# 3. Run the step loop: snapshot → one change → FULL suite → keep/revert, per planned step.
node refactor-loop.mjs                    # logs: step / files / attempts / suite each step
#    Every step that goes green is its own commit; every red step is reverted, never committed.

# 4. There is no separate "final gate" — the suite was green at EVERY step, so HEAD is green
#    by construction. Confirm the public surface didn't move if you took that option:
diff -ru /tmp/dts.before /tmp/dts.after    # -r: the declaration output is a directory tree

To resume mid-plan, the loop reads its own git log: every completed step is a commit, so it picks up at the first planned-but-uncommitted step. Because each commit is independently green, checking out any step gives a valid, working tree. That is the bisectability the tiny steps bought you.
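Resuming can be sketched as a pure function over commit subjects. It assumes the subjects follow the `refactor(<step id>): ...` convention used by the loop's commit message and that they were collected with something like `git log --format=%s`; `remainingSteps` is a hypothetical helper, not part of the script shown earlier.

```javascript
// Return the tail of the plan starting at the first step with no matching commit.
function remainingSteps(plan, commitSubjects) {
  const done = new Set();
  for (const subject of commitSubjects) {
    const match = /^refactor\(([^)]+)\):/.exec(subject);
    if (match) done.add(match[1]);
  }
  const firstPending = plan.findIndex((step) => !done.has(step.id));
  return firstPending === -1 ? [] : plan.slice(firstPending);
}

const samplePlan = [
  { id: "extract-tax" },
  { id: "rename-amt" },
  { id: "decompose-quote" },
  { id: "modernize-loop" },
];

// Newest first, as `git log` prints them; unrelated commits are ignored.
const sampleSubjects = [
  "refactor(rename-amt): structure-only, suite green (0 behavior delta)",
  "chore: bump deps",
  "refactor(extract-tax): structure-only, suite green (0 behavior delta)",
];

console.log(remainingSteps(samplePlan, sampleSubjects).map((step) => step.id));
```

Slicing from the first pending step, rather than filtering out completed ids, keeps the plan's order intact even if a later step was somehow committed out of sequence.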


8. Gotchas (each one cost a real debugging cycle)

| Symptom | Cause | Fix |
|---------|-------|-----|
| Agent "improved" the output mid-refactor; net goes red | The agent fixed a bug / tweaked behavior it noticed while restructuring | That's a behavior change, not a refactor. Revert. Split it into a separate labeled commit after (§1) |
| Suite was already red before the first step | Refactoring on an already-broken baseline: you can't tell preserved from broken | Get to green first. A refactor under a red suite is undefined; fix or skip the failing tests as a separate prior change (§3) |
| Suite stays green but behavior obviously changed in prod | Characterization tests assert nothing meaningful (toBeTruthy, "did not throw") | A test must pin concrete observable output: exact value/error/side-effect. "Truthy" pins nothing (§3) |
| Agent rewrote the module wholesale; gate red, can't bisect | Big-bang rewrite temptation: replaced instead of restructured | Replace ≠ restructure; it's a riskier game with no incremental gate. Force one tiny step per iteration (§1, §4) |
| Step edited a .char.test.ts to make it pass | Net treated as mutable; agent changed the contract to chase green | The net is frozen; the testFilesTouched() guard reverts any test edit on sight (§4) |
| Red step "fixed" by more edits, now a tangled half-refactor | Debugged forward instead of reverting | Revert on red, take a smaller step. Never fix-forward inside a half-refactored state (§1, §4) |
| Coverage dashboard green, refactor still broke a cold path | Gated on global coverage, not the target module's | Measure coverage of the lines you're moving; the global % hides the cold module (§3) |
| Target file's tests pass but a distant caller broke | Ran only the target's tests as the gate | Gate on the full suite: an extracted/renamed symbol can break an importer modules away (§5) |
| Refactor + behavior fix landed in one commit | Mixed the two kinds of change | Never mix. A commit is structure-only OR behavior-with-tests-updated: one, labeled, attributable (§1) |

9. Why it works

  • Characterize-first makes "preserved" verifiable. You can only prove behavior is unchanged against a recorded baseline — so the net that pins the current output (bugs and all) is what converts "trust me" into a falsifiable green/red signal.
  • Tiny steps make every commit bisectable. One structure-preserving change per green commit means when something does break later, git bisect lands on the exact step — a property a 20-minute red rewrite throws away.
  • Revert-on-red makes the loop safe to automate. The agent never accumulates a half-refactored state; the worst case is "no progress this attempt," never "corrupted module the next step builds on."
  • The frozen net makes behavior preservation enforced, not hoped. Forbidding test edits — and catching them from the diff — means the suite stays the contract, so green genuinely means "same behavior," not "agent edited the test until it agreed."
  • Stop-at-the-goal + one-kind-per-commit make the history honest. It ends when the structure is reached instead of gold-plating, and every commit is cleanly attributable to either restructuring or a behavior change — so a reviewer, and a future bisect, can trust the label.

A behavior-preserving refactor loop: characterize to a green net, freeze it, take one tiny revertible step at a time, gate on the FULL suite every step, stop at the structural goal. Reusable for any extract / rename / decompose / modernize under tests.
