
bug-hunt-loop

An adversarial find → dedup → verify → fix loop that audits a codebase or PR for REAL


Bug-Hunt Loop

A repeatable loop that turns "an agent skimmed the diff and listed 30 maybe-bugs" into "here are 6 bugs that are provably real, each with a repro." Each round it fans out diverse finders, dedups against everything seen so far, adversarially verifies each fresh finding with skeptics told to refute it, fixes only the survivors, and re-scans. Run it until the codebase goes quiet: K consecutive rounds that surface nothing new.

The whole design fights two failure modes that make agentic bug hunts worthless — finders that all flag the same obvious thing (redundancy) and verifiers that agree with whatever they're shown (sycophancy). Diversity beats the first; adversarial default-to-refuted verification beats the second.


1. The philosophy

  • Diversity beats redundancy. Five finders with the same prompt converge on the same obvious null-deref and miss the race condition entirely. Give each finder a different lens — correctness, security, concurrency, resource-leak, API-misuse — and they cover different bug classes. N identical finders = 1 finder with N× the cost.
  • A verifier told to "confirm" rubber-stamps. "Is this a bug? Here's the claim" gets you a confident yes on plausible-but-wrong findings. Flip it: spin up N skeptics, each told "your job is to REFUTE this; default to refuted; confirm ONLY on strong, specific evidence (a concrete input that triggers it)." A finding ships only if a majority confirm. This single inversion is what makes the output trustworthy.
  • Dedup against the SEEN set, not the confirmed set. If you only remember what you confirmed, every verifier-rejected finding gets re-found next round, re-verified, re-rejected — forever. The loop never goes dry. Remember everything a finder ever surfaced (by file:line:claim), accepted or rejected, and skip it on sight.
  • One attributable fix per confirmed bug. Each fix is one commit referencing one finding and its repro. A change you can't attribute to a verified bug is a change you can't review or revert cleanly.
  • A hard gate, every fix. A fix isn't done because the agent says so — it's done when the project's existing test + typecheck gate passes. A "fix" that breaks the gate didn't happen.
  • Loop until dry, with K ≥ 2. A fixed round count ("run 3 rounds") misses the long tail of bugs that only surface after earlier ones are fixed. And one empty round can be luck — a finder having a bad day. Require K consecutive dry rounds (K=2 minimum) before you stop.
  • No silent caps. If you top-N findings, sample files, or skip a directory, log what you dropped. A report that silently audited 40 of 120 files but reads as "audited the codebase" is worse than no report — it manufactures false confidence.
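
The last point is easy to lose in practice. A minimal sketch of a cap that reports what it dropped, assuming a hypothetical helper named capWithLog (illustrative, not part of any library):

```javascript
// Hypothetical helper: cap a list at `limit` items, but return what was
// dropped so the run log can report it instead of hiding it.
function capWithLog(items, limit, label) {
  const kept = items.slice(0, limit);
  const dropped = items.slice(limit);
  return {
    kept,
    dropLog: { label, limit, total: items.length, dropped },
  };
}

const files = ['a.js', 'b.js', 'c.js', 'd.js', 'e.js'];
const { kept, dropLog } = capWithLog(files, 3, 'files-audited');
console.log(
  `audited ${kept.length} of ${dropLog.total} files; dropped: ${dropLog.dropped.join(', ')}`,
);
// prints: audited 3 of 5 files; dropped: d.js, e.js
```

The dropLog object is what belongs in the run output, so a reader can see that 2 of 5 files were never audited.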

2. The loop (one round)

┌──────────────────────────────────────────────────────────────────────┐
│  1. FAN-OUT FIND   N finders, ONE lens each, in parallel:             │
│        correctness · security · concurrency · leak · api-misuse       │
│        → each returns {file, line, claim, severity, why, repro?}      │
│  2. DEDUP          drop any finding whose file:line:claim is in the    │
│                    SEEN set (all rounds, accepted OR rejected)         │
│        → fresh findings only                                          │
│  3. ADVERSARIAL    for each fresh finding: M skeptics in parallel,    │
│     VERIFY         each told to REFUTE (default-to-refuted).          │
│        → CONFIRMED iff majority confirm WITH a concrete repro         │
│  4. SEVERITY GATE  drop confirmed-but-trivial (style/nit) — this is   │
│                    a BUG hunt, not a linter                           │
│  5. FIX            one commit per confirmed bug; must pass the         │
│                    project's test + typecheck gate                    │
│  6. RECORD         add every fresh finding (confirmed+rejected) to     │
│                    SEEN; update the dry counter                       │
└──────────────────────────────────────────────────────────────────────┘
        fresh-confirmed this round == 0  ──► dry++   else dry=0
                          ▲                              │
                          └──── re-scan while dry < K ───┘
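
The control flow around the box can be sketched as a small driver. This is an illustration under assumptions: runRound is a hypothetical async function that performs steps 1 to 6 and resolves to the bugs confirmed in that round, and maxRounds is an added safety cap that is not part of the loop as described (hitting it is reported through converged rather than applied silently).

```javascript
// Sketch of the outer loop. `runRound` is a hypothetical async function that
// runs one full round and resolves to the list of bugs it confirmed.
async function huntUntilDry(runRound, K = 2, maxRounds = 50) {
  const rounds = [];
  let dry = 0;
  for (let round = 1; dry < K && round <= maxRounds; round++) {
    const confirmed = await runRound(round);
    // Any confirmed bug resets the streak; an empty round extends it.
    dry = confirmed.length === 0 ? dry + 1 : 0;
    rounds.push({ round, confirmed: confirmed.length, dry });
  }
  // converged === false means the safety cap was hit before K dry rounds.
  return { rounds, converged: dry >= K };
}

// Usage with a stub that replays canned per-round results.
const canned = [['bug-a', 'bug-b'], [], ['bug-c'], [], []];
huntUntilDry(async (round) => canned[round - 1] ?? []).then(({ rounds, converged }) => {
  console.table(rounds);
  console.log('converged:', converged);
});
```

In the canned run, the empty round 2 does not stop the loop because round 3 confirms a bug and resets dry to 0; the loop stops only after rounds 4 and 5 are both empty.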

3. Component A — the diverse finder fan-out

Each finder is a separate agent with a different lens prompt. Same diff, different eyes. Run them with parallel() — they only read, so no file isolation is needed.

const LENSES = {
  correctness: `Off-by-one, null/undefined deref, wrong operator, inverted condition,
                unhandled error path, incorrect type coercion, wrong default, dead/unreachable
                branch that hides a bug, an await that isn't awaited.`,
  security:    `Injection (SQL/shell/path), missing authz check, secret in code/logs,
                unvalidated redirect, SSRF, unsafe deserialization, regex DoS, a user input
                that reaches a sink unsanitized.`,
  concurrency: `Race conditions, check-then-act (TOCTOU), shared mutable state across
                async boundaries, missing lock/await, dedup against the wrong set, a Map
                mutated while iterated, unawaited promise that orders wrongly.`,
  leak:        `File handles / sockets / DB connections / timers / listeners / subscriptions
                opened but not closed on every path (incl. the error path); unbounded caches;
                an array that only grows.`,
  apiMisuse:   `Wrong call order, ignored return value that signals failure, missing
                pagination (silently caps results), float for money, wrong timezone/locale,
                an SDK used against its own docs, a "no silent cap" violation.`,
};

const FIND_SCHEMA = {
  findings: [{ file:'string', line:'number', claim:'string',
               severity:'critical|high|medium|low',
               why:'string', repro:'string|null' }],
};

// Each finder gets the diff + the same target, only the lens differs.
const fresh = (await parallel(Object.entries(LENSES).map(([lens, brief]) => () =>
  agent(
    `You are a ${lens.toUpperCase()} bug finder. Audit ONLY through this lens:\n${brief}\n\n` +
    `TARGET:\n${DIFF_OR_TREE}\n\n` +
    `Report concrete bugs as {file,line,claim,severity,why,repro}. ` +
    `claim = a one-line falsifiable statement ("X is null when Y, deref at L42 throws"). ` +
    `Do NOT report style, naming, or "could be cleaner". Bugs only. ` +
    `If your lens finds nothing, return findings: []. Empty is a valid, honest answer.`,
    { label: `find:${lens}`, schema: FIND_SCHEMA },
  )
))).flatMap(r => r.findings);

Why one lens per agent, not one mega-prompt: a single "find all bugs" prompt anchors on the first/most-obvious issue and stops exploring. Splitting the attention budget across fixed lenses forces coverage of classes the model would otherwise skip — the concurrency finder has to think about races because that's its only job.
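
The fan-out relies on a parallel() helper that the snippet above does not define. One plausible shape, assuming it takes an array of zero-argument functions and resolves to their results in input order, is a thin wrapper over Promise.all. fakeFinder is a stand-in for agent() so the sketch runs without a model.

```javascript
// ASSUMPTION: parallel() takes an array of zero-argument functions ("thunks"),
// starts them all, and resolves to their results in input order.
const parallel = (thunks) => Promise.all(thunks.map((thunk) => thunk()));

// Stand-in for agent(): resolves after a delay with a canned finding list.
const fakeFinder = (lens, delayMs) => () =>
  new Promise((resolve) =>
    setTimeout(() => resolve({ findings: [{ lens, claim: `${lens} finding` }] }), delayMs),
  );

parallel([fakeFinder('security', 20), fakeFinder('leak', 5)])
  .then((results) => results.flatMap((r) => r.findings))
  .then((findings) => console.log(findings.map((f) => f.lens)));
// prints: [ 'security', 'leak' ]  (input order, not completion order)
```

Promise.all rejects as soon as any thunk rejects, so one failed finder fails the whole fan-out. If partial results are preferable, Promise.allSettled is the alternative, with the failed lens recorded in the drop log.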


4. Component B — dedup against ALL-seen

// SEEN persists across rounds. Key on a NORMALIZED claim so trivial rewordings
// of the same bug collapse to one entry.
const seen = new Set();   // "file\tline\tnormalizedClaim"

const keyOf = f => [
  f.file,
  f.line,
  f.claim.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim().slice(0, 80),
].join('\t');

const trulyFresh = fresh.filter(f => {
  const k = keyOf(f);
  // Already surfaced (confirmed OR rejected), or duplicated by another finder this round → skip.
  if (seen.has(k)) return false;
  // CRITICAL: record EVERY fresh finding into SEEN as soon as we've decided to process it,
  // so a finding the skeptics REJECT this round is never re-surfaced next round.
  seen.add(k);
  return true;
});

Recording rejected findings in seen is what lets the loop terminate. If seen only contained confirmed bugs, the security finder would re-flag the same "possible SSRF" every round and the skeptics would re-litigate it every round. Each re-verification costs M agent calls and is one more chance for a noisy majority to flip the verdict and reset dry. Dedup against everything you've ever looked at.
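
To see what the normalized key does and does not collapse, here is the claim normalization from keyOf applied to three sample claims (the claims themselves are made up for illustration):

```javascript
// Same steps as the claim part of keyOf: lowercase, strip punctuation,
// collapse whitespace, trim, cap at 80 characters.
const normalizeClaim = (claim) =>
  claim.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim().slice(0, 80);

const a = normalizeClaim('User is NULL when session expires; deref throws.');
const b = normalizeClaim('user is null when session expires - deref throws');
const c = normalizeClaim('Session object may be missing, causing a crash');

console.log(a === b); // true: differs only in case and punctuation
console.log(a === c); // false: different wording, different key
```

The first two are rewordings that differ only in case and punctuation, so they collapse to one entry. The third may describe the same bug in different words and still gets its own key: the dedup is lexical, not semantic, so some re-verification of paraphrased findings is to be expected.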


5. Component C — adversarial verification (the heart)

Each fresh finding faces M skeptics in parallel. Each skeptic is told to refute, defaults to refuted, and may only confirm with a concrete triggering input. Confirm the finding iff a strict majority confirm.

const M = 3;                          // odd, so "majority" is unambiguous
const NEED = Math.floor(M / 2) + 1;   // 2 of 3

const VERIFY_SCHEMA = { verdict: 'confirmed|refuted', repro: 'string|null', reason: 'string' };

async function verify(finding) {
  const votes = await parallel(Array.from({ length: M }, (_, i) => () =>
    agent(
      `You are skeptic #${i + 1}. A finder CLAIMS this is a bug:\n` +
      `  file: ${finding.file}:${finding.line}\n  claim: ${finding.claim}\n  why: ${finding.why}\n\n` +
      `CODE CONTEXT:\n${readAround(finding.file, finding.line, 60)}\n\n` +
      `Your job is to REFUTE this claim. Default verdict is "refuted". ` +
      `Return "confirmed" ONLY if you can state a CONCRETE input/state that triggers the bug ` +
      `(put it in repro). Reasons it's NOT a bug: a guard upstream, a type that can't be null, ` +
      `the path is unreachable, the caller already validates, it's intended behavior. ` +
      `Be specific. "Looks risky" is REFUTED.`,
      { label: `verify:${finding.file}:${finding.line}:s${i + 1}`, schema: VERIFY_SCHEMA },
    )
  ));
  const confirms = votes.filter(v => v.verdict === 'confirmed' && v.repro);
  return {
    confirmed: confirms.length >= NEED,
    repro: confirms[0]?.repro ?? null,
    votes,
  };
}

const confirmed = [];
for (const f of trulyFresh) {
  const v = await verify(f);
  if (v.confirmed && f.severity !== 'low') confirmed.push({ ...f, repro: v.repro });
  // rejected findings are ALREADY in `seen` (§4) → never re-surfaced
}

Why default-to-refuted and require a repro: an LLM asked "could this be a bug?" finds a story where almost anything could be. Forcing it to name the input that triggers it converts vague unease into a falsifiable claim — and most false positives can't name one, so they die here. A confirmed finding without a repro is a contradiction; drop it.
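
The vote-counting rule is small enough to test in isolation. A sketch, assuming votes shaped like VERIFY_SCHEMA; tally is a hypothetical name for the logic at the end of verify():

```javascript
// A vote counts as a confirm only if it says "confirmed" AND names a repro.
function tally(votes, M = votes.length) {
  const need = Math.floor(M / 2) + 1; // strict majority
  const confirms = votes.filter((v) => v.verdict === 'confirmed' && v.repro);
  return { confirmed: confirms.length >= need, confirms: confirms.length, need };
}

const votes = [
  { verdict: 'confirmed', repro: 'parse("")' },
  { verdict: 'confirmed', repro: null }, // no repro: does not count
  { verdict: 'refuted', repro: null },
];
console.log(tally(votes)); // { confirmed: false, confirms: 1, need: 2 }
```

Two skeptics said "confirmed", but only one could name a triggering input, so the finding is rejected.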


6. The gate (non-negotiable)

A finding ships only if BOTH hold:

  1. It survived adversarial verification (majority confirm) and carries a written repro / "why it's real."
  2. Its fix passes the project's existing gate — whatever the repo already runs in CI:
# Use the repo's OWN gate. Discover it from package.json / Makefile / CI, e.g.:
npm test          # or: pytest -q  /  go test ./...  /  cargo test
npm run typecheck # or: npx tsc --noEmit  /  mypy  /  go vet

Best of all, write a failing test from the repro first, fix the bug, watch it go green. That test is the durable proof the bug was real and stays fixed. A "fix" that doesn't move the gate (no test reddens, then greens) is a fix you can't trust — re-open the finding.

The rule: a finding that doesn't survive verification didn't happen; a fix that doesn't pass the gate didn't happen. The gate is what makes autonomous fixing safe to run unattended.
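
One way an orchestrator might run the gate programmatically is to spawn each command and treat any non-zero exit as red. This is a sketch using Node's child_process.spawnSync; runGate and the command list are assumptions, and the stand-in commands exist only so the example runs without a real project.

```javascript
const { spawnSync } = require('node:child_process');

// Runs each gate command in order and stops at the first failure.
// A command that cannot be started has status null, which also counts as red.
function runGate(commands) {
  for (const [cmd, ...args] of commands) {
    const res = spawnSync(cmd, args, { stdio: 'inherit' });
    if (res.status !== 0) return { ok: false, failed: [cmd, ...args].join(' ') };
  }
  return { ok: true, failed: null };
}

// Stand-in commands. In a real repo these would be the project's own gate,
// e.g. [['npm', 'test'], ['npm', 'run', 'typecheck']].
const node = process.execPath;
const result = runGate([
  [node, '-e', 'process.exit(0)'], // passes
  [node, '-e', 'process.exit(3)'], // fails
]);
console.log(result.ok); // false
```

Returning the failed command alongside ok lets the run log say which half of the gate went red.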


7. Commit & cadence

  • One commit per confirmed bug, message referencing the finding and its repro (fix(parser): handle empty buffer — was OOB read at L88, repro: parse("")). One attributable change per fix keeps each round reviewable and revertible.
  • Never batch unrelated fixes into one commit — if one regresses the gate, you want to revert it, not the whole round.
  • Run the gate after each fix, not once at the end. A round that fixed 4 bugs should be 4 green gate runs, so a regression is attributed to the fix that caused it.
  • At round end, append the rounds table (below) to the run log, including the drop log (files/findings you capped) — no silent caps.
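
The cadence above amounts to a loop with the gate inside it rather than after it. A sketch, where applyFix, runGate, commit, and revert are hypothetical hooks supplied by the orchestrator (for example, wrappers around the editing agent, the test runner, and git):

```javascript
// One fix, one gate run, one commit. All four hooks are hypothetical.
async function fixOneAtATime(confirmed, { applyFix, runGate, commit, revert }) {
  const log = [];
  for (const finding of confirmed) {
    await applyFix(finding);
    if (await runGate()) {
      await commit(finding); // green gate: one attributable commit
      log.push({ finding: finding.claim, status: 'committed' });
    } else {
      await revert(finding); // a fix that breaks the gate didn't happen
      log.push({ finding: finding.claim, status: 'reverted' });
    }
  }
  return log;
}

// Usage with stubs: the first fix passes the gate, the second breaks it.
const gateResults = [true, false];
fixOneAtATime(
  [{ claim: 'empty buffer OOB read' }, { claim: 'unclosed socket on error' }],
  {
    applyFix: async () => {},
    runGate: async () => gateResults.shift(),
    commit: async () => {},
    revert: async () => {},
  },
).then((log) => console.log(log));
```

Because the gate runs per fix, the second finding is reverted on its own and the first commit is untouched.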

8. When to stop (convergence)

The convergence signal is the number of fresh confirmed findings per round falling to 0. Track it, and stop after K=2 consecutive dry rounds (zero fresh-confirmed). A real run against a mid-size service:

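| Round | Found | Fresh (new) | Confirmed | dry |
|-------|-------|-------------|-----------|-----|
| 1 | 31 | 31 | 7 | 0 |
| 2 | 24 | 11 | 3 | 0 |
| 3 | 19 | 6 | 1 | 0 |
| 4 | 17 | 2 | 0 | 1 |
| 5 | 16 | 1 | 0 | 2 → stop |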

Read it: Found declines only slowly (finders keep re-surfacing known issues, which is expected), Fresh decays as the SEEN set saturates, and Confirmed is the number that matters: it trends to 0. Rounds 4 and 5 surface only already-seen or refutable noise: two dry rounds, stop. Stopping at the first dry round (round 4) would have given the same answer here, but only by luck: round 3 was still confirming a real bug, and a single empty round is weak evidence that the tail is exhausted. K=2 is cheap insurance against quitting one round early.

If Confirmed stays > 0 and Fresh isn't decaying, you're not converging: either the target is genuinely buggy or your finders are too broad. Tighten the lenses, or accept that the codebase needs more than a hunt.
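
The dry column can be recomputed from the Confirmed column alone. A sketch that replays the table with K=2; stopRound is a hypothetical helper name:

```javascript
// Replays per-round confirmed counts and reports where the loop would stop.
function stopRound(rounds, K = 2) {
  let dry = 0;
  for (const r of rounds) {
    dry = r.confirmed === 0 ? dry + 1 : 0;
    if (dry >= K) return { stopAt: r.round, dry };
  }
  return { stopAt: null, dry }; // never reached K consecutive dry rounds
}

// Confirmed counts copied from the rounds table.
const table = [
  { round: 1, confirmed: 7 },
  { round: 2, confirmed: 3 },
  { round: 3, confirmed: 1 },
  { round: 4, confirmed: 0 },
  { round: 5, confirmed: 0 },
];
console.log(stopRound(table)); // { stopAt: 5, dry: 2 }
```

A stopAt of null is the non-converging case described above: the run ended without K consecutive dry rounds.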


9. How to re-run it

# 0. Decide the target: a PR diff (precise) or the whole tree (broad).
git diff main...HEAD > /tmp/target.diff        # PR mode
# or point the finders at the tree via ripgrep-scoped file lists.

# 1. Run the orchestrator: parallel finders (§3) → dedup (§4) → parallel skeptics (§5).
#    Persist SEEN and the dry counter ACROSS rounds (a JSON file on disk is fine).

# 2. For each confirmed finding: write a failing test from the repro, fix, run the gate.
npm test && npm run typecheck                  # the repo's own gate (§6)
git commit -m "fix(<area>): <claim> — repro: <input>"

# 3. Re-scan. Stop when dry >= 2. Emit the rounds table + the drop log.

To resume a hunt later, reload SEEN from disk — the loop picks up without re-litigating every previously-rejected finding.
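
Persisting the state can be as simple as the following sketch. The file location and the saveState/loadState names are assumptions; the only real constraint is that a Set has to be converted to an array before JSON.stringify, which would otherwise serialize it as an empty object.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Store the Set as an array so it survives the JSON round trip.
function saveState(file, { seen, dry }) {
  fs.writeFileSync(file, JSON.stringify({ seen: [...seen], dry }, null, 2));
}

function loadState(file) {
  if (!fs.existsSync(file)) return { seen: new Set(), dry: 0 }; // first run
  const { seen, dry } = JSON.parse(fs.readFileSync(file, 'utf8'));
  return { seen: new Set(seen), dry };
}

// Hypothetical location; any path that persists between runs works.
const stateFile = path.join(os.tmpdir(), 'bug-hunt-state.json');
saveState(stateFile, { seen: new Set(['src/a.js\t42\tuser is null']), dry: 1 });

const resumed = loadState(stateFile);
console.log(resumed.seen.has('src/a.js\t42\tuser is null'), resumed.dry); // true 1
```

Saving after every round, not only at the end, means an interrupted hunt resumes with everything it had already rejected.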


10. Gotchas (each one cost a real debugging cycle)

| Symptom | Cause | Fix |
|---------|-------|-----|
| Loop never goes dry: same count of findings every round | Dedup against the confirmed set only → verifier-rejected findings re-surface forever | Dedup against the SEEN set (all fresh findings, confirmed and rejected) (§4) |
| Verifier confirms a plausible-but-wrong bug | Skeptic prompted to "confirm/validate" → sycophancy; agrees with whatever it's shown | Tell it to REFUTE, default-to-refuted, confirm only with a concrete repro (§5) |
| All five finders report the same obvious bug; subtle classes missed | Finder homogeneity: identical prompts anchor on the most-obvious issue | One lens per finder; N identical finders ≠ N finders (§3) |
| "Fixed" a line that was actually correct; gate now red | Skipped verification, or a single sycophantic verifier waved it through | The verification gate (majority of refute-first skeptics + repro) is exactly what prevents this; never fix an unverified finding |
| Report reads "audited the codebase" but missed half the files | Silently top-N'd findings or sampled files to fit a budget | No silent caps: log every dropped file/finding in the run output |
| Two skeptics tie, finding flip-flops between rounds | Even M (e.g. 2) → no majority; nondeterminism decides | Use odd M (3); NEED = floor(M/2)+1 is unambiguous (§5) |
| Round count fixed at 3, tail bugs escape | A fixed budget stops before the SEEN set saturates; one early dry round was luck | Loop-until-dry, K=2 consecutive dry rounds (§8) |
| Gate passes but the bug isn't actually fixed | No test encodes the repro; agent "fixed" by editing around it | Write the failing test from the repro first, then fix (§6) |

11. Why it works

  • Diverse lenses make finders cover classes — the concurrency finder thinks about races because races are its only job, so the bug the generalist would skip gets found.
  • Refute-first skeptics make verification honest — defaulting to refuted and demanding a concrete triggering input kills the plausible-but-wrong findings that make audits noisy.
  • Dedup-against-seen makes the loop terminate — rejected findings stay rejected instead of haunting every round, so "dry" actually arrives.
  • The project's own gate makes fixing safe to automate — fixes ride the same tests the team already trusts, with a new repro test as durable proof.
  • Loop-until-dry + no silent caps make it truthful — it converges to "nothing new survives scrutiny," and it tells you exactly what it looked at, so high precision is earned, not asserted.

A precision-first bug hunt: diverse finders, refute-first skeptics, dedup against all-seen, the repo's own gate, stop when two rounds run dry. Reusable for any codebase or PR audit.
