Skip to main content
Technology & EngineeringAgentic Loops416 lines

data-backfill-loop

A cursor → batch → checkpoint → verify → resume loop for running a transformation over a

Quick Summary21 lines
A repeatable loop for a transformation over a dataset too big for one shot — backfill a new
column from 40M rows, re-index a collection into a new mapping, reprocess every event through a
fixed parser. The naïve approach (one `UPDATE … WHERE …` or one streaming script) fails three
ways: it locks the table for hours, it dies at row 9.7M on a transient blip and restarts from

## Key Points

- **Idempotency is the foundation — a backfill WILL be interrupted.** Every batch must be
- **Checkpoint after every batch — resume from the cursor, not from zero.** A 6-hour job that
- **Bounded batch size — by row count, not key range.** Small enough that a failed batch is
- **Backpressure — the loop paces itself so it doesn't melt the source.** A tight loop over a
- **Verify-as-you-go — catch a bad transform at batch 3, not after 10M rows.** The per-batch
- **Dry-run / shadow first.** Before mutating anything, run the loop **read-only**: compute the
- **Reconciliation proves completeness — cursor-reached-end ≠ all-rows-correct.** The cursor
- **`ON CONFLICT (id) DO UPDATE` (idempotency).** Without it you have `INSERT` — and a retried
- **`ROLLBACK` on a failing gate, cursor unchanged.** A batch that fails verification leaves the
- **Idempotency makes interruption a non-event.** Because a batch is a function of its input —
- **The durable checkpoint makes long jobs finish.** Resume-from-cursor instead of from zero is
- **The per-batch gate makes autonomy safe.** Counts + invariants on every batch catch a bad
skilldb get agentic-loops-skills/data-backfill-loopFull skill: 416 lines
Paste into your CLAUDE.md or agent config

Data-Backfill Loop

A repeatable loop for a transformation over a dataset too big for one shot — backfill a new column from 40M rows, re-index a collection into a new mapping, reprocess every event through a fixed parser. The naïve approach (one UPDATE … WHERE … or one streaming script) fails three ways: it locks the table for hours, it dies at row 9.7M on a transient blip and restarts from zero, and when it finally "finishes" you have no proof every row is actually correct. This loop instead walks a durable cursor batch by batch, processes each batch idempotently behind a per-batch gate, checkpoints after every batch, and resumes from the cursor — never zero — on restart, then proves completeness with a source-vs-destination reconciliation.

The whole design assumes the truth about long jobs: it WILL be interrupted. A deploy, an OOM kill, a network blip, a kill -9. So idempotency isn't a nicety — it's the foundation. If re-running a batch double-counts, every retry corrupts data; if there's no checkpoint, every blip costs you the whole run. Get those two right and the job becomes boring: it crashes, you restart it, it picks up where it left off, and at the end the numbers match.

It was run against a 41.2M-row orders backfill (compute total_cents from a fixed currency bug): batched at 5k, checkpointed per batch, survived two OOM restarts and one deploy, and reconciled 41,203,118 == 41,203,118 with matching checksums at the end.


1. The philosophy

  • Idempotency is the foundation — a backfill WILL be interrupted. Every batch must be safely re-runnable: upsert by key (INSERT … ON CONFLICT … DO UPDATE, MERGE), never a blind INSERT. A batch that crashed after writing 3k of 5k rows but before checkpointing gets re-run from the same cursor — if the write isn't idempotent, those 3k rows get written twice and your counts are silently wrong forever. Idempotent means a crashed/retried batch is a no-op on the rows it already did.
  • Checkpoint after every batch — resume from the cursor, not from zero. A 6-hour job that restarts at 0 on a transient blip never finishes — it dies again before reaching the end, every time. Persist the cursor to durable storage (a row in a _backfill_progress table, not a variable in memory) after each batch commits. Restart reads the cursor and continues.
  • Bounded batch size — by row count, not key range. Small enough that a failed batch is cheap to retry and the per-batch transaction bounds its lock; large enough to amortize round-trips. Tune by p99 batch latency (target ~1–2s/batch). And bound by LIMIT N, not by a key range (id BETWEEN x AND x+5000) — skewed/sparse keys make a range "batch" unboundedly large or empty.
  • Backpressure — the loop paces itself so it doesn't melt the source. A tight loop over a live prod DB will spike replication lag and p99 for real traffic. Watch a signal (batch latency, replica lag) and sleep proportionally when it rises. A backfill that takes down prod is worse than a slow backfill.
  • Verify-as-you-go — catch a bad transform at batch 3, not after 10M rows. The per-batch gate runs inside the loop. If the transform is wrong, you find out 15 seconds in on 15k rows, not 5 hours in on 10M you now have to undo.
  • Dry-run / shadow first. Before mutating anything, run the loop read-only: compute the new value, diff it against the old, log the deltas — don't write. A bad transform caught in shadow mode costs nothing; caught in prod costs a restore.
  • Reconciliation proves completeness — cursor-reached-end ≠ all-rows-correct. The cursor hitting the last row only means the loop ran; it doesn't mean every row is right (a silently skipped batch, a filter that dropped rows). The final gate is a full source-vs-destination count and checksum. Done means that passes, not that the loop stopped erroring.

2. The loop (one batch)

┌──────────────────────────────────────────────────────────────────────┐
│  ON START: read cursor from _backfill_progress (durable).             │
│            cursor = last_key (resume) OR 0 (fresh). NEVER assume 0.    │
│                                                                       │
│   while cursor < END:                                                 │
│     1. READ      next batch: rows WHERE key > cursor ORDER BY key      │
│                  LIMIT N   (bounded by ROW COUNT, not key range)       │
│     2. PROCESS   transform + UPSERT BY KEY in ONE txn (idempotent —    │
│                  a re-run of this batch is a no-op on rows it did)     │
│     3. VERIFY    rows_in == rows_out (or known delta) AND invariants   │
│                  hold AND no nulls where forbidden    ── the GATE      │
│     4a. PASS  →  COMMIT, then persist cursor = max(key) in batch       │
│                  (checkpoint AFTER the data commit, same or next txn)  │
│     4b. FAIL  →  ROLLBACK the batch, HALT. Do NOT advance the cursor.  │
│                  (residue = this batch; a human/agent inspects it)     │
│     5. BACKPRESSURE  sleep ∝ rising batch latency / replica lag        │
└─────────────────────────────────────────────┬────────────────────────┘
                                              │
        cursor reached END? ── no ──► next batch from new checkpoint
              │ yes
              ▼
   RECONCILE (whole-table): source COUNT/checksum == dest COUNT/checksum
        → pass = DONE.  mismatch = NOT done (find the gap).

The checkpoint sits after the data write and the cursor only ever moves on a passing, committed batch. That single ordering is what makes a crash survivable: you can crash before the data commits (batch re-runs cleanly — idempotent), or after data commits but before the cursor persists (batch re-runs cleanly — idempotent), but you can never end up with the cursor ahead of the data. The worst case is re-doing one already-done batch, which is free.


3. Component A — the driver (durable cursor + resume)

The driver reads the cursor from durable storage on start, walks batches, and checkpoints after each one. On restart it reads the same cursor and continues — there is no "from zero" path.

backfill.mjs:

import pg from "pg";
const db = new pg.Pool({ max: 4 });

const TABLE = "orders";
const KEY   = "id";                 // monotonic, indexed key to page by — NOT created_at (dupes)
const BATCH = 5000;                 // bounded by ROW COUNT (tune by p99 latency, below)

// Durable cursor. One row per job; survives process death. Created once:
//   CREATE TABLE IF NOT EXISTS _backfill_progress(
//     job text PRIMARY KEY, cursor bigint NOT NULL DEFAULT 0,
//     done bigint NOT NULL DEFAULT 0, updated_at timestamptz NOT NULL DEFAULT now());
async function readCursor(job) {
  const r = await db.query(
    `INSERT INTO _backfill_progress(job) VALUES($1)
       ON CONFLICT(job) DO UPDATE SET updated_at = now()   -- idempotent ensure-exists
     RETURNING cursor, done`, [job]);
  return r.rows[0];                 // { cursor, done } — resume point, NEVER assumed 0
}

async function endKey() {
  const r = await db.query(`SELECT max(${KEY}) AS m FROM ${TABLE}`);
  return Number(r.rows[0].m ?? 0);  // the convergence denominator
}

export async function run(job, { dryRun = false } = {}) {
  let { cursor, done } = await readCursor(job);
  const END = await endKey();
  console.log(`RESUME job=${job} cursor=${cursor} done=${done} END=${END}` +
              (dryRun ? "  [DRY-RUN: no writes]" : ""));

  while (cursor < END) {
    const t0 = Date.now();
    const { next, rowsIn, rowsOut, gate } = await processBatch(cursor, { dryRun });

    if (rowsIn === 0) break;        // no rows past cursor but cursor<END ⇒ tail gap; reconcile catches it

    // ── THE GATE (§5). A batch that fails does NOT advance the cursor. ──
    if (!gate.ok) {
      console.error(`HALT at cursor=${cursor}: ${gate.why}  (batch rolled back, cursor unchanged)`);
      process.exit(1);             // residue = this batch; inspect, don't spin
    }

    if (!dryRun) {
      // Checkpoint AFTER the data commit. Re-running a batch is free (idempotent §4),
      // so a crash between the data commit and this line costs one redone batch, nothing more.
      await db.query(
        `UPDATE _backfill_progress SET cursor=$1, done=done+$2, updated_at=now() WHERE job=$3`,
        [next, rowsOut, job]);
    }
    cursor = next; done += rowsOut;

    const ms = Date.now() - t0;
    console.log(`batch key>${cursor - BATCH}..${next}  in=${rowsIn} out=${rowsOut}  ` +
                `done=${done}/${END} (${(100 * done / END).toFixed(2)}%)  ${ms}ms  verify=ok`);

    await backpressure(ms);        // §6 — pace by this batch's latency
  }
  console.log(`cursor reached END (${cursor} >= ${END}). Now RECONCILE before declaring done.`);
}

KEY must be monotonic and indexed (a serial id), not created_at — timestamps collide, so paging by > cursor on a non-unique key either skips rows (ties straddling a batch boundary) or loops forever. If your real key isn't unique-monotonic, page by (created_at, id) as a composite and carry both in the cursor.


4. Component B — the idempotent batch (upsert by key)

One batch = read N rows past the cursor, transform them, and upsert by primary key in a single transaction. The upsert is the whole game: re-running the exact same batch writes the exact same rows to the exact same keys — a no-op if they're already there. That is what makes a crashed/retried batch safe.

async function processBatch(cursor, { dryRun }) {
  const client = await db.connect();
  try {
    // READ: bounded by LIMIT (row count), ordered by the monotonic key, strictly past cursor.
    const src = await client.query(
      `SELECT id, amount, currency FROM ${TABLE}
        WHERE id > $1 ORDER BY id ASC LIMIT $2`, [cursor, BATCH]);
    const rows = src.rows;
    if (rows.length === 0) return { next: cursor, rowsIn: 0, rowsOut: 0, gate: { ok: true } };

    // TRANSFORM: the fix. (Pure function — easy to unit-test and to shadow-diff in dry-run.)
    const out = rows.map(r => ({ id: r.id, total_cents: toCents(r.amount, r.currency) }));

    if (dryRun) {                  // SHADOW: compute + diff, write NOTHING
      const deltas = out.filter((o, i) => o.total_cents !== rows[i].total_cents_existing).length;
      return { next: rows.at(-1).id, rowsIn: rows.length, rowsOut: deltas,
               gate: gateBatch(rows, out) };
    }

    // PROCESS: UPSERT BY KEY in ONE txn. ON CONFLICT makes a re-run idempotent —
    // a batch that ran before lands the same values on the same ids: a no-op, not a double-write.
    await client.query("BEGIN");
    const values = out.map((o, i) => `($${2*i+1}::bigint, $${2*i+2}::bigint)`).join(",");
    const params = out.flatMap(o => [o.id, o.total_cents]);
    const upserted = await client.query(
      `INSERT INTO ${TABLE} (id, total_cents) VALUES ${values}
         ON CONFLICT (id) DO UPDATE SET total_cents = EXCLUDED.total_cents
       RETURNING id`, params);
    const gate = gateBatch(rows, out, upserted.rowCount);
    if (!gate.ok) { await client.query("ROLLBACK"); return { next: cursor, rowsIn: rows.length, rowsOut: 0, gate }; }
    await client.query("COMMIT");

    return { next: rows.at(-1).id, rowsIn: rows.length, rowsOut: upserted.rowCount, gate };
  } finally {
    client.release();
  }
}

Two non-obvious lines carry the design:

  • ON CONFLICT (id) DO UPDATE (idempotency). Without it you have INSERT — and a retried batch inserts duplicate rows (or errors and halts the run). With it, the batch is a function of its input: same input, same output rows, run it once or five times. This is the single line that makes "crash and restart" safe instead of corrupting.
  • ROLLBACK on a failing gate, cursor unchanged. A batch that fails verification leaves the table exactly as it was and the cursor exactly where it was. No half-applied batch on disk, no cursor pointing past unwritten data. The batch becomes residue — inspect it, don't advance.

One transaction per batch, not one for the whole table. A single giant UPDATE over 41M rows holds a lock for hours and rolls back the entire run on any error; per-batch commits bound the lock to ~1–2s and make every batch independently recoverable.


5. The gate (non-negotiable)

A batch is counted done only if all three hold — checked inside the loop, every batch:

function gateBatch(rowsIn, rowsOut, written = rowsOut.length) {
  // (1) COUNT: every row read was written (or a KNOWN, asserted delta — e.g. a documented filter).
  if (written !== rowsIn.length)
    return { ok: false, why: `count: read ${rowsIn.length}, wrote ${written} (silent skip)` };

  // (2) INVARIANTS: the transform's contract. Bad transform shows up here, on THIS batch.
  for (const o of rowsOut) {
    if (o.total_cents == null)        return { ok: false, why: `null total_cents at id=${o.id}` };
    if (o.total_cents < 0)            return { ok: false, why: `negative total_cents=${o.total_cents} id=${o.id}` };
    if (!Number.isInteger(o.total_cents)) return { ok: false, why: `non-integer cents id=${o.id}` };
  }
  return { ok: true };
}

And once, at the very end — the reconciliation gate, the proof of completeness:

-- Cursor reaching END means the loop RAN. Reconciliation proves it was CORRECT & COMPLETE.
-- (1) COUNT must match — catches a silently skipped batch the per-batch count missed.
SELECT (SELECT count(*) FROM orders)                              AS src,
       (SELECT count(*) FROM orders WHERE total_cents IS NOT NULL) AS dst;   -- must be equal

-- (2) CHECKSUM must match — catches wrong VALUES that a count alone can't (right rows, bad data).
SELECT md5(string_agg(id || ':' || total_cents, ',' ORDER BY id)) AS dst_checksum
  FROM orders;                                  -- compare to the independently-computed expected

The rule: a batch that fails per-batch verification didn't migrate — it halts the loop and stays residue. And the job isn't done when the cursor reaches the end — it's done when reconciliation passes. A backfill that "stopped erroring" but whose source and destination counts/checksums don't match didn't happen. The reconcile gate is what makes the job provably complete instead of merely finished, and it's what makes running it unattended safe.

Run the transform's invariants as a unit test too (toCents is a pure function) so a bad transform reddens CI before it ever touches a row — the cheapest possible verify-as-you-go.


6. Backpressure — don't melt the source

The loop reads and writes a live prod DB; an unpaced tight loop spikes replica lag and p99 for real users. Pace by a signal — batch latency here; replica lag if you can read it.

let ewma = 0;                       // exponentially-weighted mean batch latency
async function backpressure(ms) {
  ewma = ewma ? 0.8 * ewma + 0.2 * ms : ms;
  // Healthy batch ≈ 1–2s. If a batch is >2× the running mean, the source is straining → back off.
  if (ms > 2 * ewma && ewma > 0) {
    const nap = Math.min(5000, ms);            // cap the nap; don't stall forever
    console.warn(`backpressure: batch ${ms}ms vs ewma ${ewma | 0}ms — sleeping ${nap}ms`);
    await new Promise(r => setTimeout(r, nap));
  } else {
    await new Promise(r => setTimeout(r, 50)); // a baseline breath between batches
  }
}

Tune BATCH by p99 batch latency: if p99 > ~2s, the per-batch lock and the transaction are too big — halve the batch. If batches are <200ms, you're paying round-trip overhead — double it. The batch size is the one knob that trades retry-cost and lock-time against throughput.


7. When to stop (convergence)

Two signals, and the job is done only when both hold: the cursor reached END and reconciliation passes. The progress table from the 41.2M orders run (5k batches), abridged to the points that matter — including the two crashes it survived:

Eventcursordonecumulative %verifynote
start (fresh)000.00%cursor read from durable table = 0
batch 15,0005,0000.01%okp99 batch ≈ 1.3s
ok
OOM kill @ 12.4M12,400,00012,400,00030.10%process dies; cursor durable
resume12,400,00012,400,00030.10%restart reads cursor — not zero
ok
deploy restart @ 27.9M27,900,00027,900,00067.71%resumes from checkpoint again
ok
last batch41,203,11841,203,118100.00%okcursor reached END
RECONCILEPASSsrc 41,203,118 == dst 41,203,118; checksums equal → DONE

Read it: the cursor reaching END (100%) is not the finish line — the RECONCILE PASS is. Had the count come back 41,203,118 vs 41,201,994, the job would be not done despite the cursor being at the end: ~1,100 rows silently skipped (a batch the per-batch count check should have caught — and if it didn't, your gate has a hole). And had count matched but the checksum differed, you'd have the right number of rows with wrong values — a bad transform that the count alone can't see. Convergence = cursor at END AND count match AND checksum match.

If reconciliation fails, you don't have a stop — you have a bug. Find the gap (diff src vs dst keys to locate the missing/wrong rows), fix the transform or the skipped batch, and re-run: because every batch is idempotent, re-running is safe and only the wrong rows change.


8. How to re-run it

# 1. DRY-RUN first — read-only shadow. Compute + diff the new values, write NOTHING.
#    Eyeball the delta count and any gate failures before you trust the transform on prod.
node -e 'import("./backfill.mjs").then(m=>m.run("orders-total-cents",{dryRun:true}))'

# 2. Run for real. Resumes from the durable cursor; on a fresh job that's 0, on a restart
#    it's wherever it died. Crash it (Ctrl-C / OOM / deploy) any time and re-run THIS SAME line:
node -e 'import("./backfill.mjs").then(m=>m.run("orders-total-cents"))'
#    → "RESUME job=orders-total-cents cursor=12400000 done=12400000 END=41203118"   (not zero)

# 3. RECONCILE — the completeness gate. Count AND checksum, source vs destination.
psql -f reconcile.sql        # src==dst count AND checksums equal ⇒ DONE. mismatch ⇒ find the gap.

Because every batch is idempotent and the cursor is durable, re-running step 2 from scratch is always safe: already-processed rows upsert to the same values (a no-op), the cursor skips them entirely, and only un-processed rows do work. There is no "start over" — there is only "continue".


9. Gotchas (each one cost a real debugging cycle)

SymptomCauseFix
Counts double after a crash + restartBlind INSERT (non-idempotent) — the retried batch re-inserts rows it already wroteUpsert by key (ON CONFLICT … DO UPDATE / MERGE); a re-run must be a no-op — §4
Job never finishes — restarts at 0 every blipNo durable checkpoint; cursor lived in memory and died with the processPersist the cursor to a _backfill_progress row after every batch; read it on start — §3
Backfill takes prod down — p99/replica lag spikesA tight loop with no pacing hammers the source DBBackpressure: sleep proportional to rising batch latency / replica lag — §6
One batch locks the whole table for hoursA single giant transaction over all rowsBatch + commit per batch — bound the lock to ~1–2s, one txn per batch — §4
A "batch" is unboundedly large or emptyBounded by key range (id BETWEEN x AND x+5000) over skewed/sparse keysBound by row count (ORDER BY key LIMIT N), page by key > cursor — §4
Reconcile count is short but no error was thrownSilent partial success — a batch reported "ok" but skipped rowsPer-batch rows_in == rows_out count check halts on a skip; reconcile is the backstop — §5
Counts match but values are wrongBad transform — right rows, wrong data; a count can't see itChecksum in reconciliation (not just count); unit-test the transform — §5
Cursor skips or loops rowsPaging by a non-unique key (created_at) — ties straddle batch boundariesPage by a monotonic unique key (id), or a composite (created_at, id) — §3
Crash leaves a half-applied batch on diskCheckpoint written before the data commit, or no rollback on gate failureCommit data first, then checkpoint; ROLLBACK a failing batch, cursor unchanged — §2/§4
Transform bug discovered after 10M rowsNo per-batch gate — verification deferred to the endVerify-as-you-go: the gate runs inside the loop, fails on batch 3 — §5

10. Why it works

  • Idempotency makes interruption a non-event. Because a batch is a function of its input — upsert by key, no blind inserts — a crash, an OOM, a kill -9, a deploy mid-run all cost at most one redone batch. "Restart it" is the entire recovery procedure.
  • The durable checkpoint makes long jobs finish. Resume-from-cursor instead of from zero is the difference between a 6-hour job that completes across three restarts and one that dies at 9.7M forever. The cursor is the progress, and it's on disk.
  • The per-batch gate makes autonomy safe. Counts + invariants on every batch catch a bad transform at batch 3, halt without advancing, and leave the table clean — so the loop can run unattended without quietly corrupting 10M rows.
  • Backpressure makes it safe to run against prod. Pacing by latency means the backfill yields to real traffic instead of melting the source — a slow correct backfill beats a fast outage.
  • Reconciliation makes "done" honest. Cursor-at-end is "the loop ran"; count-and-checksum match is "every row is correct and accounted for." The job converges to provably complete, not to stopped erroring — and that's the only stop worth trusting.

Generated from a 41.2M-row orders backfill (5k batches, durable cursor, upsert-by-key, survived two OOM restarts + a deploy, reconciled count + checksum at the end). Reusable for any large transformation that must survive interruption and be provably complete — schema backfill, re-index, reprocessing job, data migration.

Install this skill directly: skilldb add agentic-loops-skills

Get CLI access →