data-backfill-loop
A cursor → batch → checkpoint → verify → resume loop for running a transformation over a
A repeatable loop for a transformation over a dataset too big for one shot — backfill a new column from 40M rows, re-index a collection into a new mapping, reprocess every event through a fixed parser. The naïve approach (one `UPDATE … WHERE …` or one streaming script) fails three ways: it locks the table for hours, it dies at row 9.7M on a transient blip and restarts from ## Key Points - **Idempotency is the foundation — a backfill WILL be interrupted.** Every batch must be - **Checkpoint after every batch — resume from the cursor, not from zero.** A 6-hour job that - **Bounded batch size — by row count, not key range.** Small enough that a failed batch is - **Backpressure — the loop paces itself so it doesn't melt the source.** A tight loop over a - **Verify-as-you-go — catch a bad transform at batch 3, not after 10M rows.** The per-batch - **Dry-run / shadow first.** Before mutating anything, run the loop **read-only**: compute the - **Reconciliation proves completeness — cursor-reached-end ≠ all-rows-correct.** The cursor - **`ON CONFLICT (id) DO UPDATE` (idempotency).** Without it you have `INSERT` — and a retried - **`ROLLBACK` on a failing gate, cursor unchanged.** A batch that fails verification leaves the - **Idempotency makes interruption a non-event.** Because a batch is a function of its input — - **The durable checkpoint makes long jobs finish.** Resume-from-cursor instead of from zero is - **The per-batch gate makes autonomy safe.** Counts + invariants on every batch catch a bad
skilldb get agentic-loops-skills/data-backfill-loopFull skill: 416 linesData-Backfill Loop
A repeatable loop for a transformation over a dataset too big for one shot — backfill a new
column from 40M rows, re-index a collection into a new mapping, reprocess every event through a
fixed parser. The naïve approach (one UPDATE … WHERE … or one streaming script) fails three
ways: it locks the table for hours, it dies at row 9.7M on a transient blip and restarts from
zero, and when it finally "finishes" you have no proof every row is actually correct. This
loop instead walks a durable cursor batch by batch, processes each batch idempotently behind
a per-batch gate, checkpoints after every batch, and resumes from the cursor — never zero — on
restart, then proves completeness with a source-vs-destination reconciliation.
The whole design assumes the truth about long jobs: it WILL be interrupted. A deploy, an
OOM kill, a network blip, a kill -9. So idempotency isn't a nicety — it's the foundation. If
re-running a batch double-counts, every retry corrupts data; if there's no checkpoint, every
blip costs you the whole run. Get those two right and the job becomes boring: it crashes, you
restart it, it picks up where it left off, and at the end the numbers match.
It was run against a 41.2M-row orders backfill (compute total_cents from a fixed currency
bug): batched at 5k, checkpointed per batch, survived two OOM restarts and one deploy, and
reconciled 41,203,118 == 41,203,118 with matching checksums at the end.
1. The philosophy
- Idempotency is the foundation — a backfill WILL be interrupted. Every batch must be
safely re-runnable: upsert by key (
INSERT … ON CONFLICT … DO UPDATE,MERGE), never a blindINSERT. A batch that crashed after writing 3k of 5k rows but before checkpointing gets re-run from the same cursor — if the write isn't idempotent, those 3k rows get written twice and your counts are silently wrong forever. Idempotent means a crashed/retried batch is a no-op on the rows it already did. - Checkpoint after every batch — resume from the cursor, not from zero. A 6-hour job that
restarts at 0 on a transient blip never finishes — it dies again before reaching the end,
every time. Persist the cursor to durable storage (a row in a
_backfill_progresstable, not a variable in memory) after each batch commits. Restart reads the cursor and continues. - Bounded batch size — by row count, not key range. Small enough that a failed batch is
cheap to retry and the per-batch transaction bounds its lock; large enough to amortize
round-trips. Tune by p99 batch latency (target ~1–2s/batch). And bound by
LIMIT N, not by a key range (id BETWEEN x AND x+5000) — skewed/sparse keys make a range "batch" unboundedly large or empty. - Backpressure — the loop paces itself so it doesn't melt the source. A tight loop over a live prod DB will spike replication lag and p99 for real traffic. Watch a signal (batch latency, replica lag) and sleep proportionally when it rises. A backfill that takes down prod is worse than a slow backfill.
- Verify-as-you-go — catch a bad transform at batch 3, not after 10M rows. The per-batch gate runs inside the loop. If the transform is wrong, you find out 15 seconds in on 15k rows, not 5 hours in on 10M you now have to undo.
- Dry-run / shadow first. Before mutating anything, run the loop read-only: compute the new value, diff it against the old, log the deltas — don't write. A bad transform caught in shadow mode costs nothing; caught in prod costs a restore.
- Reconciliation proves completeness — cursor-reached-end ≠ all-rows-correct. The cursor hitting the last row only means the loop ran; it doesn't mean every row is right (a silently skipped batch, a filter that dropped rows). The final gate is a full source-vs-destination count and checksum. Done means that passes, not that the loop stopped erroring.
2. The loop (one batch)
┌──────────────────────────────────────────────────────────────────────┐
│ ON START: read cursor from _backfill_progress (durable). │
│ cursor = last_key (resume) OR 0 (fresh). NEVER assume 0. │
│ │
│ while cursor < END: │
│ 1. READ next batch: rows WHERE key > cursor ORDER BY key │
│ LIMIT N (bounded by ROW COUNT, not key range) │
│ 2. PROCESS transform + UPSERT BY KEY in ONE txn (idempotent — │
│ a re-run of this batch is a no-op on rows it did) │
│ 3. VERIFY rows_in == rows_out (or known delta) AND invariants │
│ hold AND no nulls where forbidden ── the GATE │
│ 4a. PASS → COMMIT, then persist cursor = max(key) in batch │
│ (checkpoint AFTER the data commit, same or next txn) │
│ 4b. FAIL → ROLLBACK the batch, HALT. Do NOT advance the cursor. │
│ (residue = this batch; a human/agent inspects it) │
│ 5. BACKPRESSURE sleep ∝ rising batch latency / replica lag │
└─────────────────────────────────────────────┬────────────────────────┘
│
cursor reached END? ── no ──► next batch from new checkpoint
│ yes
▼
RECONCILE (whole-table): source COUNT/checksum == dest COUNT/checksum
→ pass = DONE. mismatch = NOT done (find the gap).
The checkpoint sits after the data write and the cursor only ever moves on a passing, committed batch. That single ordering is what makes a crash survivable: you can crash before the data commits (batch re-runs cleanly — idempotent), or after data commits but before the cursor persists (batch re-runs cleanly — idempotent), but you can never end up with the cursor ahead of the data. The worst case is re-doing one already-done batch, which is free.
3. Component A — the driver (durable cursor + resume)
The driver reads the cursor from durable storage on start, walks batches, and checkpoints after each one. On restart it reads the same cursor and continues — there is no "from zero" path.
backfill.mjs:
import pg from "pg";
const db = new pg.Pool({ max: 4 });
const TABLE = "orders";
const KEY = "id"; // monotonic, indexed key to page by — NOT created_at (dupes)
const BATCH = 5000; // bounded by ROW COUNT (tune by p99 latency, below)
// Durable cursor. One row per job; survives process death. Created once:
// CREATE TABLE IF NOT EXISTS _backfill_progress(
// job text PRIMARY KEY, cursor bigint NOT NULL DEFAULT 0,
// done bigint NOT NULL DEFAULT 0, updated_at timestamptz NOT NULL DEFAULT now());
async function readCursor(job) {
const r = await db.query(
`INSERT INTO _backfill_progress(job) VALUES($1)
ON CONFLICT(job) DO UPDATE SET updated_at = now() -- idempotent ensure-exists
RETURNING cursor, done`, [job]);
return r.rows[0]; // { cursor, done } — resume point, NEVER assumed 0
}
async function endKey() {
const r = await db.query(`SELECT max(${KEY}) AS m FROM ${TABLE}`);
return Number(r.rows[0].m ?? 0); // the convergence denominator
}
export async function run(job, { dryRun = false } = {}) {
let { cursor, done } = await readCursor(job);
const END = await endKey();
console.log(`RESUME job=${job} cursor=${cursor} done=${done} END=${END}` +
(dryRun ? " [DRY-RUN: no writes]" : ""));
while (cursor < END) {
const t0 = Date.now();
const { next, rowsIn, rowsOut, gate } = await processBatch(cursor, { dryRun });
if (rowsIn === 0) break; // no rows past cursor but cursor<END ⇒ tail gap; reconcile catches it
// ── THE GATE (§5). A batch that fails does NOT advance the cursor. ──
if (!gate.ok) {
console.error(`HALT at cursor=${cursor}: ${gate.why} (batch rolled back, cursor unchanged)`);
process.exit(1); // residue = this batch; inspect, don't spin
}
if (!dryRun) {
// Checkpoint AFTER the data commit. Re-running a batch is free (idempotent §4),
// so a crash between the data commit and this line costs one redone batch, nothing more.
await db.query(
`UPDATE _backfill_progress SET cursor=$1, done=done+$2, updated_at=now() WHERE job=$3`,
[next, rowsOut, job]);
}
cursor = next; done += rowsOut;
const ms = Date.now() - t0;
console.log(`batch key>${cursor - BATCH}..${next} in=${rowsIn} out=${rowsOut} ` +
`done=${done}/${END} (${(100 * done / END).toFixed(2)}%) ${ms}ms verify=ok`);
await backpressure(ms); // §6 — pace by this batch's latency
}
console.log(`cursor reached END (${cursor} >= ${END}). Now RECONCILE before declaring done.`);
}
KEY must be monotonic and indexed (a serial id), not created_at — timestamps collide,
so paging by > cursor on a non-unique key either skips rows (ties straddling a batch boundary)
or loops forever. If your real key isn't unique-monotonic, page by (created_at, id) as a
composite and carry both in the cursor.
4. Component B — the idempotent batch (upsert by key)
One batch = read N rows past the cursor, transform them, and upsert by primary key in a single transaction. The upsert is the whole game: re-running the exact same batch writes the exact same rows to the exact same keys — a no-op if they're already there. That is what makes a crashed/retried batch safe.
async function processBatch(cursor, { dryRun }) {
const client = await db.connect();
try {
// READ: bounded by LIMIT (row count), ordered by the monotonic key, strictly past cursor.
const src = await client.query(
`SELECT id, amount, currency FROM ${TABLE}
WHERE id > $1 ORDER BY id ASC LIMIT $2`, [cursor, BATCH]);
const rows = src.rows;
if (rows.length === 0) return { next: cursor, rowsIn: 0, rowsOut: 0, gate: { ok: true } };
// TRANSFORM: the fix. (Pure function — easy to unit-test and to shadow-diff in dry-run.)
const out = rows.map(r => ({ id: r.id, total_cents: toCents(r.amount, r.currency) }));
if (dryRun) { // SHADOW: compute + diff, write NOTHING
const deltas = out.filter((o, i) => o.total_cents !== rows[i].total_cents_existing).length;
return { next: rows.at(-1).id, rowsIn: rows.length, rowsOut: deltas,
gate: gateBatch(rows, out) };
}
// PROCESS: UPSERT BY KEY in ONE txn. ON CONFLICT makes a re-run idempotent —
// a batch that ran before lands the same values on the same ids: a no-op, not a double-write.
await client.query("BEGIN");
const values = out.map((o, i) => `($${2*i+1}::bigint, $${2*i+2}::bigint)`).join(",");
const params = out.flatMap(o => [o.id, o.total_cents]);
const upserted = await client.query(
`INSERT INTO ${TABLE} (id, total_cents) VALUES ${values}
ON CONFLICT (id) DO UPDATE SET total_cents = EXCLUDED.total_cents
RETURNING id`, params);
const gate = gateBatch(rows, out, upserted.rowCount);
if (!gate.ok) { await client.query("ROLLBACK"); return { next: cursor, rowsIn: rows.length, rowsOut: 0, gate }; }
await client.query("COMMIT");
return { next: rows.at(-1).id, rowsIn: rows.length, rowsOut: upserted.rowCount, gate };
} finally {
client.release();
}
}
Two non-obvious lines carry the design:
ON CONFLICT (id) DO UPDATE(idempotency). Without it you haveINSERT— and a retried batch inserts duplicate rows (or errors and halts the run). With it, the batch is a function of its input: same input, same output rows, run it once or five times. This is the single line that makes "crash and restart" safe instead of corrupting.ROLLBACKon a failing gate, cursor unchanged. A batch that fails verification leaves the table exactly as it was and the cursor exactly where it was. No half-applied batch on disk, no cursor pointing past unwritten data. The batch becomes residue — inspect it, don't advance.
One transaction per batch, not one for the whole table. A single giant UPDATE over 41M rows
holds a lock for hours and rolls back the entire run on any error; per-batch commits bound the
lock to ~1–2s and make every batch independently recoverable.
5. The gate (non-negotiable)
A batch is counted done only if all three hold — checked inside the loop, every batch:
function gateBatch(rowsIn, rowsOut, written = rowsOut.length) {
// (1) COUNT: every row read was written (or a KNOWN, asserted delta — e.g. a documented filter).
if (written !== rowsIn.length)
return { ok: false, why: `count: read ${rowsIn.length}, wrote ${written} (silent skip)` };
// (2) INVARIANTS: the transform's contract. Bad transform shows up here, on THIS batch.
for (const o of rowsOut) {
if (o.total_cents == null) return { ok: false, why: `null total_cents at id=${o.id}` };
if (o.total_cents < 0) return { ok: false, why: `negative total_cents=${o.total_cents} id=${o.id}` };
if (!Number.isInteger(o.total_cents)) return { ok: false, why: `non-integer cents id=${o.id}` };
}
return { ok: true };
}
And once, at the very end — the reconciliation gate, the proof of completeness:
-- Cursor reaching END means the loop RAN. Reconciliation proves it was CORRECT & COMPLETE.
-- (1) COUNT must match — catches a silently skipped batch the per-batch count missed.
SELECT (SELECT count(*) FROM orders) AS src,
(SELECT count(*) FROM orders WHERE total_cents IS NOT NULL) AS dst; -- must be equal
-- (2) CHECKSUM must match — catches wrong VALUES that a count alone can't (right rows, bad data).
SELECT md5(string_agg(id || ':' || total_cents, ',' ORDER BY id)) AS dst_checksum
FROM orders; -- compare to the independently-computed expected
The rule: a batch that fails per-batch verification didn't migrate — it halts the loop and stays residue. And the job isn't done when the cursor reaches the end — it's done when reconciliation passes. A backfill that "stopped erroring" but whose source and destination counts/checksums don't match didn't happen. The reconcile gate is what makes the job provably complete instead of merely finished, and it's what makes running it unattended safe.
Run the transform's invariants as a unit test too (toCents is a pure function) so a bad
transform reddens CI before it ever touches a row — the cheapest possible verify-as-you-go.
6. Backpressure — don't melt the source
The loop reads and writes a live prod DB; an unpaced tight loop spikes replica lag and p99 for real users. Pace by a signal — batch latency here; replica lag if you can read it.
let ewma = 0; // exponentially-weighted mean batch latency
async function backpressure(ms) {
ewma = ewma ? 0.8 * ewma + 0.2 * ms : ms;
// Healthy batch ≈ 1–2s. If a batch is >2× the running mean, the source is straining → back off.
if (ms > 2 * ewma && ewma > 0) {
const nap = Math.min(5000, ms); // cap the nap; don't stall forever
console.warn(`backpressure: batch ${ms}ms vs ewma ${ewma | 0}ms — sleeping ${nap}ms`);
await new Promise(r => setTimeout(r, nap));
} else {
await new Promise(r => setTimeout(r, 50)); // a baseline breath between batches
}
}
Tune BATCH by p99 batch latency: if p99 > ~2s, the per-batch lock and the transaction are
too big — halve the batch. If batches are <200ms, you're paying round-trip overhead — double it.
The batch size is the one knob that trades retry-cost and lock-time against throughput.
7. When to stop (convergence)
Two signals, and the job is done only when both hold: the cursor reached END and
reconciliation passes. The progress table from the 41.2M orders run (5k batches), abridged to
the points that matter — including the two crashes it survived:
| Event | cursor | done | cumulative % | verify | note |
|---|---|---|---|---|---|
| start (fresh) | 0 | 0 | 0.00% | — | cursor read from durable table = 0 |
| batch 1 | 5,000 | 5,000 | 0.01% | ok | p99 batch ≈ 1.3s |
| … | … | … | … | ok | |
| OOM kill @ 12.4M | 12,400,000 | 12,400,000 | 30.10% | — | process dies; cursor durable |
| resume | 12,400,000 | 12,400,000 | 30.10% | — | restart reads cursor — not zero |
| … | … | … | … | ok | |
| deploy restart @ 27.9M | 27,900,000 | 27,900,000 | 67.71% | — | resumes from checkpoint again |
| … | … | … | … | ok | |
| last batch | 41,203,118 | 41,203,118 | 100.00% | ok | cursor reached END |
| RECONCILE | — | — | — | PASS | src 41,203,118 == dst 41,203,118; checksums equal → DONE |
Read it: the cursor reaching END (100%) is not the finish line — the RECONCILE PASS is.
Had the count come back 41,203,118 vs 41,201,994, the job would be not done despite the
cursor being at the end: ~1,100 rows silently skipped (a batch the per-batch count check should
have caught — and if it didn't, your gate has a hole). And had count matched but the checksum
differed, you'd have the right number of rows with wrong values — a bad transform that the
count alone can't see. Convergence = cursor at END AND count match AND checksum match.
If reconciliation fails, you don't have a stop — you have a bug. Find the gap (diff src vs
dst keys to locate the missing/wrong rows), fix the transform or the skipped batch, and re-run:
because every batch is idempotent, re-running is safe and only the wrong rows change.
8. How to re-run it
# 1. DRY-RUN first — read-only shadow. Compute + diff the new values, write NOTHING.
# Eyeball the delta count and any gate failures before you trust the transform on prod.
node -e 'import("./backfill.mjs").then(m=>m.run("orders-total-cents",{dryRun:true}))'
# 2. Run for real. Resumes from the durable cursor; on a fresh job that's 0, on a restart
# it's wherever it died. Crash it (Ctrl-C / OOM / deploy) any time and re-run THIS SAME line:
node -e 'import("./backfill.mjs").then(m=>m.run("orders-total-cents"))'
# → "RESUME job=orders-total-cents cursor=12400000 done=12400000 END=41203118" (not zero)
# 3. RECONCILE — the completeness gate. Count AND checksum, source vs destination.
psql -f reconcile.sql # src==dst count AND checksums equal ⇒ DONE. mismatch ⇒ find the gap.
Because every batch is idempotent and the cursor is durable, re-running step 2 from scratch is always safe: already-processed rows upsert to the same values (a no-op), the cursor skips them entirely, and only un-processed rows do work. There is no "start over" — there is only "continue".
9. Gotchas (each one cost a real debugging cycle)
| Symptom | Cause | Fix |
|---|---|---|
| Counts double after a crash + restart | Blind INSERT (non-idempotent) — the retried batch re-inserts rows it already wrote | Upsert by key (ON CONFLICT … DO UPDATE / MERGE); a re-run must be a no-op — §4 |
| Job never finishes — restarts at 0 every blip | No durable checkpoint; cursor lived in memory and died with the process | Persist the cursor to a _backfill_progress row after every batch; read it on start — §3 |
| Backfill takes prod down — p99/replica lag spikes | A tight loop with no pacing hammers the source DB | Backpressure: sleep proportional to rising batch latency / replica lag — §6 |
| One batch locks the whole table for hours | A single giant transaction over all rows | Batch + commit per batch — bound the lock to ~1–2s, one txn per batch — §4 |
| A "batch" is unboundedly large or empty | Bounded by key range (id BETWEEN x AND x+5000) over skewed/sparse keys | Bound by row count (ORDER BY key LIMIT N), page by key > cursor — §4 |
| Reconcile count is short but no error was thrown | Silent partial success — a batch reported "ok" but skipped rows | Per-batch rows_in == rows_out count check halts on a skip; reconcile is the backstop — §5 |
| Counts match but values are wrong | Bad transform — right rows, wrong data; a count can't see it | Checksum in reconciliation (not just count); unit-test the transform — §5 |
| Cursor skips or loops rows | Paging by a non-unique key (created_at) — ties straddle batch boundaries | Page by a monotonic unique key (id), or a composite (created_at, id) — §3 |
| Crash leaves a half-applied batch on disk | Checkpoint written before the data commit, or no rollback on gate failure | Commit data first, then checkpoint; ROLLBACK a failing batch, cursor unchanged — §2/§4 |
| Transform bug discovered after 10M rows | No per-batch gate — verification deferred to the end | Verify-as-you-go: the gate runs inside the loop, fails on batch 3 — §5 |
10. Why it works
- Idempotency makes interruption a non-event. Because a batch is a function of its input —
upsert by key, no blind inserts — a crash, an OOM, a
kill -9, a deploy mid-run all cost at most one redone batch. "Restart it" is the entire recovery procedure. - The durable checkpoint makes long jobs finish. Resume-from-cursor instead of from zero is the difference between a 6-hour job that completes across three restarts and one that dies at 9.7M forever. The cursor is the progress, and it's on disk.
- The per-batch gate makes autonomy safe. Counts + invariants on every batch catch a bad transform at batch 3, halt without advancing, and leave the table clean — so the loop can run unattended without quietly corrupting 10M rows.
- Backpressure makes it safe to run against prod. Pacing by latency means the backfill yields to real traffic instead of melting the source — a slow correct backfill beats a fast outage.
- Reconciliation makes "done" honest. Cursor-at-end is "the loop ran"; count-and-checksum match is "every row is correct and accounted for." The job converges to provably complete, not to stopped erroring — and that's the only stop worth trusting.
Generated from a 41.2M-row orders backfill (5k batches, durable cursor, upsert-by-key,
survived two OOM restarts + a deploy, reconciled count + checksum at the end). Reusable for any
large transformation that must survive interruption and be provably complete — schema backfill,
re-index, reprocessing job, data migration.
Install this skill directly: skilldb add agentic-loops-skills
Related Skills
bug-hunt-loop
An adversarial find → dedup → verify → fix loop that audits a codebase or PR for REAL
eval-driven-loop
An eval → improve-one-thing → re-eval hill-climbing loop for developing an LLM feature
migration-loop
A scout → pipeline → gate-each → residue-loop pattern for a large MECHANICAL change
refactor-under-tests-loop
A characterize → green → tiny-refactor → green loop for restructuring code WITHOUT
research-synthesis-loop
A gather → synthesize → critique-gaps → fill loop that builds a comprehensive, fully-cited
self-improvement-loop
A screenshot → critique → improve-one-thing → test loop that systematically develops an