
Recovery & Resume Audit

Purpose

Verify that interrupted workflows can restart safely without losing progress, duplicating work, or leaving the system in an inconsistent state. This audit targets the most frustrating class of production bugs: the ones where users lose work, see phantom failures, or must manually clean up after interruptions.

Real users close tabs, lose network, refresh impatiently, and click retry on things that already succeeded. The system must handle all of this gracefully.


Scope

| Failure Mode | What We Test |
|---|---|
| Browser refresh mid-workflow | Does progress survive? Does the UI reflect actual state? |
| Network interruption | Does the backend continue? Does the client recover? |
| Double trigger (retry) | Does the system deduplicate or create duplicates? |
| Server restart mid-job | Does the job resume or restart safely? |
| Timeout after success | Client times out but server completed; is the result visible? |
| Partial batch completion | 7/20 items done, process dies; does it resume from item 8? |
| Worker crash mid-processing | Is the job retried? Is partial output cleaned up? |

Risk Pattern Table

| Pattern | What It Hits | Risk | Symptom |
|---|---|---|---|
| No checkpoint/bookmark | Batch jobs | HIGH | Interrupted batch restarts from item 1, reprocessing completed items |
| Client-side only progress | UI, UX | HIGH | Refresh loses all progress indication; user retriggers |
| Fire-and-forget mutations | API, Data | HIGH | Client disconnects; server may or may not have committed |
| Missing terminal states | State machine | HIGH | Jobs stuck in "processing" forever after worker crash |
| Optimistic UI without reconciliation | UI | MEDIUM | UI shows success but server failed; no correction displayed |
| No idempotency on retry | API, Data | HIGH | Retry creates duplicate records, charges, or notifications |
| Transaction spanning external calls | DB, API | HIGH | DB transaction committed but external call failed; inconsistent |
| Missing cleanup on failure | Storage, Data | MEDIUM | Failed job leaves orphaned files, partial records |
| Stale progress cache | UI, API | MEDIUM | Progress bar shows old values after resume |
| No distinction between "never started" and "failed" | State machine | MEDIUM | Cannot tell if job needs retry or hasn't been picked up yet |

Pre-Audit Requirements

Before testing, ensure you can:

1. Identify all long-running workflows (> 5 seconds)
2. Access job/task state storage (DB table, Redis, queue)
3. Simulate network interruption (browser DevTools offline mode)
4. Simulate server-side interruption (kill worker process)
5. Observe job state transitions (logs, DB queries, admin panel)
6. Trigger workflows via API (bypassing UI debounce/guards)

Concrete Test Cases

TEST-RR-001: Refresh Midway Through Workflow

Objective: Verify that refreshing the browser during a long-running operation does not lose progress or create duplicates.

Steps:

  1. Start a long-running workflow (e.g., generate assets for a project with 20 items).
  2. Wait until approximately 5 items are complete (visible in UI or logs).
  3. Hard-refresh the browser (Ctrl+Shift+R / Cmd+Shift+R).
  4. Observe the page after reload.

Pass Criteria:

  • The workflow continues processing on the server (not cancelled by disconnect).
  • The UI reflects current progress after reload (shows ~5/20 complete).
  • No duplicate items are created.
  • The "Generate" button is disabled or shows "In Progress" (not available for re-trigger).
  • Completed items are accessible and correct.

Fail Criteria:

  • Workflow restarts from item 1.
  • UI shows 0/20 or no progress indicator.
  • Generate button is active again, inviting a duplicate trigger.
  • Duplicate items appear in the output.

TEST-RR-002: Trigger Workflow Twice (Double Submit)

Objective: Verify that triggering the same workflow twice does not create duplicate work.

Steps:

  1. Start a generation/processing workflow.
  2. Within 2 seconds, trigger it again (via API, double-click, or second tab).
  3. Wait for completion.
  4. Inspect results.

Pass Criteria:

  • Second trigger is rejected with clear message ("Already in progress").
  • OR second trigger is deduplicated silently (same job ID returned).
  • Only one set of results exists.
  • Only one set of external API calls was made (check provider logs/billing).
  • Job state is clean (not two overlapping "processing" entries).

Fail Criteria:

  • Two parallel jobs execute for the same work.
  • Duplicate results created.
  • Double billing on external provider.
  • Race condition: both jobs write to same output, corrupting it.

Implementation Check:

[ ] Mutex / lock on workflow trigger (DB row lock, Redis lock, etc.)
[ ] Idempotency key on create endpoints
[ ] UI disables trigger button on click (optimistic)
[ ] Server checks for existing in-progress job before creating new one
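The last two checks can be sketched with SQLite and an illustrative `jobs` schema (table and column names are assumptions, not a prescribed API): a UNIQUE constraint on the idempotency key makes the duplicate INSERT fail, and the handler returns the existing job instead of creating a new one.

```python
# Sketch of server-side double-submit protection via a UNIQUE idempotency key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        idempotency_key TEXT UNIQUE NOT NULL,
        project_id TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'queued'
    )
""")

def create_job(key, project_id):
    """Create a job, or return the existing job ID for a duplicate key."""
    try:
        cur = conn.execute(
            "INSERT INTO jobs (idempotency_key, project_id) VALUES (?, ?)",
            (key, project_id),
        )
        conn.commit()
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # Duplicate trigger: hand back the same job instead of creating a new one.
        row = conn.execute(
            "SELECT id FROM jobs WHERE idempotency_key = ?", (key,)
        ).fetchone()
        return row[0]

first = create_job("key-123", "test-123")
second = create_job("key-123", "test-123")   # double submit, same key
assert first == second                       # deduplicated: one job, one ID
```

Pushing the uniqueness check into the database (rather than a read-then-write in application code) is what makes this safe under concurrent triggers.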

TEST-RR-003: Force Timeout After Server-Side Success

Objective: Verify that results are preserved when the client times out but the server completed successfully.

Steps:

  1. Start a workflow.
  2. Simulate client timeout: either kill the browser tab, go offline (DevTools), or set a very short client-side timeout.
  3. Wait for the server to complete the job (monitor via logs or DB).
  4. Re-open the page / re-request the resource.

Pass Criteria:

  • The completed result is visible and correct.
  • Job state shows "completed" (not "failed" or "timeout").
  • No error is displayed to the user on revisit.
  • The result is identical to a non-interrupted run.

Fail Criteria:

  • Job marked as "failed" because client disconnected.
  • Result exists but UI shows "error" or "unknown state".
  • Webhook/callback was sent to disconnected client and lost; no retry.

TEST-RR-004: Interrupt Batch at Item 7/20, Verify Resume from Item 8

Objective: Verify that batch processing supports checkpointing and resumes from the last successful item.

Steps:

  1. Start a batch operation on exactly 20 items.
  2. Monitor progress until item 7 completes.
  3. Kill the worker process (simulate crash).
  4. Restart the worker.
  5. Observe what happens to the batch.

Pass Criteria:

  • Items 1-7 are marked as complete and their outputs are preserved.
  • Processing resumes from item 8 (not item 1).
  • Items 1-7 are NOT reprocessed (no duplicate external calls).
  • Total result after completion contains all 20 items, each processed exactly once.
  • Batch state accurately reflects: 7 complete, 13 remaining (then progresses to 20 complete).

Fail Criteria:

  • Batch restarts from item 1, reprocessing all 20 items.
  • Items 1-7 results are lost; batch shows 0/20.
  • Batch is stuck in "processing" with no auto-recovery.
  • Item 7 is partially written / corrupted.

Checkpoint Implementation Check:

[ ] Each item completion is individually recorded (not just batch-level)
[ ] Checkpoint stored in durable storage (DB, not just memory)
[ ] Resume logic: SELECT items WHERE batch_id = ? AND status != 'complete'
[ ] Item-level status: pending | processing | complete | failed
[ ] Batch-level rollup recalculated from item statuses
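The checkpoint pattern above can be sketched end-to-end with SQLite (schema and names are illustrative): each item's completion is committed individually, and a restart simply re-queries unfinished items, so work resumes at item 8 rather than item 1.

```python
# Sketch of per-item checkpointing with durable resume.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE batch_items (
        batch_id TEXT NOT NULL,
        item_no  INTEGER NOT NULL,
        status   TEXT NOT NULL DEFAULT 'pending',
        PRIMARY KEY (batch_id, item_no)
    )
""")
conn.executemany(
    "INSERT INTO batch_items (batch_id, item_no) VALUES ('b1', ?)",
    [(i,) for i in range(1, 21)],
)

def process_batch(batch_id, crash_after=0):
    """Process unfinished items; optionally 'crash' after N completions."""
    done = []
    pending = conn.execute(
        "SELECT item_no FROM batch_items "
        "WHERE batch_id = ? AND status != 'complete' ORDER BY item_no",
        (batch_id,),
    ).fetchall()
    for (item_no,) in pending:
        # ... real per-item work happens here ...
        conn.execute(
            "UPDATE batch_items SET status = 'complete' "
            "WHERE batch_id = ? AND item_no = ?",
            (batch_id, item_no),
        )
        conn.commit()            # checkpoint is durable before moving on
        done.append(item_no)
        if crash_after and len(done) == crash_after:
            break                # simulated worker crash
    return done

first_run = process_batch("b1", crash_after=7)   # worker dies after item 7
second_run = process_batch("b1")                 # restart: resumes at item 8
```

The key property: the commit happens per item, before the next item starts, so the resume query never sees a completed item as pending.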

TEST-RR-005: State Accuracy After Various Interruptions

Objective: Verify that job/workflow states accurately reflect reality after interruptions.

Steps: For each state in the system, verify it is reachable and accurate:

| State | How to Reach | Verification |
|---|---|---|
| queued | Submit job, check before worker picks up | Job exists in queue, no output yet |
| processing | Check during active processing | Worker is actively processing, progress updating |
| completed | Let job finish normally | All outputs exist, all items successful |
| failed | Trigger known failure (bad input, provider down) | Error recorded, partial output cleaned or marked |
| partial | Kill worker mid-batch | Some items complete, others pending/failed |
| retrying | Fail once, observe retry | Retry count incremented, previous attempt recorded |
| cancelled | Cancel during processing | No further processing occurs, partial output accessible |
| timeout | Exceed time limit | Distinguished from "processing"; retry eligible |
| stuck | Should NOT exist as a valid state | Jobs in "processing" for > 2x expected duration flagged |

Pass Criteria:

  • Every state in the table above is explicitly defined in the codebase.
  • Every state is reachable via a real scenario (not just theoretical).
  • No job can be in "processing" for longer than max_duration without being flagged.
  • "Partial" state exists and is distinct from "failed" and "completed".
  • UI accurately reflects each state with clear messaging.

TEST-RR-006: Concurrent Resume After Network Partition

Objective: Verify that a network partition does not cause two workers to process the same job.

Steps:

  1. Start a job on Worker A.
  2. Simulate network partition (Worker A loses DB connectivity but keeps processing).
  3. Job visibility timeout expires; Worker B picks up the "abandoned" job.
  4. Worker A reconnects and attempts to write results.

Pass Criteria:

  • Only one worker's results are committed.
  • The other worker detects the conflict and backs off.
  • No duplicate outputs.
  • Job state is consistent (not marked complete by both).

Implementation Check:

[ ] Job locking with lease/heartbeat (not permanent lock)
[ ] Lease timeout shorter than job timeout
[ ] Write-time version check (optimistic concurrency)
[ ] Worker checks lock ownership before writing results
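The write-time ownership check can be sketched as a compare-and-set UPDATE (SQLite here; schema and worker names are illustrative): the stale worker's write matches zero rows and is rejected.

```python
# Sketch: a worker may commit results only if it still owns the job row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        owner TEXT,
        status TEXT NOT NULL DEFAULT 'processing',
        result TEXT
    )
""")
# Worker A originally held the lease; after its lease expired during the
# partition, Worker B reclaimed the job, so owner is now 'worker-b'.
conn.execute("INSERT INTO jobs (id, owner) VALUES (1, 'worker-b')")

def commit_result(job_id, worker, result):
    """Commit only if this worker still owns the job (compare-and-set)."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'completed', result = ? "
        "WHERE id = ? AND owner = ? AND status = 'processing'",
        (result, job_id, worker),
    )
    conn.commit()
    return cur.rowcount == 1   # 0 rows updated => lost the lease, back off

a_wins = commit_result(1, "worker-a", "stale result")   # rejected
b_wins = commit_result(1, "worker-b", "fresh result")   # committed
```

An equivalent guard uses a version column incremented on every lease grant, compared at write time (optimistic concurrency); the effect is the same.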

TEST-RR-007: Progress Persistence Across Sessions

Objective: Verify that progress information survives session changes.

Steps:

  1. Start a long workflow (20+ items).
  2. Close the browser entirely.
  3. Open a new browser session and navigate to the same page.
  4. Check if progress is visible and accurate.
  5. Wait for completion. Verify results are accessible.

Pass Criteria:

  • Progress is stored server-side (not just in browser state/localStorage).
  • New session shows current progress without re-triggering.
  • Completed items are immediately accessible.
  • No "stale" progress (showing old numbers from previous session).

TEST-RR-008: Graceful Degradation on External Service Failure

Objective: Verify that external service failures during a workflow result in clear, recoverable state.

Steps:

  1. Start a workflow that depends on an external service (AI provider, storage, email).
  2. Mid-workflow, simulate the external service going down (mock 500 errors).
  3. Observe behavior: does the workflow retry, fail gracefully, or hang?
  4. Restore the external service.
  5. Retry or resume the workflow.

Pass Criteria:

  • Items that failed due to external service are marked with specific error (not generic "Unknown error").
  • Items that succeeded before the outage are preserved.
  • Retry targets only the failed items.
  • External service errors are logged with response details.
  • User sees actionable message: "3 items failed due to provider timeout. Retry?"

Fail Criteria:

  • Entire batch marked as failed, losing successful items.
  • Generic error with no diagnostic information.
  • Infinite retry loop against down service.
  • No way to retry just the failed items.
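The retry-only-failed-items behavior can be sketched like this (the external call and outage flag are simulated stand-ins): each failure is recorded per item with a specific error, successful results are preserved, and the retry pass receives exactly the failed items.

```python
# Sketch of item-level failure isolation during an external-service outage.
def process(items, call_external):
    """Process each item independently; collect successes and failures."""
    results, failed = {}, []
    for item in items:
        try:
            results[item] = call_external(item)
        except Exception as exc:
            # Record a specific, per-item error (not a generic batch failure).
            failed.append((item, f"provider error: {exc}"))
    return results, failed

outage = {"down": True}
def external(item):
    """Simulated provider: even-numbered items fail during the outage."""
    if outage["down"] and item % 2 == 0:
        raise RuntimeError("503 from provider")
    return f"ok-{item}"

results, failed = process(range(1, 7), external)   # 3 succeed, 3 fail
outage["down"] = False                             # provider recovers
retry_results, retry_failed = process([i for i, _ in failed], external)
results.update(retry_results)                      # full set, each done once
```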

Recovery Architecture Checklist

CHECKPOINT STORAGE
[ ] Durable storage for job progress (database, not memory/Redis-only)
[ ] Per-item completion tracking for batch operations
[ ] Progress queryable by job ID from any server instance
[ ] Checkpoint writes are atomic (no partial checkpoint)

RETRY LOGIC
[ ] Retry count limit configured (max 3-5 retries)
[ ] Exponential backoff between retries
[ ] Jitter added to prevent thundering herd
[ ] Distinct handling: retryable errors vs permanent failures
[ ] Retry reuses idempotency key (no duplicates)
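The first three retry items can be sketched as follows (delay values are illustrative, and the sleep is skipped so the example runs instantly): bounded attempts, exponential backoff capped at 30 seconds, full jitter, and a hard split between retryable and permanent errors.

```python
# Sketch of bounded retries with exponential backoff and full jitter.
import random

class RetryableError(Exception):
    """Transient failure (timeout, 503): safe to retry."""

class PermanentError(Exception):
    """Bad input or unrecoverable state: never retry."""

def run_with_retry(task, max_retries=4, base=1.0, cap=30.0):
    for attempt in range(max_retries + 1):
        try:
            return task()
        except PermanentError:
            raise                                   # permanent: fail fast
        except RetryableError:
            if attempt == max_retries:
                raise                               # retry budget exhausted
            # Full jitter: uniform over [0, min(cap, base * 2^attempt)],
            # which spreads out retries and avoids a thundering herd.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            _ = delay   # time.sleep(delay) in real code; skipped here

calls = {"n": 0}
def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RetryableError("provider timeout")
    return "ok"

result = run_with_retry(flaky)   # succeeds on the third attempt
```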

STATE TRANSITIONS
[ ] All valid states are enumerated in code (enum, not strings)
[ ] Invalid transitions are rejected (e.g., completed -> processing)
[ ] Terminal states: completed, failed, cancelled (no further transitions)
[ ] Timeout detection: jobs in "processing" beyond max_duration are flagged
[ ] "Partial" state exists for batch operations

UI RECOVERY
[ ] Polling or websocket for live progress updates
[ ] Reconnection logic when websocket drops
[ ] Page load fetches current state from server (not cache)
[ ] Clear distinction between "loading state" and "no active job"
[ ] Retry button only appears when retry is safe

State Verification Query Template

```sql
-- Find jobs stuck in non-terminal states
SELECT id, status, started_at, updated_at,
       EXTRACT(EPOCH FROM (NOW() - updated_at)) as seconds_since_update
FROM jobs
WHERE status IN ('processing', 'queued', 'retrying')
  AND updated_at < NOW() - INTERVAL '10 minutes'
ORDER BY updated_at ASC;

-- Verify batch checkpoint accuracy
SELECT batch_id,
       COUNT(*) as total_items,
       COUNT(*) FILTER (WHERE status = 'complete') as completed,
       COUNT(*) FILTER (WHERE status = 'failed') as failed,
       COUNT(*) FILTER (WHERE status = 'pending') as pending
FROM batch_items
GROUP BY batch_id
HAVING COUNT(*) FILTER (WHERE status = 'processing') > 0
   AND MAX(updated_at) < NOW() - INTERVAL '5 minutes';
-- ^ These are stuck batches: items "processing" but no updates in 5 min
```

Post-Audit Checklist

[ ] All long-running workflows have checkpoint/bookmark capability
[ ] Browser refresh during any workflow shows accurate progress
[ ] Double-trigger is safely deduplicated
[ ] Batch operations resume from last checkpoint after interruption
[ ] All job states are explicitly defined and reachable
[ ] Stuck job detection exists (processing > max_duration)
[ ] External service failures are isolated to affected items only
[ ] Retry targets only failed items, not the entire batch
[ ] Progress is persisted server-side, not only in client state
[ ] Error messages include enough detail for user to decide on retry

What Earlier Audits Miss

Standard testing verifies that workflows complete successfully. This audit matters because:

  • Happy-path tests never interrupt a workflow mid-execution. They miss that progress is stored only in memory.
  • Error handling tests verify that errors are caught, not that the system can resume from the error point.
  • Retry tests verify the retry mechanism works, not that it is idempotent and does not duplicate work.
  • UI tests verify rendering, not that refreshing mid-operation preserves state.
  • Load tests verify throughput under sustained load, not behavior when load is interrupted.

In short, this is a Recovery & Resume Audit: it specifically tests whether interrupted workflows restart safely, without data loss, duplication, or inconsistent state, under browser refresh, network loss, server restart, and worker crash conditions.


Automation Opportunities

| Test | Automatable? | Method |
|---|---|---|
| TEST-RR-001: Refresh mid-workflow | PARTIAL | Selenium: trigger workflow, refresh, assert progress visible |
| TEST-RR-002: Double submit | YES | Concurrent API requests with same payload; assert deduplication |
| TEST-RR-003: Timeout after success | YES | Mock slow response, kill client, verify result persists |
| TEST-RR-004: Batch checkpoint | YES | Start batch, kill worker at item 7, restart, assert resume from 8 |
| TEST-RR-005: State accuracy | YES | Put entities in each state via API; verify against expected |
| TEST-RR-006: Concurrent resume | PARTIAL | Requires simulating network partition; complex test setup |
| TEST-RR-007: Progress persistence | YES | Start workflow, clear session, reopen, assert progress visible |
| TEST-RR-008: External failure | YES | Mock external service errors mid-batch; verify partial success |

```bash
# Automated double-submit test.
# BASE is an assumption: point it at your API and adjust endpoint/table names.
BASE="${BASE:-http://localhost:8080}"
KEY=$(uuidgen)
curl -s -X POST "$BASE/api/generate" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $KEY" \
  -d '{"project_id": "test-123"}' &
curl -s -X POST "$BASE/api/generate" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $KEY" \
  -d '{"project_id": "test-123"}' &
wait
# Assert: only one job created in database
JOB_COUNT=$(psql -t -A -c "SELECT COUNT(*) FROM jobs WHERE project_id = 'test-123' AND status != 'cancelled'")
[ "$JOB_COUNT" -eq 1 ] && echo "PASS" || echo "FAIL: $JOB_COUNT jobs created"
```

Reusable Audit Report Template

```
# Recovery & Resume Audit Report

## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________

## Long-Running Workflows Identified
| Workflow | Duration | Checkpoint? | Resume? | Idempotent Retry? |
|----------|---------|------------|---------|-------------------|
| ___ | ___s | yes/no | yes/no | yes/no |

## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-RR-001 | Refresh mid-workflow | PASS/FAIL | Progress visible after refresh: yes/no |
| TEST-RR-002 | Double submit | PASS/FAIL | Duplicates created: ___ |
| TEST-RR-003 | Timeout after success | PASS/FAIL | Result preserved: yes/no |
| TEST-RR-004 | Batch checkpoint | PASS/FAIL | Resumed from item: ___ (expected: 8) |
| TEST-RR-005 | State accuracy | PASS/FAIL | ___ states inaccurate |
| TEST-RR-006 | Concurrent resume | PASS/FAIL | Duplicate processing: yes/no |
| TEST-RR-007 | Progress persistence | PASS/FAIL | Server-side progress: yes/no |
| TEST-RR-008 | External failure | PASS/FAIL | Successful items preserved: yes/no |

## Score: PASS / PARTIAL / FAIL
```

Priority Targeting

Run this audit FIRST if:

  • Users report "I refreshed and lost everything"
  • Jobs get stuck in "processing" and require manual DB fixes
  • Retry creates duplicates
  • Batch operations are all-or-nothing (no partial success)
  • The system has no job/task status dashboard
  • External API calls are unreliable (> 1% failure rate)
