Recovery & Resume Audit
Purpose
Verify that interrupted workflows can restart safely without losing progress, duplicating work, or leaving the system in an inconsistent state. This audit targets the most frustrating class of production bugs: the ones where users lose work, see phantom failures, or must manually clean up after interruptions.
Real users close tabs, lose network, refresh impatiently, and click retry on things that already succeeded. The system must handle all of this gracefully.
Scope
| Failure Mode | What We Test |
|---|---|
| Browser refresh mid-workflow | Does progress survive? Does the UI reflect actual state? |
| Network interruption | Does the backend continue? Does the client recover? |
| Double trigger (retry) | Does the system deduplicate or create duplicates? |
| Server restart mid-job | Does the job resume or restart safely? |
| Timeout after success | Client times out but server completed; is the result visible? |
| Partial batch completion | 7/20 items done, process dies; does it resume from item 8? |
| Worker crash mid-processing | Is the job retried? Is partial output cleaned up? |
Risk Pattern Table
| Pattern | What It Hits | Risk | Symptom |
|---|---|---|---|
| No checkpoint/bookmark | Batch jobs | HIGH | Interrupted batch restarts from item 1, reprocessing completed items |
| Client-side only progress | UI, UX | HIGH | Refresh loses all progress indication; user retriggers |
| Fire-and-forget mutations | API, Data | HIGH | Client disconnects; server may or may not have committed |
| Missing terminal states | State machine | HIGH | Jobs stuck in "processing" forever after worker crash |
| Optimistic UI without reconciliation | UI | MEDIUM | UI shows success but server failed; no correction displayed |
| No idempotency on retry | API, Data | HIGH | Retry creates duplicate records, charges, or notifications |
| Transaction spanning external calls | DB, API | HIGH | DB transaction committed but external call failed; inconsistent |
| Missing cleanup on failure | Storage, Data | MEDIUM | Failed job leaves orphaned files, partial records |
| Stale progress cache | UI, API | MEDIUM | Progress bar shows old values after resume |
| No distinction between "never started" and "failed" | State machine | MEDIUM | Cannot tell if job needs retry or hasn't been picked up yet |
Pre-Audit Requirements
Before testing, ensure you can:
1. Identify all long-running workflows (> 5 seconds)
2. Access job/task state storage (DB table, Redis, queue)
3. Simulate network interruption (browser DevTools offline mode)
4. Simulate server-side interruption (kill worker process)
5. Observe job state transitions (logs, DB queries, admin panel)
6. Trigger workflows via API (bypassing UI debounce/guards)
Concrete Test Cases
TEST-RR-001: Refresh Midway Through Workflow
Objective: Verify that refreshing the browser during a long-running operation does not lose progress or create duplicates.
Steps:
- Start a long-running workflow (e.g., generate assets for a project with 20 items).
- Wait until approximately 5 items are complete (visible in UI or logs).
- Hard-refresh the browser (Ctrl+Shift+R / Cmd+Shift+R).
- Observe the page after reload.
Pass Criteria:
- The workflow continues processing on the server (not cancelled by disconnect).
- The UI reflects current progress after reload (shows ~5/20 complete).
- No duplicate items are created.
- The "Generate" button is disabled or shows "In Progress" (not available for re-trigger).
- Completed items are accessible and correct.
Fail Criteria:
- Workflow restarts from item 1.
- UI shows 0/20 or no progress indicator.
- Generate button is active again, inviting a duplicate trigger.
- Duplicate items appear in the output.
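The pass criteria above hinge on one design rule: progress lives server-side, and every page load re-fetches it. A minimal sketch (the `JobStore` class and method names are illustrative, not from any specific framework):

```python
# Server-side progress as the source of truth: a hard refresh discards all
# client state, but re-fetching from the store still shows accurate progress.
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    total: int
    completed: int = 0
    status: str = "processing"


class JobStore:
    """Durable-store stand-in: progress is recorded here, never only in the client."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}

    def create(self, job_id: str, total: int) -> Job:
        job = Job(job_id, total)
        self._jobs[job_id] = job
        return job

    def record_item_done(self, job_id: str) -> None:
        job = self._jobs[job_id]
        job.completed += 1
        if job.completed == job.total:
            job.status = "completed"

    def get_progress(self, job_id: str) -> dict:
        """What the page fetches on every load -- including after a refresh."""
        job = self._jobs[job_id]
        return {"completed": job.completed, "total": job.total, "status": job.status}


store = JobStore()
store.create("job-1", total=20)
for _ in range(5):
    store.record_item_done("job-1")

# Simulated hard refresh: the client keeps nothing; the fetch still shows 5/20.
print(store.get_progress("job-1"))  # {'completed': 5, 'total': 20, 'status': 'processing'}
```

A UI built this way passes TEST-RR-001 by construction: the reload path and the live-update path read the same server-side record.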
TEST-RR-002: Trigger Workflow Twice (Double Submit)
Objective: Verify that triggering the same workflow twice does not create duplicate work.
Steps:
- Start a generation/processing workflow.
- Within 2 seconds, trigger it again (via API, double-click, or second tab).
- Wait for completion.
- Inspect results.
Pass Criteria:
- Second trigger is rejected with clear message ("Already in progress").
- OR second trigger is deduplicated silently (same job ID returned).
- Only one set of results exists.
- Only one set of external API calls was made (check provider logs/billing).
- Job state is clean (not two overlapping "processing" entries).
Fail Criteria:
- Two parallel jobs execute for the same work.
- Duplicate results created.
- Double billing on external provider.
- Race condition: both jobs write to same output, corrupting it.
Implementation Check:
[ ] Mutex / lock on workflow trigger (DB row lock, Redis lock, etc.)
[ ] Idempotency key on create endpoints
[ ] UI disables trigger button on click (optimistic)
[ ] Server checks for existing in-progress job before creating new one
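The last two checklist items can be combined: the server holds a lock while checking for an existing job under the same idempotency key, and returns that job instead of creating a second one. A minimal in-process sketch (a real system would use a DB row lock or Redis lock instead of `threading.Lock`; all names here are illustrative):

```python
# Dedup sketch: a repeated trigger with the same idempotency key returns the
# existing job id rather than creating a duplicate.
import threading
import uuid


class JobService:
    def __init__(self) -> None:
        self._lock = threading.Lock()      # stand-in for a DB row lock / Redis lock
        self._by_key: dict[str, str] = {}  # idempotency key -> job id

    def trigger(self, idempotency_key: str) -> tuple[str, bool]:
        """Returns (job_id, created). A second call with the same key dedups."""
        with self._lock:
            if idempotency_key in self._by_key:
                return self._by_key[idempotency_key], False
            job_id = str(uuid.uuid4())
            self._by_key[idempotency_key] = job_id
            return job_id, True


svc = JobService()
first, created_a = svc.trigger("key-123")
second, created_b = svc.trigger("key-123")  # double submit
assert first == second and created_a and not created_b
```

Returning the same job id silently (rather than erroring) matches the second pass criterion above and makes client-side retries safe by default.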
TEST-RR-003: Force Timeout After Server-Side Success
Objective: Verify that results are preserved when the client times out but the server completed successfully.
Steps:
- Start a workflow.
- Simulate client timeout: either kill the browser tab, go offline (DevTools), or set a very short client-side timeout.
- Wait for the server to complete the job (monitor via logs or DB).
- Re-open the page / re-request the resource.
Pass Criteria:
- The completed result is visible and correct.
- Job state shows "completed" (not "failed" or "timeout").
- No error is displayed to the user on revisit.
- The result is identical to a non-interrupted run.
Fail Criteria:
- Job marked as "failed" because client disconnected.
- Result exists but UI shows "error" or "unknown state".
- Webhook/callback was sent to disconnected client and lost; no retry.
TEST-RR-004: Interrupt Batch at Item 7/20, Verify Resume from Item 8
Objective: Verify that batch processing supports checkpointing and resumes from the last successful item.
Steps:
- Start a batch operation on exactly 20 items.
- Monitor progress until item 7 completes.
- Kill the worker process (simulate crash).
- Restart the worker.
- Observe what happens to the batch.
Pass Criteria:
- Items 1-7 are marked as complete and their outputs are preserved.
- Processing resumes from item 8 (not item 1).
- Items 1-7 are NOT reprocessed (no duplicate external calls).
- Total result after completion contains all 20 items, each processed exactly once.
- Batch state accurately reflects: 7 complete, 13 remaining (then progresses to 20 complete).
Fail Criteria:
- Batch restarts from item 1, reprocessing all 20 items.
- Items 1-7 results are lost; batch shows 0/20.
- Batch is stuck in "processing" with no auto-recovery.
- Item 7 is partially written / corrupted.
Checkpoint Implementation Check:
[ ] Each item completion is individually recorded (not just batch-level)
[ ] Checkpoint stored in durable storage (DB, not just memory)
[ ] Resume logic: SELECT items WHERE batch_id = ? AND status != 'complete'
[ ] Item-level status: pending | processing | complete | failed
[ ] Batch-level rollup recalculated from item statuses
TEST-RR-005: State Accuracy After Various Interruptions
Objective: Verify that job/workflow states accurately reflect reality after interruptions.
Steps: For each state in the system, verify it is reachable and accurate:
| State | How to Reach | Verification |
|---|---|---|
| queued | Submit job, check before worker picks up | Job exists in queue, no output yet |
| processing | Check during active processing | Worker is actively processing, progress updating |
| completed | Let job finish normally | All outputs exist, all items successful |
| failed | Trigger known failure (bad input, provider down) | Error recorded, partial output cleaned or marked |
| partial | Kill worker mid-batch | Some items complete, others pending/failed |
| retrying | Fail once, observe retry | Retry count incremented, previous attempt recorded |
| cancelled | Cancel during processing | No further processing occurs, partial output accessible |
| timeout | Exceed time limit | Distinguished from "processing"; retry eligible |
| stuck | Should NOT exist as valid state | Jobs in "processing" for > 2x expected duration flagged |
Pass Criteria:
- Every state in the table above is explicitly defined in the codebase.
- Every state is reachable via a real scenario (not just theoretical).
- No job can be in "processing" for longer than max_duration without being flagged.
- "Partial" state exists and is distinct from "failed" and "completed".
- UI accurately reflects each state with clear messaging.
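Explicitly enumerating states and their allowed transitions is what makes the criteria above checkable. A simplified sketch covering a subset of the states in the table (a real system would also enforce this at the DB layer, e.g. with conditional UPDATEs; the names are illustrative):

```python
# State machine sketch: states are an enum, transitions are an explicit map,
# and terminal states have no outgoing edges.
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Terminal states (completed, failed, cancelled) map to the empty set.
TRANSITIONS: dict[JobState, set[JobState]] = {
    JobState.QUEUED: {JobState.PROCESSING, JobState.CANCELLED},
    JobState.PROCESSING: {JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
    JobState.CANCELLED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Reject any transition not in the map -- e.g. completed -> processing."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.value} -> {target.value}")
    return target


state = transition(JobState.QUEUED, JobState.PROCESSING)
state = transition(state, JobState.COMPLETED)
try:
    transition(state, JobState.PROCESSING)  # rejected: completed is terminal
except ValueError:
    pass
```

Extending this with `partial`, `retrying`, and `timeout` is mechanical once the transition map exists; "stuck" correctly never appears as a state, only as a monitoring condition.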
TEST-RR-006: Concurrent Resume After Network Partition
Objective: Verify that a network partition does not cause two workers to process the same job.
Steps:
- Start a job on Worker A.
- Simulate network partition (Worker A loses DB connectivity but keeps processing).
- Job visibility timeout expires; Worker B picks up the "abandoned" job.
- Worker A reconnects and attempts to write results.
Pass Criteria:
- Only one worker's results are committed.
- The other worker detects the conflict and backs off.
- No duplicate outputs.
- Job state is consistent (not marked complete by both).
Implementation Check:
[ ] Job locking with lease/heartbeat (not permanent lock)
[ ] Lease timeout shorter than job timeout
[ ] Write-time version check (optimistic concurrency)
[ ] Worker checks lock ownership before writing results
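The write-time version check in the list above is what saves you when the partitioned worker reconnects. A minimal compare-and-swap sketch (in a real system the version bump and the conditional write would be single atomic DB statements; all names are illustrative):

```python
# Optimistic-concurrency sketch: claiming a job bumps its version, so a stale
# worker's result write is rejected after another worker takes over.
class JobRecord:
    def __init__(self) -> None:
        self.version = 0
        self.owner: str | None = None
        self.result: str | None = None

    def claim(self, worker: str) -> int:
        """Take the lease; bumping the version invalidates stale holders."""
        self.version += 1
        self.owner = worker
        return self.version

    def write_result(self, worker: str, seen_version: int, result: str) -> bool:
        """Commit only if no one re-claimed the job since this worker's claim."""
        if seen_version != self.version or self.owner != worker:
            return False              # stale worker detects the conflict, backs off
        self.result = result
        return True


job = JobRecord()
v_a = job.claim("worker-a")           # A starts processing, then gets partitioned
v_b = job.claim("worker-b")           # lease expires; B takes over
assert job.write_result("worker-b", v_b, "done-by-b")
assert not job.write_result("worker-a", v_a, "done-by-a")  # A's late write rejected
assert job.result == "done-by-b"
```

This is why the lease must expire (heartbeat-based) rather than be permanent: expiry lets Worker B take over, and the version check makes that takeover safe.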
TEST-RR-007: Progress Persistence Across Sessions
Objective: Verify that progress information survives session changes.
Steps:
- Start a long workflow (20+ items).
- Close the browser entirely.
- Open a new browser session and navigate to the same page.
- Check if progress is visible and accurate.
- Wait for completion. Verify results are accessible.
Pass Criteria:
- Progress is stored server-side (not just in browser state/localStorage).
- New session shows current progress without re-triggering.
- Completed items are immediately accessible.
- No "stale" progress (showing old numbers from previous session).
TEST-RR-008: Graceful Degradation on External Service Failure
Objective: Verify that external service failures during a workflow result in clear, recoverable state.
Steps:
- Start a workflow that depends on an external service (AI provider, storage, email).
- Mid-workflow, simulate the external service going down (mock 500 errors).
- Observe behavior: does the workflow retry, fail gracefully, or hang?
- Restore the external service.
- Retry or resume the workflow.
Pass Criteria:
- Items that failed due to external service are marked with specific error (not generic "Unknown error").
- Items that succeeded before the outage are preserved.
- Retry targets only the failed items.
- External service errors are logged with response details.
- User sees actionable message: "3 items failed due to provider timeout. Retry?"
Fail Criteria:
- Entire batch marked as failed, losing successful items.
- Generic error with no diagnostic information.
- Infinite retry loop against down service.
- No way to retry just the failed items.
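Targeting only the failed items requires two things the fail criteria call out: per-item error recording and a distinction between retryable and permanent errors. A sketch (the error taxonomy here is illustrative, not a standard):

```python
# Partial-retry sketch: only 'failed' items with a retryable error class are
# selected for re-processing; successful items are never touched.
RETRYABLE = {"provider_timeout", "http_500", "rate_limited"}


def items_to_retry(items: list[dict]) -> list[dict]:
    """Select failed items whose recorded error is transient/retryable."""
    return [i for i in items
            if i["status"] == "failed" and i["error"] in RETRYABLE]


batch = [
    {"id": 1, "status": "complete", "error": None},
    {"id": 2, "status": "failed",   "error": "provider_timeout"},
    {"id": 3, "status": "complete", "error": None},
    {"id": 4, "status": "failed",   "error": "invalid_input"},  # permanent failure
]
retry = items_to_retry(batch)
assert [i["id"] for i in retry] == [2]   # item 4 needs a fix, not a retry
```

Item 4 illustrates the distinction: retrying a permanent failure against the restored service would just fail again, so the user-facing message should separate "retry these" from "fix these".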
Recovery Architecture Checklist
CHECKPOINT STORAGE
[ ] Durable storage for job progress (database, not memory/Redis-only)
[ ] Per-item completion tracking for batch operations
[ ] Progress queryable by job ID from any server instance
[ ] Checkpoint writes are atomic (no partial checkpoint)
RETRY LOGIC
[ ] Retry count limit configured (max 3-5 retries)
[ ] Exponential backoff between retries
[ ] Jitter added to prevent thundering herd
[ ] Distinct handling: retryable errors vs permanent failures
[ ] Retry reuses idempotency key (no duplicates)
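The backoff items in this checklist combine into a small function. This sketch uses the "full jitter" variant (delay drawn uniformly from zero up to a capped exponential); the constants are illustrative:

```python
# Capped exponential backoff with full jitter and a hard retry limit.
import random

MAX_RETRIES = 4
BASE_DELAY_S = 1.0
MAX_DELAY_S = 30.0


def backoff_delay(attempt: int) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    cap = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
    return random.uniform(0.0, cap)


# Delays grow exponentially but stay within the cap, and the random spread
# prevents a thundering herd when many jobs retry at once.
for attempt in range(MAX_RETRIES):
    delay = backoff_delay(attempt)
    assert 0.0 <= delay <= min(MAX_DELAY_S, 2.0 ** attempt)
```

Note the interaction with the retry checklist's last item: each retry sleeps per this schedule but reuses the original idempotency key, so a retry that races a slow success cannot duplicate work.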
STATE TRANSITIONS
[ ] All valid states are enumerated in code (enum, not strings)
[ ] Invalid transitions are rejected (e.g., completed -> processing)
[ ] Terminal states: completed, failed, cancelled (no further transitions)
[ ] Timeout detection: jobs in "processing" beyond max_duration are flagged
[ ] "Partial" state exists for batch operations
UI RECOVERY
[ ] Polling or websocket for live progress updates
[ ] Reconnection logic when websocket drops
[ ] Page load fetches current state from server (not cache)
[ ] Clear distinction between "loading state" and "no active job"
[ ] Retry button only appears when retry is safe
State Verification Query Template
-- Find jobs stuck in non-terminal states
SELECT id, status, started_at, updated_at,
       EXTRACT(EPOCH FROM (NOW() - updated_at)) AS seconds_since_update
FROM jobs
WHERE status IN ('processing', 'queued', 'retrying')
  AND updated_at < NOW() - INTERVAL '10 minutes'
ORDER BY updated_at ASC;
-- Verify batch checkpoint accuracy
SELECT batch_id,
       COUNT(*) AS total_items,
       COUNT(*) FILTER (WHERE status = 'complete') AS completed,
       COUNT(*) FILTER (WHERE status = 'failed') AS failed,
       COUNT(*) FILTER (WHERE status = 'processing') AS processing,
       COUNT(*) FILTER (WHERE status = 'pending') AS pending
FROM batch_items
GROUP BY batch_id
HAVING COUNT(*) FILTER (WHERE status = 'processing') > 0
   AND MAX(updated_at) < NOW() - INTERVAL '5 minutes';
-- ^ These are stuck batches: items still "processing" but no updates in 5 minutes
Post-Audit Checklist
[ ] All long-running workflows have checkpoint/bookmark capability
[ ] Browser refresh during any workflow shows accurate progress
[ ] Double-trigger is safely deduplicated
[ ] Batch operations resume from last checkpoint after interruption
[ ] All job states are explicitly defined and reachable
[ ] Stuck job detection exists (processing > max_duration)
[ ] External service failures are isolated to affected items only
[ ] Retry targets only failed items, not the entire batch
[ ] Progress is persisted server-side, not only in client state
[ ] Error messages include enough detail for user to decide on retry
What Earlier Audits Miss
Standard testing verifies that workflows complete successfully. This audit matters because:
- Happy-path tests never interrupt a workflow mid-execution. They miss that progress is stored only in memory.
- Error handling tests verify that errors are caught, not that the system can resume from the error point.
- Retry tests verify the retry mechanism works, not that it is idempotent and does not duplicate work.
- UI tests verify rendering, not that refreshing mid-operation preserves state.
- Load tests verify throughput under sustained load, not behavior when load is interrupted.
In short, a Recovery & Resume Audit specifically tests whether interrupted workflows restart safely -- without data loss, duplication, or inconsistent state -- under browser refresh, network loss, server restart, and worker crash conditions.
Automation Opportunities
| Test | Automatable? | Method |
|---|---|---|
| TEST-RR-001: Refresh mid-workflow | PARTIAL | Selenium: trigger workflow, refresh, assert progress visible |
| TEST-RR-002: Double submit | YES | Concurrent API requests with same payload; assert deduplication |
| TEST-RR-003: Timeout after success | YES | Mock slow response, kill client, verify result persists |
| TEST-RR-004: Batch checkpoint | YES | Start batch, kill worker at item 7, restart, assert resume from 8 |
| TEST-RR-005: State accuracy | YES | Put entities in each state via API; verify against expected |
| TEST-RR-006: Concurrent resume | PARTIAL | Requires simulating network partition; complex test setup |
| TEST-RR-007: Progress persistence | YES | Start workflow, clear session, reopen, assert progress visible |
| TEST-RR-008: External failure | YES | Mock external service errors mid-batch; verify partial success |
# Automated double-submit test (BASE_URL is your API host, e.g. http://localhost:3000)
KEY=$(uuidgen)
curl -sS -X POST "$BASE_URL/api/generate" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $KEY" \
  -d '{"project_id": "test-123"}' &
curl -sS -X POST "$BASE_URL/api/generate" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $KEY" \
  -d '{"project_id": "test-123"}' &
wait
# Assert: only one job was created in the database
JOB_COUNT=$(psql -t -A -c "SELECT COUNT(*) FROM jobs WHERE project_id = 'test-123' AND status != 'cancelled'")
[ "$JOB_COUNT" -eq 1 ] && echo "PASS" || echo "FAIL: $JOB_COUNT jobs created"
Reusable Audit Report Template
# Recovery & Resume Audit Report
## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________
## Long-Running Workflows Identified
| Workflow | Duration | Checkpoint? | Resume? | Idempotent Retry? |
|----------|---------|------------|---------|-------------------|
| ___ | ___s | yes/no | yes/no | yes/no |
## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-RR-001 | Refresh mid-workflow | PASS/FAIL | Progress visible after refresh: yes/no |
| TEST-RR-002 | Double submit | PASS/FAIL | Duplicates created: ___ |
| TEST-RR-003 | Timeout after success | PASS/FAIL | Result preserved: yes/no |
| TEST-RR-004 | Batch checkpoint | PASS/FAIL | Resumed from item: ___ (expected: 8) |
| TEST-RR-005 | State accuracy | PASS/FAIL | ___ states inaccurate |
| TEST-RR-006 | Concurrent resume | PASS/FAIL | Duplicate processing: yes/no |
| TEST-RR-007 | Progress persistence | PASS/FAIL | Server-side progress: yes/no |
| TEST-RR-008 | External failure | PASS/FAIL | Successful items preserved: yes/no |
## Score: PASS / PARTIAL / FAIL
Priority Targeting
Run this audit FIRST if:
- Users report "I refreshed and lost everything"
- Jobs get stuck in "processing" and require manual DB fixes
- Retry creates duplicates
- Batch operations are all-or-nothing (no partial success)
- The system has no job/task status dashboard
- External API calls are unreliable (> 1% failure rate)
Install this skill directly: skilldb add production-audit-skills