Recovery & Resume Audit
Purpose
Verify that interrupted workflows can restart safely without losing progress, duplicating work, or leaving the system in an inconsistent state. This audit targets the most frustrating class of production bugs: the ones where users lose work, see phantom failures, or must manually clean up after interruptions.
Real users close tabs, lose network, refresh impatiently, and click retry on things that already succeeded. The system must handle all of this gracefully.
Scope
| Failure Mode | What We Test |
|---|---|
| Browser refresh mid-workflow | Does progress survive? Does the UI reflect actual state? |
| Network interruption | Does the backend continue? Does the client recover? |
| Double trigger (retry) | Does the system deduplicate or create duplicates? |
| Server restart mid-job | Does the job resume or restart safely? |
| Timeout after success | Client times out but server completed; is the result visible? |
| Partial batch completion | 7/20 items done, process dies; does it resume from item 8? |
| Worker crash mid-processing | Is the job retried? Is partial output cleaned up? |
Risk Pattern Table
| Pattern | What It Hits | Risk | Symptom |
|---|---|---|---|
| No checkpoint/bookmark | Batch jobs | HIGH | Interrupted batch restarts from item 1, reprocessing completed items |
| Client-side only progress | UI, UX | HIGH | Refresh loses all progress indication; user retriggers |
| Fire-and-forget mutations | API, Data | HIGH | Client disconnects; server may or may not have committed |
| Missing terminal states | State machine | HIGH | Jobs stuck in "processing" forever after worker crash |
| Optimistic UI without reconciliation | UI | MEDIUM | UI shows success but server failed; no correction displayed |
| No idempotency on retry | API, Data | HIGH | Retry creates duplicate records, charges, or notifications |
| Transaction spanning external calls | DB, API | HIGH | DB transaction committed but external call failed; inconsistent |
| Missing cleanup on failure | Storage, Data | MEDIUM | Failed job leaves orphaned files, partial records |
| Stale progress cache | UI, API | MEDIUM | Progress bar shows old values after resume |
| No distinction between "never started" and "failed" | State machine | MEDIUM | Cannot tell if job needs retry or hasn't been picked up yet |
Pre-Audit Requirements
Before testing, ensure you can:
1. Identify all long-running workflows (> 5 seconds)
2. Access job/task state storage (DB table, Redis, queue)
3. Simulate network interruption (browser DevTools offline mode)
4. Simulate server-side interruption (kill worker process)
5. Observe job state transitions (logs, DB queries, admin panel)
6. Trigger workflows via API (bypassing UI debounce/guards)
Concrete Test Cases
TEST-RR-001: Refresh Midway Through Workflow
Objective: Verify that refreshing the browser during a long-running operation does not lose progress or create duplicates.
Steps:
- Start a long-running workflow (e.g., generate assets for a project with 20 items).
- Wait until approximately 5 items are complete (visible in UI or logs).
- Hard-refresh the browser (Ctrl+Shift+R / Cmd+Shift+R).
- Observe the page after reload.
Pass Criteria:
- The workflow continues processing on the server (not cancelled by disconnect).
- The UI reflects current progress after reload (shows ~5/20 complete).
- No duplicate items are created.
- The "Generate" button is disabled or shows "In Progress" (not available for re-trigger).
- Completed items are accessible and correct.
Fail Criteria:
- Workflow restarts from item 1.
- UI shows 0/20 or no progress indicator.
- Generate button is active again, inviting a duplicate trigger.
- Duplicate items appear in the output.
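The pass criteria above hinge on one design rule: progress lives server-side, and every page load re-fetches it. A minimal sketch (the `JobStore` class and method names are illustrative, not from any specific framework):

```python
# Server-side progress as the source of truth: a hard refresh discards all
# client state, but re-fetching from the store still shows accurate progress.
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    total: int
    completed: int = 0
    status: str = "processing"


class JobStore:
    """Durable-store stand-in: progress is recorded here, never only in the client."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}

    def create(self, job_id: str, total: int) -> Job:
        job = Job(job_id, total)
        self._jobs[job_id] = job
        return job

    def record_item_done(self, job_id: str) -> None:
        job = self._jobs[job_id]
        job.completed += 1
        if job.completed == job.total:
            job.status = "completed"

    def get_progress(self, job_id: str) -> dict:
        """What the page fetches on every load -- including after a refresh."""
        job = self._jobs[job_id]
        return {"completed": job.completed, "total": job.total, "status": job.status}


store = JobStore()
store.create("job-1", total=20)
for _ in range(5):
    store.record_item_done("job-1")

# Simulated hard refresh: the client keeps nothing; the fetch still shows 5/20.
print(store.get_progress("job-1"))  # {'completed': 5, 'total': 20, 'status': 'processing'}
```

A UI built this way passes TEST-RR-001 by construction: the reload path and the live-update path read the same server-side record.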
TEST-RR-002: Trigger Workflow Twice (Double Submit)
Objective: Verify that triggering the same workflow twice does not create duplicate work.
Steps:
- Start a generation/processing workflow.
- Within 2 seconds, trigger it again (via API, double-click, or second tab).
- Wait for completion.
- Inspect results.
Pass Criteria:
- Second trigger is rejected with clear message ("Already in progress").
- OR second trigger is deduplicated silently (same job ID returned).
- Only one set of results exists.
- Only one set of external API calls was made (check provider logs/billing).
- Job state is clean (not two overlapping "processing" entries).
Fail Criteria:
- Two parallel jobs execute for the same work.
- Duplicate results created.
- Double billing on external provider.
- Race condition: both jobs write to same output, corrupting it.
Implementation Check:
[ ] Mutex / lock on workflow trigger (DB row lock, Redis lock, etc.)
[ ] Idempotency key on create endpoints
[ ] UI disables trigger button on click (optimistic)
[ ] Server checks for existing in-progress job before creating new one
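The last two checklist items can be combined: the server holds a lock while checking for an existing job under the same idempotency key, and returns that job instead of creating a second one. A minimal in-process sketch (a real system would use a DB row lock or Redis lock instead of `threading.Lock`; all names here are illustrative):

```python
# Dedup sketch: a repeated trigger with the same idempotency key returns the
# existing job id rather than creating a duplicate.
import threading
import uuid


class JobService:
    def __init__(self) -> None:
        self._lock = threading.Lock()      # stand-in for a DB row lock / Redis lock
        self._by_key: dict[str, str] = {}  # idempotency key -> job id

    def trigger(self, idempotency_key: str) -> tuple[str, bool]:
        """Returns (job_id, created). A second call with the same key dedups."""
        with self._lock:
            if idempotency_key in self._by_key:
                return self._by_key[idempotency_key], False
            job_id = str(uuid.uuid4())
            self._by_key[idempotency_key] = job_id
            return job_id, True


svc = JobService()
first, created_a = svc.trigger("key-123")
second, created_b = svc.trigger("key-123")  # double submit
assert first == second and created_a and not created_b
```

Returning the same job id silently (rather than erroring) matches the second pass criterion above and makes client-side retries safe by default.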
TEST-RR-003: Force Timeout After Server-Side Success
Objective: Verify that results are preserved when the client times out but the server completed successfully.
Steps:
- Start a workflow.
- Simulate client timeout: either kill the browser tab, go offline (DevTools), or set a very short client-side timeout.
- Wait for the server to complete the job (monitor via logs or DB).
- Re-open the page / re-request the resource.
Pass Criteria:
- The completed result is visible and correct.
- Job state shows "completed" (not "failed" or "timeout").
- No error is displayed to the user on revisit.
- The result is identical to a non-interrupted run.
Fail Criteria:
- Job marked as "failed" because client disconnected.
- Result exists but UI shows "error" or "unknown state".
- Webhook/callback was sent to disconnected client and lost; no retry.
TEST-RR-004: Interrupt Batch at Item 7/20, Verify Resume from Item 8
Objective: Verify that batch processing supports checkpointing and resumes from the last successful item.
Steps:
- Start a batch operation on exactly 20 items.
- Monitor progress until item 7 completes.
- Kill the worker process (simulate crash).
- Restart the worker.
- Observe what happens to the batch.
Pass Criteria:
- Items 1-7 are marked as complete and their outputs are preserved.
- Processing resumes from item 8 (not item 1).
- Items 1-7 are NOT reprocessed (no duplicate external calls).
- Total result after completion contains all 20 items, each processed exactly once.
- Batch state accurately reflects: 7 complete, 13 remaining (then progresses to 20 complete).
Fail Criteria:
- Batch restarts from item 1, reprocessing all 20 items.
- Items 1-7 results are lost; batch shows 0/20.
- Batch is stuck in "processing" with no auto-recovery.
- Item 7 is partially written / corrupted.
Checkpoint Implementation Check:
[ ] Each item completion is individually recorded (not just batch-level)
[ ] Checkpoint stored in durable storage (DB, not just memory)
[ ] Resume logic: SELECT items WHERE batch_id = ? AND status != 'complete'
[ ] Item-level status: pending | processing | complete | failed
[ ] Batch-level rollup recalculated from item statuses
TEST-RR-005: State Accuracy After Various Interruptions
Objective: Verify that job/workflow states accurately reflect reality after interruptions.
Steps: For each state in the system, verify it is reachable and accurate:
| State | How to Reach | Verification |
|---|---|---|
| queued | Submit job, check before worker picks up | Job exists in queue, no output yet |
| processing | Check during active processing | Worker is actively processing, progress updating |
| completed | Let job finish normally | All outputs exist, all items successful |
| failed | Trigger known failure (bad input, provider down) | Error recorded, partial output cleaned or marked |
| partial | Kill worker mid-batch | Some items complete, others pending/failed |
| retrying | Fail once, observe retry | Retry count incremented, previous attempt recorded |
| cancelled | Cancel during processing | No further processing occurs, partial output accessible |
| timeout | Exceed time limit | Distinguished from "processing"; retry eligible |
| stuck | Should NOT exist as valid state | Jobs in "processing" for > 2x expected duration flagged |
Pass Criteria:
- Every state in the table above is explicitly defined in the codebase.
- Every state is reachable via a real scenario (not just theoretical).
- No job can be in "processing" for longer than max_duration without being flagged.
- "Partial" state exists and is distinct from "failed" and "completed".
- UI accurately reflects each state with clear messaging.
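Explicitly enumerating states and their allowed transitions is what makes the criteria above checkable. A simplified sketch covering a subset of the states in the table (a real system would also enforce this at the DB layer, e.g. with conditional UPDATEs; the names are illustrative):

```python
# State machine sketch: states are an enum, transitions are an explicit map,
# and terminal states have no outgoing edges.
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Terminal states (completed, failed, cancelled) map to the empty set.
TRANSITIONS: dict[JobState, set[JobState]] = {
    JobState.QUEUED: {JobState.PROCESSING, JobState.CANCELLED},
    JobState.PROCESSING: {JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
    JobState.CANCELLED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Reject any transition not in the map -- e.g. completed -> processing."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.value} -> {target.value}")
    return target


state = transition(JobState.QUEUED, JobState.PROCESSING)
state = transition(state, JobState.COMPLETED)
try:
    transition(state, JobState.PROCESSING)  # rejected: completed is terminal
except ValueError:
    pass
```

Extending this with `partial`, `retrying`, and `timeout` is mechanical once the transition map exists; "stuck" correctly never appears as a state, only as a monitoring condition.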
TEST-RR-006: Concurrent Resume After Network Partition
Objective: Verify that a network partition does not cause two workers to process the same job.
Steps:
- Start a job on Worker A.
- Simulate network partition (Worker A loses DB connectivity but keeps processing).
- Job visibility timeout expires; Worker B picks up the "abandoned" job.
- Worker A reconnects and attempts to write results.
Pass Criteria:
- Only one worker's results are committed.
- The other worker detects the conflict and backs off.
- No duplicate outputs.
- Job state is consistent (not marked complete by both).
Implementation Check:
[ ] Job locking with lease/heartbeat (not permanent lock)
[ ] Lease timeout shorter than job timeout
[ ] Write-time version check (optimistic concurrency)
[ ] Worker checks lock ownership before writing results
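The write-time version check in the list above is what saves you when the partitioned worker reconnects. A minimal compare-and-swap sketch (in a real system the version bump and the conditional write would be single atomic DB statements; all names are illustrative):

```python
# Optimistic-concurrency sketch: claiming a job bumps its version, so a stale
# worker's result write is rejected after another worker takes over.
class JobRecord:
    def __init__(self) -> None:
        self.version = 0
        self.owner: str | None = None
        self.result: str | None = None

    def claim(self, worker: str) -> int:
        """Take the lease; bumping the version invalidates stale holders."""
        self.version += 1
        self.owner = worker
        return self.version

    def write_result(self, worker: str, seen_version: int, result: str) -> bool:
        """Commit only if no one re-claimed the job since this worker's claim."""
        if seen_version != self.version or self.owner != worker:
            return False              # stale worker detects the conflict, backs off
        self.result = result
        return True


job = JobRecord()
v_a = job.claim("worker-a")           # A starts processing, then gets partitioned
v_b = job.claim("worker-b")           # lease expires; B takes over
assert job.write_result("worker-b", v_b, "done-by-b")
assert not job.write_result("worker-a", v_a, "done-by-a")  # A's late write rejected
assert job.result == "done-by-b"
```

This is why the lease must expire (heartbeat-based) rather than be permanent: expiry lets Worker B take over, and the version check makes that takeover safe.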
TEST-RR-007: Progress Persistence Across Sessions
Objective: Verify that progress information survives session changes.
Steps:
- Start a long workflow (20+ items).
- Close the browser entirely.
- Open a new browser session and navigate to the same page.
- Check if progress is visible and accurate.
- Wait for completion. Verify results are accessible.
Pass Criteria:
- Progress is stored server-side (not just in browser state/localStorage).
- New session shows current progress without re-triggering.
- Completed items are immediately accessible.
- No "stale" progress (showing old numbers from previous session).
TEST-RR-008: Graceful Degradation on External Service Failure
Objective: Verify that external service failures during a workflow result in clear, recoverable state.
Steps:
- Start a workflow that depends on an external service (AI provider, storage, email).
- Mid-workflow, simulate the external service going down (mock 500 errors).
- Observe behavior: does the workflow retry, fail gracefully, or hang?
- Restore the external service.
- Retry or resume the workflow.
Pass Criteria:
- Items that failed due to external service are marked with specific error (not generic "Unknown error").
- Items that succeeded before the outage are preserved.
- Retry targets only the failed items.
- External service errors are logged with response details.
- User sees actionable message: "3 items failed due to provider timeout. Retry?"
Fail Criteria:
- Entire batch marked as failed, losing successful items.
- Generic error with no diagnostic information.
- Infinite retry loop against down service.
- No way to retry just the failed items.
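Targeting only the failed items requires two things the fail criteria call out: per-item error recording and a distinction between retryable and permanent errors. A sketch (the error taxonomy here is illustrative, not a standard):

```python
# Partial-retry sketch: only 'failed' items with a retryable error class are
# selected for re-processing; successful items are never touched.
RETRYABLE = {"provider_timeout", "http_500", "rate_limited"}


def items_to_retry(items: list[dict]) -> list[dict]:
    """Select failed items whose recorded error is transient/retryable."""
    return [i for i in items
            if i["status"] == "failed" and i["error"] in RETRYABLE]


batch = [
    {"id": 1, "status": "complete", "error": None},
    {"id": 2, "status": "failed",   "error": "provider_timeout"},
    {"id": 3, "status": "complete", "error": None},
    {"id": 4, "status": "failed",   "error": "invalid_input"},  # permanent failure
]
retry = items_to_retry(batch)
assert [i["id"] for i in retry] == [2]   # item 4 needs a fix, not a retry
```

Item 4 illustrates the distinction: retrying a permanent failure against the restored service would just fail again, so the user-facing message should separate "retry these" from "fix these".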
Recovery Architecture Checklist
CHECKPOINT STORAGE
[ ] Durable storage for job progress (database, not memory/Redis-only)
[ ] Per-item completion tracking for batch operations
[ ] Progress queryable by job ID from any server instance
[ ] Checkpoint writes are atomic (no partial checkpoint)
RETRY LOGIC
[ ] Retry count limit configured (max 3-5 retries)
[ ] Exponential backoff between retries
[ ] Jitter added to prevent thundering herd
[ ] Distinct handling: retryable errors vs permanent failures
[ ] Retry reuses idempotency key (no duplicates)
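The backoff items in this checklist combine into a small function. This sketch uses the "full jitter" variant (delay drawn uniformly from zero up to a capped exponential); the constants are illustrative:

```python
# Capped exponential backoff with full jitter and a hard retry limit.
import random

MAX_RETRIES = 4
BASE_DELAY_S = 1.0
MAX_DELAY_S = 30.0


def backoff_delay(attempt: int) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    cap = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
    return random.uniform(0.0, cap)


# Delays grow exponentially but stay within the cap, and the random spread
# prevents a thundering herd when many jobs retry at once.
for attempt in range(MAX_RETRIES):
    delay = backoff_delay(attempt)
    assert 0.0 <= delay <= min(MAX_DELAY_S, 2.0 ** attempt)
```

Note the interaction with the retry checklist's last item: each retry sleeps per this schedule but reuses the original idempotency key, so a retry that races a slow success cannot duplicate work.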
STATE TRANSITIONS
[ ] All valid states are enumerated in code (enum, not strings)
[ ] Invalid transitions are rejected (e.g., completed -> processing)
[ ] Terminal states: completed, failed, cancelled (no further transitions)
[ ] Timeout detection: jobs in "processing" beyond max_duration are flagged
[ ] "Partial" state exists for batch operations
UI RECOVERY
[ ] Polling or websocket for live progress updates
[ ] Reconnection logic when websocket drops
[ ] Page load fetches current state from server (not cache)
[ ] Clear distinction between "loading state" and "no active job"
[ ] Retry button only appears when retry is safe
State Verification Query Template
-- Find jobs stuck in non-terminal states
SELECT id, status, started_at, updated_at,
       EXTRACT(EPOCH FROM (NOW() - updated_at)) AS seconds_since_update
FROM jobs
WHERE status IN ('processing', 'queued', 'retrying')
  AND updated_at < NOW() - INTERVAL '10 minutes'
ORDER BY updated_at ASC;
-- Verify batch checkpoint accuracy
SELECT batch_id,
       COUNT(*) AS total_items,
       COUNT(*) FILTER (WHERE status = 'complete') AS completed,
       COUNT(*) FILTER (WHERE status = 'failed') AS failed,
       COUNT(*) FILTER (WHERE status = 'processing') AS processing,
       COUNT(*) FILTER (WHERE status = 'pending') AS pending
FROM batch_items
GROUP BY batch_id
HAVING COUNT(*) FILTER (WHERE status = 'processing') > 0
   AND MAX(updated_at) < NOW() - INTERVAL '5 minutes';
-- ^ These are stuck batches: items still "processing" but no updates in 5 minutes
Post-Audit Checklist
[ ] All long-running workflows have checkpoint/bookmark capability
[ ] Browser refresh during any workflow shows accurate progress
[ ] Double-trigger is safely deduplicated
[ ] Batch operations resume from last checkpoint after interruption
[ ] All job states are explicitly defined and reachable
[ ] Stuck job detection exists (processing > max_duration)
[ ] External service failures are isolated to affected items only
[ ] Retry targets only failed items, not the entire batch
[ ] Progress is persisted server-side, not only in client state
[ ] Error messages include enough detail for user to decide on retry
What Earlier Audits Miss
Standard testing verifies that workflows complete successfully. This audit matters because:
- Happy-path tests never interrupt a workflow mid-execution. They miss that progress is stored only in memory.
- Error handling tests verify that errors are caught, not that the system can resume from the error point.
- Retry tests verify the retry mechanism works, not that it is idempotent and does not duplicate work.
- UI tests verify rendering, not that refreshing mid-operation preserves state.
- Load tests verify throughput under sustained load, not behavior when load is interrupted.
In short, a Recovery & Resume Audit specifically tests whether interrupted workflows restart safely -- without data loss, duplication, or inconsistent state -- under browser refresh, network loss, server restart, and worker crash conditions.
Automation Opportunities
| Test | Automatable? | Method |
|---|---|---|
| TEST-RR-001: Refresh mid-workflow | PARTIAL | Selenium: trigger workflow, refresh, assert progress visible |
| TEST-RR-002: Double submit | YES | Concurrent API requests with same payload; assert deduplication |
| TEST-RR-003: Timeout after success | YES | Mock slow response, kill client, verify result persists |
| TEST-RR-004: Batch checkpoint | YES | Start batch, kill worker at item 7, restart, assert resume from 8 |
| TEST-RR-005: State accuracy | YES | Put entities in each state via API; verify against expected |
| TEST-RR-006: Concurrent resume | PARTIAL | Requires simulating network partition; complex test setup |
| TEST-RR-007: Progress persistence | YES | Start workflow, clear session, reopen, assert progress visible |
| TEST-RR-008: External failure | YES | Mock external service errors mid-batch; verify partial success |
# Automated double-submit test (BASE_URL is your API host, e.g. http://localhost:3000)
KEY=$(uuidgen)
curl -sS -X POST "$BASE_URL/api/generate" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $KEY" \
  -d '{"project_id": "test-123"}' &
curl -sS -X POST "$BASE_URL/api/generate" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $KEY" \
  -d '{"project_id": "test-123"}' &
wait
# Assert: only one job was created in the database
JOB_COUNT=$(psql -t -A -c "SELECT COUNT(*) FROM jobs WHERE project_id = 'test-123' AND status != 'cancelled'")
[ "$JOB_COUNT" -eq 1 ] && echo "PASS" || echo "FAIL: $JOB_COUNT jobs created"
Reusable Audit Report Template
# Recovery & Resume Audit Report
## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________
## Long-Running Workflows Identified
| Workflow | Duration | Checkpoint? | Resume? | Idempotent Retry? |
|----------|---------|------------|---------|-------------------|
| ___ | ___s | yes/no | yes/no | yes/no |
## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-RR-001 | Refresh mid-workflow | PASS/FAIL | Progress visible after refresh: yes/no |
| TEST-RR-002 | Double submit | PASS/FAIL | Duplicates created: ___ |
| TEST-RR-003 | Timeout after success | PASS/FAIL | Result preserved: yes/no |
| TEST-RR-004 | Batch checkpoint | PASS/FAIL | Resumed from item: ___ (expected: 8) |
| TEST-RR-005 | State accuracy | PASS/FAIL | ___ states inaccurate |
| TEST-RR-006 | Concurrent resume | PASS/FAIL | Duplicate processing: yes/no |
| TEST-RR-007 | Progress persistence | PASS/FAIL | Server-side progress: yes/no |
| TEST-RR-008 | External failure | PASS/FAIL | Successful items preserved: yes/no |
## Score: PASS / PARTIAL / FAIL
Priority Targeting
Run this audit FIRST if:
- Users report "I refreshed and lost everything"
- Jobs get stuck in "processing" and require manual DB fixes
- Retry creates duplicates
- Batch operations are all-or-nothing (no partial success)
- The system has no job/task status dashboard
- External API calls are unreliable (> 1% failure rate)
Install this skill directly: skilldb add production-audit-skills