Reliability & Resilience Audit

Purpose

This is the most comprehensive audit in the production audit pack. It tests long-running, distributed workflows under real-world conditions: timeouts, partial failures, retries, queue issues, state corruption, and observability gaps. This audit encompasses and extends all other audits in the pack, providing a unified methodology for verifying production reliability.

A system is reliable when it produces correct results even when components fail. A system is resilient when it recovers from failures without human intervention. This audit verifies both.

The 10 Audit Tasks

This audit follows a structured 10-task methodology that covers every reliability concern in a distributed system:

1. Map every long-running workflow
2. Identify high-risk steps
3. Test timeout behavior
4. Test partial completion
5. Test retry safety
6. Test resume behavior
7. Test sequential loop safeguards
8. Test queue resilience
9. Test state accuracy
10. Test observability

Task 1: Map Every Long-Running Workflow

Objective: Create a complete inventory of every workflow that takes more than 5 seconds or involves multiple steps.

Methodology

1. List every user-triggerable operation that involves background processing.
2. For each, document the complete step sequence.
3. Identify external dependencies at each step.

Workflow Inventory Template

WORKFLOW: Asset Generation Pipeline
  Trigger: User clicks "Generate" on project
  Expected Duration: 30s - 5min (depending on asset count)
  Steps:
    1. Validate input (< 1s) [Internal]
    2. Check user quota (< 1s) [Database]
    3. Create job record (< 1s) [Database]
    4. Enqueue generation tasks (< 1s) [Queue]
    5. For each asset:
       5a. Call AI provider (5-30s) [External: OpenAI/Anthropic]
       5b. Upload result to storage (1-5s) [External: GCS/S3]
       5c. Generate thumbnail (1-3s) [Internal/External]
       5d. Update asset record (< 1s) [Database]
    6. Update job status (< 1s) [Database]
    7. Notify user (< 1s) [External: Email/Push]

  Total steps: 7 (+ N * 4 for N assets)
  External dependencies: AI Provider, Cloud Storage, Notification Service
  Failure modes: Provider timeout, storage failure, quota exceeded, queue down

WORKFLOW: Project Export
  Trigger: User clicks "Export Project"
  Expected Duration: 10s - 2min
  Steps:
    1. Validate permissions (< 1s) [Internal]
    2. Gather project metadata (< 1s) [Database]
    3. Download all asset files (1-60s) [External: Storage]
    4. Package into archive (1-30s) [Internal: Compute]
    5. Upload archive to temp storage (1-10s) [External: Storage]
    6. Generate download URL (< 1s) [Internal]
    7. Notify user (< 1s) [External: Email/Push]

WORKFLOW: [Document each workflow in your system]
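
The inventory can also be kept as data so that step counts and external dependencies are derived rather than hand-maintained. The following is a minimal sketch, not a prescribed format: the `Step` and `Workflow` classes and their field names are hypothetical, and the example entries simply mirror the Asset Generation Pipeline above.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    dependency: str          # e.g. "Internal", "Database", "External: Storage"
    per_item: bool = False   # True for sub-steps repeated once per asset

    @property
    def is_external(self) -> bool:
        return self.dependency.startswith("External")


@dataclass
class Workflow:
    name: str
    trigger: str
    steps: list[Step] = field(default_factory=list)

    def step_executions(self, item_count: int) -> int:
        """Number of step executions for a run over `item_count` items."""
        fixed = sum(1 for s in self.steps if not s.per_item)
        repeated = sum(1 for s in self.steps if s.per_item)
        return fixed + repeated * item_count

    def external_steps(self) -> list[str]:
        """Names of steps that depend on an external service."""
        return [s.name for s in self.steps if s.is_external]


asset_generation = Workflow(
    name="Asset Generation Pipeline",
    trigger='User clicks "Generate" on project',
    steps=[
        Step("Validate input", "Internal"),
        Step("Check user quota", "Database"),
        Step("Create job record", "Database"),
        Step("Enqueue generation tasks", "Queue"),
        Step("Call AI provider", "External: AI provider", per_item=True),
        Step("Upload result to storage", "External: Storage", per_item=True),
        Step("Generate thumbnail", "Internal/External", per_item=True),
        Step("Update asset record", "Database", per_item=True),
        Step("Update job status", "Database"),
        Step("Notify user", "External: Email/Push"),
    ],
)
```

Here `step_executions` counts only the steps that do work (six fixed steps plus four sub-steps per asset), which is the number of places a run can fail.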

Completeness Check

[ ] All user-triggered background operations listed
[ ] All scheduled/cron jobs listed
[ ] All webhook-triggered workflows listed
[ ] All admin-triggered operations listed
[ ] Each workflow has: step sequence, timing, dependencies
[ ] External dependencies highlighted for each step

Task 2: Identify High-Risk Steps

Objective: For each workflow, identify the steps most likely to fail and with highest impact.

Risk Assessment Matrix

| Step | Failure Probability | Impact if Fails | Detection Difficulty | Risk Score |
|------|---------------------|-----------------|----------------------|------------|
| AI provider call | HIGH (timeout, rate limit, content filter) | HIGH (no output) | LOW (error returned) | CRITICAL |
| Storage upload | MEDIUM (transient 503) | HIGH (output lost) | LOW (error returned) | HIGH |
| Database write | LOW (connection issues) | HIGH (state lost) | LOW (error returned) | MEDIUM |
| Queue enqueue | LOW (queue down) | HIGH (job lost) | MEDIUM (may not error) | HIGH |
| Notification send | MEDIUM (rate limit, bounce) | LOW (not blocking) | HIGH (fire-and-forget) | MEDIUM |
| Thumbnail generation | MEDIUM (OOM on large files) | LOW (degraded UX) | MEDIUM (may crash silently) | MEDIUM |

High-Risk Step Indicators

A step is HIGH-RISK if:
[ ] It calls an external service (network dependency)
[ ] It takes more than 10 seconds (timeout vulnerability)
[ ] It modifies state that is hard to undo (irreversible)
[ ] It processes user-provided input (unpredictable size/format)
[ ] It runs in a loop (N items = N chances to fail)
[ ] Its failure is hard to detect (fire-and-forget, no response check)
[ ] It has no retry mechanism
[ ] It has no timeout configured
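
The indicator checklist above can be applied mechanically once each step is described as data. This is a sketch under assumptions: `StepProfile` and its field names are hypothetical, and the function only reports which indicators apply; it does not compute the Risk Score column of the matrix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepProfile:
    name: str
    calls_external_service: bool = False
    expected_seconds: float = 0.0
    irreversible: bool = False
    handles_user_input: bool = False
    runs_in_loop: bool = False
    fire_and_forget: bool = False
    max_retries: int = 0
    timeout_seconds: Optional[float] = None


def risk_indicators(step: StepProfile) -> list[str]:
    """Return the high-risk indicators from the checklist that apply to a step."""
    checks = [
        (step.calls_external_service, "calls an external service"),
        (step.expected_seconds > 10, "takes more than 10 seconds"),
        (step.irreversible, "modifies state that is hard to undo"),
        (step.handles_user_input, "processes user-provided input"),
        (step.runs_in_loop, "runs in a loop"),
        (step.fire_and_forget, "failure is hard to detect"),
        (step.max_retries == 0, "has no retry mechanism"),
        (step.timeout_seconds is None, "has no timeout configured"),
    ]
    return [label for applies, label in checks if applies]
```

A step with several indicators is a natural candidate for the timeout, retry, and isolation tests in the following tasks.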

Task 3: Test Timeout Behavior

Objective: Verify that every long-running step has appropriate timeout configuration and handles timeout gracefully.

Concrete Tests

TEST-RR-T01: External API Call Timeout

Steps:

Configure a mock external service that delays 120 seconds.
Trigger the workflow.
Observe: does the call timeout? After how long?
What happens after timeout?

Pass Criteria:

Timeout is configured (not waiting indefinitely).
Timeout is appropriate for the call type (e.g., 30s for API calls, 300s for heavy processing).
On timeout: error is logged with specific "timeout" classification.
On timeout: job moves to retry or failed state (not stuck in "processing").
On timeout: resources are cleaned up (connections released, temp files removed).
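
One way to automate this test is to point the client at a deliberately slow local server and assert on how the timeout is classified. The sketch below uses only the Python standard library; `SlowHandler`, `start_slow_server`, and `call_provider` are hypothetical names, and the delay is shortened from 120 seconds so the test finishes quickly.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Mock external service that answers only after `delay_seconds`."""

    delay_seconds = 1.0

    def do_GET(self):
        time.sleep(self.delay_seconds)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"late response")
        except OSError:
            pass  # the client has already timed out and closed the socket

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_slow_server() -> HTTPServer:
    server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_provider(url: str, timeout_s: float) -> dict:
    """Call the service and classify failures instead of hanging."""
    started = time.monotonic()
    try:
        # The `with` block releases the connection on every exit path.
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return {"status": "ok", "body": resp.read()}
    except socket.timeout:
        error_class = "timeout"  # read timed out while waiting for the response
    except urllib.error.URLError as exc:
        timed_out = isinstance(exc.reason, socket.timeout)
        error_class = "timeout" if timed_out else "connection"
    except OSError:
        error_class = "connection"
    elapsed = time.monotonic() - started
    return {"status": "failed", "error_class": error_class, "elapsed_s": elapsed}
```

The returned `error_class` is what the job handler would log and use to decide between the retry and failed states.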

TEST-RR-T02: Job-Level Timeout

Steps:

Start a job that will process 20 items.
Make one item hang indefinitely (mock a non-responding dependency).
Observe: does the overall job timeout?

Pass Criteria:

Job-level timeout exists (e.g., 30 minutes maximum).
Timeout is independent of individual step timeouts.
On job timeout: completed items are preserved.
On job timeout: job state changes to "timeout" or "failed".
On job timeout: alert fires.
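
A job-level timeout is usually enforced by something outside the worker, since a hung worker cannot time itself out. The sketch below shows one possible watchdog check, assuming a hypothetical `Job` record with a `started_at` timestamp and per-item statuses; the 30-minute limit is the example value from the criteria above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

MAX_JOB_DURATION = timedelta(minutes=30)


@dataclass
class Job:
    id: str
    status: str
    started_at: datetime
    item_status: dict = field(default_factory=dict)  # item id -> status


def enforce_job_timeout(job: Job, now: datetime, alert) -> bool:
    """Mark an overdue job as timed out; return True if it was timed out."""
    if job.status != "processing":
        return False
    if now - job.started_at <= MAX_JOB_DURATION:
        return False
    job.status = "timeout"
    for item_id, status in job.item_status.items():
        # Completed items are preserved; only in-flight items are failed.
        if status == "processing":
            job.item_status[item_id] = "failed"
    alert(f"Job {job.id} exceeded {MAX_JOB_DURATION} and was timed out")
    return True
```

Because the check depends only on the job record and the current time, it works regardless of which step is hanging or what that step's own timeout is.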

Timeout Configuration Audit

| Component | Timeout | Default | Configured | Appropriate? |
|-----------|---------|---------|-----------|-------------|
| HTTP client (outgoing) | Connect | 5s | ? | |
| HTTP client (outgoing) | Read | 30s | ? | |
| HTTP server (incoming) | Request | 60s | ? | |
| Database query | Statement | 30s | ? | |
| Database connection | Acquire | 5s | ? | |
| Queue job | Processing | 300s | ? | |
| Queue job | Visibility | 60s | ? | |
| Worker | Heartbeat | 30s | ? | |
| Storage upload | Per-file | 120s | ? | |
| Webhook delivery | Per-attempt | 10s | ? | |

Task 4: Test Partial Completion

Objective: Verify that partial completion is handled correctly when a batch fails partway through.

Concrete Tests

TEST-RR-P01: Batch Fails at Item 7 of 20

Steps:

Start a batch of 20 items.
Cause item 7 to fail (bad input for that specific item).
Observe: do items 8-20 still process?

Pass Criteria:

Items 1-6: completed successfully, results preserved.
Item 7: marked as failed with specific error.
Items 8-20: continue processing (not blocked by item 7).
Batch status: "partial" (not "failed" for entire batch).
User can see: 19/20 succeeded, 1/20 failed.
User can retry just item 7.
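
The criteria above amount to catching errors per item and deriving the batch status from the item results. A minimal sketch, assuming hashable item identifiers and a hypothetical `process_item` callable supplied by the caller:

```python
def process_batch(items, process_item):
    """Process every item, isolating failures, and derive the batch status."""
    results = {}
    for item in items:
        try:
            results[item] = {"status": "completed", "output": process_item(item)}
        except Exception as exc:
            # Record the specific error and keep going with the next item.
            results[item] = {"status": "failed", "error": f"{type(exc).__name__}: {exc}"}

    failed = [i for i, r in results.items() if r["status"] == "failed"]
    if not failed:
        status = "completed"
    elif len(failed) == len(results):
        status = "failed"
    else:
        status = "partial"
    return {
        "status": status,
        "succeeded": len(results) - len(failed),
        "failed": failed,
        "results": results,
    }
```

The `failed` list is what a "retry just this item" action would operate on.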

TEST-RR-P02: Batch Fails at Item 7 Due to Systemic Issue

Steps:

Start a batch of 20 items.
After item 6, kill the external API (all remaining items will fail).
Observe: does the batch keep trying all 14 remaining items?

Pass Criteria:

Circuit breaker trips after 3-5 consecutive failures.
Remaining items are NOT attempted (fast-fail).
Items 1-6: completed, preserved.
Items 7-9: failed (attempted before circuit break).
Items 10-20: skipped / queued for retry (not attempted during outage).
User notification: "Generation paused: AI provider is unavailable. 6/20 complete. Will retry automatically."
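
The fast-fail behavior can be sketched as a consecutive-failure counter that stops the batch once a threshold is reached. This is a simplified illustration rather than a full circuit breaker (there is no half-open state or automatic reset); `process_item` is a hypothetical callable and the threshold of 3 is taken from the low end of the range above.

```python
def run_with_circuit_breaker(items, process_item, failure_threshold=3):
    """Process items in order; stop attempting after N consecutive failures."""
    outcome = {"completed": [], "failed": [], "skipped": []}
    consecutive_failures = 0
    for index, item in enumerate(items):
        if consecutive_failures >= failure_threshold:
            # Breaker is open: do not call the provider for the remaining items.
            outcome["skipped"].extend(items[index:])
            break
        try:
            process_item(item)
        except Exception:
            consecutive_failures += 1
            outcome["failed"].append(item)
        else:
            consecutive_failures = 0  # any success closes the streak
            outcome["completed"].append(item)
    return outcome
```

Skipped items were never attempted, so they can be requeued for automatic retry without counting against their retry limit.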

Task 5: Test Retry Safety

Objective: Verify that retrying failed operations is safe and produces correct results.

Concrete Tests

TEST-RR-R01: Retry Single Failed Item

Steps:

Complete a batch with 1 failed item.
Click "Retry" on the failed item.
Verify: only the failed item is reprocessed.

Pass Criteria:

Only the failed item is sent to the external API.
Successfully completed items are NOT reprocessed.
Retry uses the same parameters as the original attempt.
On success: item moves to "completed", batch to "completed".
On failure: retry count incremented, error updated.
Max retry limit enforced (no infinite retry).

TEST-RR-R02: Retry Entire Batch

Steps:

Complete a batch with 5 failed items.
Click "Retry All Failed."
Verify: only 5 items are reprocessed.

Pass Criteria:

Exactly 5 API calls made (not 20).
Completed items untouched.
Each retried item has retry_count incremented.
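
Selective retry follows from per-item status tracking: only records marked failed are reprocessed, with their original parameters. The sketch below assumes a hypothetical in-memory record shape (`status`, `params`, `retry_count`, `error`) and a retry limit of 3; a real system would read and write these fields in its database.

```python
MAX_RETRIES = 3


def retry_failed(items: dict, process_item) -> int:
    """Retry only failed items that are under the retry limit.

    Returns the number of calls made to `process_item`.
    """
    calls = 0
    for record in items.values():
        if record["status"] != "failed":
            continue  # completed items are never reprocessed
        if record["retry_count"] >= MAX_RETRIES:
            continue  # retry limit reached; leave for manual review
        record["retry_count"] += 1
        calls += 1
        try:
            record["output"] = process_item(record["params"])  # same parameters
            record["status"] = "completed"
            record["error"] = None
        except Exception as exc:
            record["error"] = str(exc)  # stays "failed" with the latest error
    return calls
```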

TEST-RR-R03: Retry Idempotency

Steps:

Retry a failed item.
Retry the same item again before the first retry completes.

Pass Criteria:

Second retry is blocked ("Retry already in progress").
Only one retry executes.
No duplicate API calls.
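
Blocking a concurrent second retry requires an atomic "claim" on the item. The sketch below shows the idea within a single process using a lock-protected set; `RetryGuard` and `retry_item` are hypothetical names. Across multiple workers the same claim would need to live in shared storage, for example a conditional status update in the database.

```python
import threading


class RetryGuard:
    """Tracks which items currently have a retry in flight."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_progress = set()

    def try_begin(self, item_id) -> bool:
        """Atomically claim the item; False if a retry is already running."""
        with self._lock:
            if item_id in self._in_progress:
                return False
            self._in_progress.add(item_id)
            return True

    def finish(self, item_id) -> None:
        with self._lock:
            self._in_progress.discard(item_id)


def retry_item(guard: RetryGuard, item_id, process_item) -> str:
    if not guard.try_begin(item_id):
        return "Retry already in progress"
    try:
        process_item(item_id)
        return "completed"
    finally:
        guard.finish(item_id)  # release the claim even if processing fails
```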

Task 6: Test Resume Behavior

Objective: Verify that interrupted workflows resume from the correct point.

Concrete Tests

TEST-RR-RE01: Worker Crash Mid-Batch

Steps:

Start a 20-item batch.
Wait until 7 items complete.
Kill the worker process.
Start a new worker.
Observe: does the batch resume from item 8?

Pass Criteria:

Items 1-7 preserved (not reprocessed).
Processing resumes from item 8.
Total external API calls = 20 (not 20 + 7 duplicates).
Final result: all 20 items completed.
No user intervention required.
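
Resuming from the right point depends on a durable record of which items are done. The following sketch keeps that record in a JSON file and replaces it atomically after each item; the file-based checkpoint and the function names are assumptions for illustration, and a production system would more likely track per-item completion in its database.

```python
import json
import os
import tempfile


def load_done(path: str) -> set:
    """Read the set of completed item ids, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return set(json.load(fh))


def save_done(path: str, done: set) -> None:
    """Write the checkpoint to a temp file, then atomically replace the old one."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp_path, path)


def run_batch(items, process_item, checkpoint_path: str) -> set:
    """Process items, skipping any that a previous run already completed."""
    done = load_done(checkpoint_path)
    for item in items:
        if item in done:
            continue
        process_item(item)
        done.add(item)
        save_done(checkpoint_path, done)  # checkpoint after every item
    return done
```

If the worker dies after an item's external call but before its checkpoint is written, that one item is processed again on resume, which is why the per-item work should also be idempotent.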

TEST-RR-RE02: Server Restart Mid-Request

Steps:

Send an API request for a long operation.
Restart the API server.
Client receives connection error.
Client retries the request.

Pass Criteria:

If operation was committed: retry returns existing result.
If operation was not committed: retry re-executes.
No duplicate side effects.
Client can detect server restart (health check endpoint).
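
The first three criteria describe idempotency-key behavior: the client sends the same key on retry, and the server returns the stored result if the operation already committed. A minimal in-memory sketch, with `IdempotentExecutor` as a hypothetical name; a real service would persist the key and result in the same transaction as the operation's side effects.

```python
class IdempotentExecutor:
    """Runs an operation at most once per idempotency key (after commit)."""

    def __init__(self):
        self._committed = {}  # idempotency key -> stored result

    def execute(self, key: str, operation):
        if key in self._committed:
            # Operation was committed earlier: return the existing result.
            return self._committed[key]
        result = operation()            # may raise; nothing is stored then
        self._committed[key] = result   # commit point
        return result
```

If the operation fails before the commit point, nothing is recorded, so the client's retry re-executes it.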

Task 7: Test Sequential Loop Safeguards

Objective: Verify that operations processing items in a loop have safeguards against runaway behavior.

Concrete Tests

TEST-RR-SL01: Loop Does Not Run Forever

Steps:

Start a processing loop with a known count of items.
Verify: loop terminates after processing all items.
Test: what if the item source keeps returning items? (Pagination bug, always has_more=true)

Pass Criteria:

Loop has a maximum iteration limit (e.g., 10,000).
Loop has a maximum duration limit (e.g., 30 minutes).
Either limit triggers: stop processing, log warning, alert.
Pagination has a known endpoint (empty page = stop).
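
For the pagination case specifically, the bound can be placed on the number of pages fetched. The sketch below assumes a hypothetical `fetch_page(cursor)` callable that returns `(items, next_cursor, has_more)`; the page limit of 1000 is an arbitrary example value.

```python
def fetch_all(fetch_page, max_pages: int = 1000):
    """Fetch pages until the source is exhausted or the page limit is hit.

    Returns (items, status) where status is "complete" or "truncated".
    """
    items = []
    cursor = None
    for _ in range(max_pages):
        page, cursor, has_more = fetch_page(cursor)
        items.extend(page)
        if not page or not has_more:
            return items, "complete"  # empty page or explicit end = stop
    # The source kept claiming more data: stop and let the caller log and alert.
    return items, "truncated"
```

Returning a distinct "truncated" status lets the caller tell a normal finish from a safeguard trip.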

TEST-RR-SL02: Per-Item Error Isolation

Steps:

Process a loop of 20 items where item 5 throws an unexpected exception.
Observe: does item 5's error crash the entire loop?

Pass Criteria:

Item 5 is caught, logged, and marked as failed.
Items 6-20 continue processing.
The loop does not exit on the first error.
Error count is tracked; if > threshold (e.g., 50%), loop stops early.

Loop Safety Template

import logging
import time

log = logging.getLogger(__name__)

MAX_ITERATIONS = 10000
MAX_DURATION_SECONDS = 30 * 60
MAX_CONSECUTIVE_ERRORS = 5

# get_items(), process_item(), and record_item_failure() are placeholders
# for your own item source, per-item work, and failure bookkeeping.
start_time = time.monotonic()  # monotonic clock: unaffected by wall-clock changes
consecutive_errors = 0
processed = 0

for item in get_items():
    # Safeguard: max iterations
    processed += 1
    if processed > MAX_ITERATIONS:
        log.error(f"Loop exceeded max iterations: {MAX_ITERATIONS}")
        break

    # Safeguard: max duration
    if time.monotonic() - start_time > MAX_DURATION_SECONDS:
        log.error(f"Loop exceeded max duration: {MAX_DURATION_SECONDS}s")
        break

    # Safeguard: consecutive errors
    try:
        process_item(item)
        consecutive_errors = 0  # Reset on success
    except Exception as e:
        consecutive_errors += 1
        log.error(f"Item {item.id} failed: {e}")
        record_item_failure(item, e)
        if consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
            log.error(f"Circuit break: {MAX_CONSECUTIVE_ERRORS} consecutive errors")
            break

Task 8: Test Queue Resilience

Objective: Verify that the job queue handles failure modes gracefully.

Concrete Tests

TEST-RR-Q01: Queue Message Loss

Steps:

Enqueue a job.
Simulate queue restart/failure before worker picks up the job.
After queue recovers, check: is the job still in the queue?

Pass Criteria:

Queue messages are durable (persisted to disk, not just memory).
Queue recovery preserves unacknowledged messages.
No message loss during queue restart.

TEST-RR-Q02: Worker Failure to Acknowledge

Steps:

Worker picks up a job.
Worker crashes before acknowledging (completing) the job.
Observe: is the job redelivered?

Pass Criteria:

Job is redelivered after visibility timeout.
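Visibility timeout is at least as long as the maximum expected processing time, or the worker extends it via heartbeat while the job runs; otherwise a job that is still being processed is redelivered and runs twice.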
Redelivered job is processed idempotently (no duplicate work).
Dead letter queue catches jobs that fail repeatedly (after N redeliveries).
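
The redelivery and dead-letter behavior can be illustrated with a small in-memory model. This is a teaching sketch, not a real queue client: `VisibilityQueue` is a hypothetical class, time is passed in explicitly so the behavior is deterministic, and nothing here is durable.

```python
class VisibilityQueue:
    """In-memory model of at-least-once delivery with a visibility timeout."""

    def __init__(self, visibility_timeout: float, max_receives: int):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.dead_letters = []
        self._messages = {}  # id -> {"body", "visible_at", "receives"}
        self._next_id = 0

    def send(self, body) -> int:
        msg_id = self._next_id
        self._next_id += 1
        self._messages[msg_id] = {"body": body, "visible_at": 0.0, "receives": 0}
        return msg_id

    def receive(self, now: float):
        """Return (id, body) of the next visible message, or None."""
        for msg_id, msg in list(self._messages.items()):
            if msg["visible_at"] > now:
                continue  # still held by a worker that has not acked yet
            if msg["receives"] >= self.max_receives:
                # Too many failed deliveries: move to the dead letter queue.
                self.dead_letters.append(msg["body"])
                del self._messages[msg_id]
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg_id, msg["body"]
        return None

    def ack(self, msg_id: int) -> None:
        """Acknowledge successful processing; the message is deleted."""
        self._messages.pop(msg_id, None)
```

A worker that crashes never calls `ack`, so the message becomes visible again once its timeout elapses; a message that keeps failing ends up in `dead_letters` instead of looping forever.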

TEST-RR-Q03: Queue Backpressure

Steps:

Flood the queue with 1000 jobs.
Observe worker behavior.
Add more jobs while the 1000 are processing.

Pass Criteria:

Workers process at sustainable rate (not crashing from overload).
New jobs are accepted and queued (not rejected).
Queue depth is monitored and alerted.
Priority jobs are processed before bulk jobs.
Memory does not grow unbounded on workers.

Queue Architecture Audit

| Aspect | Configuration | Status |
|--------|-------------|--------|
| Persistence | [ ] Durable [ ] In-memory | |
| Delivery guarantee | [ ] At-least-once [ ] Exactly-once | |
| Visibility timeout | ___ seconds | |
| Max redeliveries | ___ attempts | |
| Dead letter queue | [ ] Configured [ ] Not configured | |
| Priority levels | [ ] Yes (___ levels) [ ] No | |
| Backpressure | [ ] Max queue depth [ ] Max concurrent | |
| Monitoring | [ ] Queue depth [ ] Processing rate [ ] Error rate | |
| Alerting | [ ] Queue growing [ ] DLQ items [ ] No workers | |

Task 9: Test State Accuracy

Objective: Verify that system state accurately reflects reality at all times, especially after failures.

Concrete Tests

TEST-RR-S01: State After Each Failure Mode

For each failure mode, verify the resulting state is accurate:

| Failure Mode | Expected State | Actual State | Accurate? |
|-------------|---------------|-------------|-----------|
| Worker crash during processing | timeout/failed (after timeout) | | [ ] |
| API call timeout | failed/retrying | | [ ] |
| Storage upload failure | failed (asset), processing (job) | | [ ] |
| Queue message lost | stuck detection catches it | | [ ] |
| Partial batch completion | partial (job), mix of complete/failed (items) | | [ ] |
| User cancellation during processing | cancelled (job + remaining items) | | [ ] |
| Server restart during request | queued/failed (depends on commit point) | | [ ] |

TEST-RR-S02: No "Impossible" States After Failure

Steps:

After each failure test above, check for impossible states.
Query the database for anomalies.

Impossible State Queries:

-- Jobs stuck in processing (should have timed out)
SELECT * FROM jobs
WHERE status = 'processing'
AND updated_at < NOW() - INTERVAL '2 hours';

-- Completed jobs with no output
SELECT * FROM jobs
WHERE status = 'completed'
AND NOT EXISTS (SELECT 1 FROM assets WHERE job_id = jobs.id AND status = 'ready');

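-- Items processing with no active worker
-- (NOT EXISTS also catches items whose worker_id is NULL, which NOT IN would skip)
SELECT * FROM job_items ji
WHERE ji.status = 'processing'
AND NOT EXISTS (
  SELECT 1 FROM workers w
  WHERE w.id = ji.worker_id
  AND w.last_heartbeat > NOW() - INTERVAL '5 minutes'
);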

-- Batch "completed" but has failed items
SELECT j.id, j.status,
  COUNT(*) FILTER (WHERE ji.status = 'failed') as failed_count
FROM jobs j JOIN job_items ji ON ji.job_id = j.id
WHERE j.status = 'completed'
GROUP BY j.id, j.status
HAVING COUNT(*) FILTER (WHERE ji.status = 'failed') > 0;
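
Queries like these find impossible states after the fact; validating transitions at write time helps prevent them. The sketch below is one possible shape. The status names come from this audit, but the transition map itself is an assumption and should be replaced with the rules of the system under test.

```python
# Assumed transition rules for illustration; adjust to the real state machine.
ALLOWED_TRANSITIONS = {
    "queued": {"processing", "cancelled"},
    "processing": {"completed", "partial", "failed", "timeout", "cancelled"},
    "partial": {"queued"},   # retry of the failed items
    "failed": {"queued"},
    "timeout": {"queued"},
    "completed": set(),      # terminal
    "cancelled": set(),      # terminal
}


class InvalidTransition(Exception):
    pass


def transition(job: dict, new_status: str, failed_items: int = 0) -> None:
    """Apply a status change only if the state machine allows it."""
    current = job["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new_status} is not allowed")
    if new_status == "completed" and failed_items > 0:
        # Guards against a batch marked "completed" that has failed items.
        raise InvalidTransition("cannot complete a job that has failed items")
    job["status"] = new_status
```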

Task 10: Test Observability

Objective: Verify that every failure mode produces sufficient diagnostic information.

Concrete Tests

TEST-RR-O01: Failure Diagnosis Time

For each failure mode tested above, measure: how long would it take an engineer to diagnose the root cause using only production logs and metrics?

Target: < 15 minutes from alert to root cause identification.

| Failure Mode | Alert Fired? | Time to Find in Logs | Root Cause Identifiable? | Diagnosis Time |
|-------------|-------------|---------------------|------------------------|----------------|
| Worker crash | [ ] | ___ min | [ ] | ___ min |
| API timeout | [ ] | ___ min | [ ] | ___ min |
| Storage failure | [ ] | ___ min | [ ] | ___ min |
| Queue issue | [ ] | ___ min | [ ] | ___ min |
| State corruption | [ ] | ___ min | [ ] | ___ min |

TEST-RR-O02: End-to-End Trace Completeness

Steps:

Trigger a workflow that touches every component.
Using only the correlation ID, reconstruct the entire flow from logs.

Pass Criteria:

Every step is visible in logs.
Correlation ID links all entries.
External call details (request, response, duration) are logged.
Failure point is unambiguous.
Duration breakdown is available per step.
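
A common way to satisfy these criteria is to attach a correlation ID to every log record and wrap each step in a timer. The sketch below uses Python's standard `logging` and `contextvars` modules; the logger name, log format, and the `timed_step` helper are assumptions for illustration.

```python
import contextvars
import logging
import time
from contextlib import contextmanager

# Set once per request or job; read by the filter for every log record.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copies the current correlation ID onto each log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(handler: logging.Handler) -> logging.Logger:
    logger = logging.getLogger("workflow")
    logger.setLevel(logging.INFO)
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


@contextmanager
def timed_step(logger: logging.Logger, name: str):
    """Log the outcome and duration of one workflow step."""
    started = time.monotonic()
    try:
        yield
    except Exception as exc:
        duration_ms = (time.monotonic() - started) * 1000
        logger.error(
            "step=%s outcome=failed duration_ms=%.0f error=%r", name, duration_ms, exc
        )
        raise
    else:
        duration_ms = (time.monotonic() - started) * 1000
        logger.info("step=%s outcome=ok duration_ms=%.0f", name, duration_ms)
```

With this in place, filtering the logs by one correlation ID yields the ordered list of steps, their durations, and the step that failed.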

Full Risk Patterns Table

| # | Pattern | Category | Risk | Detection | Mitigation |
|---|---------|----------|------|-----------|------------|
| 1 | No timeout on external calls | Timeout | CRITICAL | Stuck jobs | Configure per-call timeout |
| 2 | No job-level timeout | Timeout | HIGH | Stuck jobs | Max duration per job type |
| 3 | Entire batch fails on single item error | Partial | HIGH | Lost work | Per-item error isolation |
| 4 | Retry reprocesses all items | Retry | HIGH | Wasted cost | Checkpoint + selective retry |
| 5 | Retry without idempotency | Retry | HIGH | Duplicates | Idempotency keys |
| 6 | Infinite retry loop | Retry | CRITICAL | Cost explosion | Max retry + circuit breaker |
| 7 | No checkpoint in batch processing | Resume | HIGH | Lost progress | Per-item completion tracking |
| 8 | Worker crash loses progress | Resume | HIGH | Wasted work | Durable checkpoints |
| 9 | Sequential loop without bounds | Loop | HIGH | Runaway process | Max iterations + duration |
| 10 | No error isolation in loops | Loop | HIGH | Cascading failure | Try-catch per item |
| 11 | Queue message loss | Queue | HIGH | Lost jobs | Durable queue + monitoring |
| 12 | No dead letter queue | Queue | MEDIUM | Silent failure | DLQ + alerting |
| 13 | No visibility timeout | Queue | HIGH | Duplicate processing | Appropriate timeout config |
| 14 | Stuck jobs undetected | State | HIGH | Manual DB fixes | Stuck detection + auto-timeout |
| 15 | Impossible states possible | State | HIGH | Data corruption | Transition validation + constraints |
| 16 | No correlation ID in logs | Observability | HIGH | Slow debugging | Middleware-generated trace ID |
| 17 | Generic error messages | Observability | MEDIUM | Slow debugging | Specific errors with context |
| 18 | No per-step timing | Observability | MEDIUM | Can't find bottleneck | Duration logging per step |
| 19 | Missing external call logging | Observability | HIGH | Blind to provider issues | Request/response/duration logging |
| 20 | No alerting on critical failures | Observability | HIGH | Silent incidents | Alert rules on error rate + stuck jobs |
Pass Criteria Summary

CRITICAL (Must pass for production):
[ ] Every external call has a timeout
[ ] Every job has a maximum duration
[ ] Batch processing isolates per-item errors
[ ] Retry is safe (idempotent, bounded, checkpointed)
[ ] Queue messages are durable
[ ] Stuck jobs are detected and auto-resolved
[ ] Correlation IDs trace requests end-to-end

HIGH (Should pass for reliability):
[ ] Resume from checkpoint after worker crash
[ ] Circuit breaker on cascading external failures
[ ] Sequential loops have bounds (max iterations, max duration)
[ ] Dead letter queue configured with alerting
[ ] Impossible states prevented by DB constraints
[ ] Per-step timing logged for performance diagnosis
[ ] Alerting on all critical failure modes

MEDIUM (Recommended for operational maturity):
[ ] Partial completion state exists (not binary success/fail)
[ ] Priority queue levels for interactive vs bulk jobs
[ ] Health check endpoint tests real dependencies
[ ] Runbook exists for each alert
[ ] Error messages include entity IDs and provider responses

Priority Targeting Methodology

Assess Your System

Answer these questions to prioritize which tasks to run first:

1. Do you have long-running workflows (> 30s)?
   YES -> Start with Task 1 (Map Workflows) + Task 3 (Timeouts)

2. Do workflows call external paid APIs?
   YES -> Prioritize Task 5 (Retry Safety) + Task 7 (Loop Safeguards)

3. Have users reported stuck/lost jobs?
   YES -> Prioritize Task 9 (State Accuracy) + Task 6 (Resume)

4. Is debugging production issues slow (> 30 min)?
   YES -> Prioritize Task 10 (Observability)

5. Do batch operations fail entirely on single-item errors?
   YES -> Prioritize Task 4 (Partial Completion) + Task 7 (Loop Safeguards)

6. Are jobs processed by a queue system?
   YES -> Prioritize Task 8 (Queue Resilience)

System Type Prioritization

AI/Media Pipeline:
  1. Timeout behavior (Task 3) -- expensive external calls can hang
  2. Retry safety (Task 5) -- retries cost real money
  3. Partial completion (Task 4) -- large batches must not restart from zero
  4. Observability (Task 10) -- must track provider call details

SaaS Platform:
  1. State accuracy (Task 9) -- user-facing state must be correct
  2. Queue resilience (Task 8) -- job processing is core business logic
  3. Resume behavior (Task 6) -- users expect progress to survive failures
  4. Observability (Task 10) -- multi-tenant debugging requires correlation

E-Commerce:
  1. Retry safety (Task 5) -- payment and inventory operations must be idempotent
  2. State accuracy (Task 9) -- order state drives fulfillment
  3. Timeout behavior (Task 3) -- payment gateway timeouts are common
  4. Queue resilience (Task 8) -- order processing queue is critical path

Execution Methodology

Phase 1: Discovery (1-2 days)

Complete Task 1 (Map Workflows)
Complete Task 2 (Identify High-Risk Steps)
Create risk-prioritized test plan

Phase 2: Core Testing (2-3 days)

Task 3: Timeout Behavior
Task 5: Retry Safety
Task 9: State Accuracy
Task 10: Observability

Phase 3: Deep Testing (2-3 days)

Task 4: Partial Completion
Task 6: Resume Behavior
Task 7: Sequential Loop Safeguards
Task 8: Queue Resilience

Phase 4: Remediation (ongoing)

Fix CRITICAL findings immediately
Schedule HIGH findings for next sprint
Track MEDIUM findings in backlog

What Earlier Audits Miss

Standard reliability testing verifies that the system handles known error cases. This audit matters because:

Unit tests exercise individual components in isolation. They do not show what happens when one component fails while another depends on it.
Integration tests run in controlled environments. They rarely simulate network partitions, worker crashes, or queue message loss.
Load tests verify performance under sustained throughput. They do not inject faults during load.
The other 10 audits in this pack each cover a specific concern. This audit ties them together and tests the interactions between them.
Chaos engineering is often skipped because teams consider it "too risky for staging." This audit provides structured, safe fault injection tests.

In short, a Reliability & Resilience Audit tests whether the system produces correct results and recovers automatically under component failures, network issues, timeouts, and queue problems.

Automation Opportunities

| Test | Automatable? | Method |
|------|--------------|--------|
| TEST-RR-T01: API timeout | YES | Mock slow external service; assert timeout and cleanup |
| TEST-RR-T02: Job timeout | YES | Inject hanging dependency; assert job-level timeout fires |
| TEST-RR-P01: Batch partial failure | YES | Inject failure at item N; assert partial success state |
| TEST-RR-P02: Systemic failure | YES | Kill mock API mid-batch; assert circuit breaker |
| TEST-RR-R01: Retry single | YES | Fail one item, retry, assert only that item reprocessed |
| TEST-RR-R03: Retry idempotency | YES | Double-retry; assert single execution |
| TEST-RR-RE01: Worker crash | YES | Kill worker process; restart; assert resume from checkpoint |
| TEST-RR-SL01: Loop bounds | YES | Feed infinite pagination; assert loop terminates |
| TEST-RR-Q01: Queue durability | YES | Restart queue; assert messages preserved |
| TEST-RR-S01: State accuracy | YES | Run failure scenarios; query DB for impossible states |
| TEST-RR-O01: Diagnosis time | MANUAL | Simulate incident; measure time to root cause |

Reusable Audit Report Template

# Reliability & Resilience Audit Report

## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________

## Workflow Inventory (Task 1)
| Workflow | Steps | Duration | External Dependencies | Risk |
|----------|-------|---------|----------------------|------|
| ___ | ___ | ___s | ___ | HIGH/MEDIUM/LOW |

## Test Results by Task
| Task | Tests Run | Passed | Failed | Critical Findings |
|------|----------|--------|--------|-------------------|
| 3. Timeouts | ___ | ___ | ___ | ___ |
| 4. Partial completion | ___ | ___ | ___ | ___ |
| 5. Retry safety | ___ | ___ | ___ | ___ |
| 6. Resume | ___ | ___ | ___ | ___ |
| 7. Loop safeguards | ___ | ___ | ___ | ___ |
| 8. Queue resilience | ___ | ___ | ___ | ___ |
| 9. State accuracy | ___ | ___ | ___ | ___ |
| 10. Observability | ___ | ___ | ___ | ___ |

## Overall Score: PASS / PARTIAL / FAIL

Post-Audit Deliverables

1. Workflow inventory (Task 1 output)
2. Risk assessment matrix (Task 2 output)
3. Test results per task (pass/partial/fail with evidence)
4. Finding severity classification (CRITICAL/HIGH/MEDIUM)
5. Remediation recommendations with effort estimates
6. Retest plan (verify fixes)
