
Reliability & Resilience Audit


Purpose

This is the most comprehensive audit in the production audit pack. It tests long-running, distributed workflows under real-world conditions: timeouts, partial failures, retries, queue issues, state corruption, and observability gaps. This audit encompasses and extends all other audits in the pack, providing a unified methodology for verifying production reliability.

A system is reliable when it produces correct results even when components fail. A system is resilient when it recovers from failures without human intervention. This audit verifies both.


The 10 Audit Tasks

This audit follows a structured 10-task methodology that covers every reliability concern in a distributed system:

  1. Map every long-running workflow
  2. Identify high-risk steps
  3. Test timeout behavior
  4. Test partial completion
  5. Test retry safety
  6. Test resume behavior
  7. Test sequential loop safeguards
  8. Test queue resilience
  9. Test state accuracy
  10. Test observability

Task 1: Map Every Long-Running Workflow

Objective: Create a complete inventory of every workflow that takes more than 5 seconds or involves multiple steps.

Methodology

  1. List every user-triggerable operation that involves background processing.
  2. For each, document the complete step sequence.
  3. Identify external dependencies at each step.

Workflow Inventory Template

WORKFLOW: Asset Generation Pipeline
  Trigger: User clicks "Generate" on project
  Expected Duration: 30s - 5min (depending on asset count)
  Steps:
    1. Validate input (< 1s) [Internal]
    2. Check user quota (< 1s) [Database]
    3. Create job record (< 1s) [Database]
    4. Enqueue generation tasks (< 1s) [Queue]
    5. For each asset:
       5a. Call AI provider (5-30s) [External: OpenAI/Anthropic]
       5b. Upload result to storage (1-5s) [External: GCS/S3]
       5c. Generate thumbnail (1-3s) [Internal/External]
       5d. Update asset record (< 1s) [Database]
    6. Update job status (< 1s) [Database]
    7. Notify user (< 1s) [External: Email/Push]

  Total steps: 7 (+ N * 4 for N assets)
  External dependencies: AI Provider, Cloud Storage, Notification Service
  Failure modes: Provider timeout, storage failure, quota exceeded, queue down

WORKFLOW: Project Export
  Trigger: User clicks "Export Project"
  Expected Duration: 10s - 2min
  Steps:
    1. Validate permissions (< 1s) [Internal]
    2. Gather project metadata (< 1s) [Database]
    3. Download all asset files (1-60s) [External: Storage]
    4. Package into archive (1-30s) [Internal: Compute]
    5. Upload archive to temp storage (1-10s) [External: Storage]
    6. Generate download URL (< 1s) [Internal]
    7. Notify user (< 1s) [External: Email/Push]

WORKFLOW: [Document each workflow in your system]

Completeness Check

[ ] All user-triggered background operations listed
[ ] All scheduled/cron jobs listed
[ ] All webhook-triggered workflows listed
[ ] All admin-triggered operations listed
[ ] Each workflow has: step sequence, timing, dependencies
[ ] External dependencies highlighted for each step

Task 2: Identify High-Risk Steps

Objective: For each workflow, identify the steps that are most likely to fail and that have the highest impact when they do.

Risk Assessment Matrix

| Step | Failure Probability | Impact if Fails | Detection Difficulty | Risk Score |
|------|--------------------|-----------------|---------------------|------------|
| AI provider call | HIGH (timeout, rate limit, content filter) | HIGH (no output) | LOW (error returned) | CRITICAL |
| Storage upload | MEDIUM (transient 503) | HIGH (output lost) | LOW (error returned) | HIGH |
| Database write | LOW (connection issues) | HIGH (state lost) | LOW (error returned) | MEDIUM |
| Queue enqueue | LOW (queue down) | HIGH (job lost) | MEDIUM (may not error) | HIGH |
| Notification send | MEDIUM (rate limit, bounce) | LOW (not blocking) | HIGH (fire-and-forget) | MEDIUM |
| Thumbnail generation | MEDIUM (OOM on large files) | LOW (degraded UX) | MEDIUM (may crash silently) | MEDIUM |

High-Risk Step Indicators

A step is HIGH-RISK if:
[ ] It calls an external service (network dependency)
[ ] It takes more than 10 seconds (timeout vulnerability)
[ ] It modifies state that is hard to undo (irreversible)
[ ] It processes user-provided input (unpredictable size/format)
[ ] It runs in a loop (N items = N chances to fail)
[ ] Its failure is hard to detect (fire-and-forget, no response check)
[ ] It has no retry mechanism
[ ] It has no timeout configured

Task 3: Test Timeout Behavior

Objective: Verify that every long-running step has appropriate timeout configuration and handles timeout gracefully.

Concrete Tests

TEST-RR-T01: External API Call Timeout

Steps:

  1. Configure a mock external service that delays 120 seconds.
  2. Trigger the workflow.
  3. Observe: does the call timeout? After how long?
  4. What happens after timeout?

Pass Criteria:

  • Timeout is configured (not waiting indefinitely).
  • Timeout is appropriate (30s for API calls, 300s for heavy processing).
  • On timeout: error is logged with specific "timeout" classification.
  • On timeout: job moves to retry or failed state (not stuck in "processing").
  • On timeout: resources are cleaned up (connections released, temp files removed).
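
The pass criteria above can be sketched as a small automated check. This is a minimal illustration, not the system's actual code: `call_provider`, `run_step`, and the job dictionary fields are hypothetical names, and the mock delay is shortened so the check runs quickly.

```python
import asyncio


async def call_provider(delay_s: float) -> str:
    """Hypothetical stand-in for an external API call that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return "ok"


async def run_step(job: dict, delay_s: float, timeout_s: float) -> dict:
    """Run one step under a timeout and record the outcome on the job dict."""
    job["status"] = "processing"
    try:
        job["result"] = await asyncio.wait_for(call_provider(delay_s), timeout=timeout_s)
        job["status"] = "completed"
    except asyncio.TimeoutError:
        # Classify the failure explicitly so logs and alerts can filter on it,
        # and move the job out of "processing".
        job["status"] = "failed"
        job["error_class"] = "timeout"
    finally:
        # Placeholder for releasing connections and removing temp files.
        job["cleaned_up"] = True
    return job


slow_job = asyncio.run(run_step({"id": 1}, delay_s=0.5, timeout_s=0.05))
fast_job = asyncio.run(run_step({"id": 2}, delay_s=0.0, timeout_s=0.05))
```

In a real test the mock delay would exceed the configured production timeout (for example the 120-second mock against a 30-second limit), and the assertion would be made against the job record in the database rather than an in-memory dict.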

TEST-RR-T02: Job-Level Timeout

Steps:

  1. Start a job that will process 20 items.
  2. Make one item hang indefinitely (mock a non-responding dependency).
  3. Observe: does the overall job timeout?

Pass Criteria:

  • Job-level timeout exists (e.g., 30 minutes maximum).
  • Timeout is independent of individual step timeouts.
  • On job timeout: completed items are preserved.
  • On job timeout: job state changes to "timeout" or "failed".
  • On job timeout: alert fires.

Timeout Configuration Audit

| Component | Timeout | Default | Configured | Appropriate? |
|-----------|---------|---------|-----------|-------------|
| HTTP client (outgoing) | Connect | 5s | ? | |
| HTTP client (outgoing) | Read | 30s | ? | |
| HTTP server (incoming) | Request | 60s | ? | |
| Database query | Statement | 30s | ? | |
| Database connection | Acquire | 5s | ? | |
| Queue job | Processing | 300s | ? | |
| Queue job | Visibility | 60s | ? | |
| Worker | Heartbeat | 30s | ? | |
| Storage upload | Per-file | 120s | ? | |
| Webhook delivery | Per-attempt | 10s | ? | |

Task 4: Test Partial Completion

Objective: Verify that partial completion is handled correctly when a batch fails partway through.

Concrete Tests

TEST-RR-P01: Batch Fails at Item 7 of 20

Steps:

  1. Start a batch of 20 items.
  2. Cause item 7 to fail (bad input for that specific item).
  3. Observe: do items 8-20 still process?

Pass Criteria:

  • Items 1-6: completed successfully, results preserved.
  • Item 7: marked as failed with specific error.
  • Items 8-20: continue processing (not blocked by item 7).
  • Batch status: "partial" (not "failed" for entire batch).
  • User can see: 19/20 succeeded, 1/20 failed.
  • User can retry just item 7.
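
One way to get a "partial" batch status is to derive it from the per-item statuses instead of storing it independently. The sketch below assumes item statuses named `pending`, `processing`, `completed`, and `failed`; those names and the `batch_status` helper are illustrative, not taken from a specific system.

```python
from collections import Counter


def batch_status(item_statuses):
    """Derive the batch status from the statuses of its items."""
    counts = Counter(item_statuses)
    if counts["pending"] or counts["processing"]:
        return "processing"  # work is still outstanding
    if counts["failed"] == 0:
        return "completed"   # every item succeeded
    if counts["completed"] == 0:
        return "failed"      # nothing succeeded
    return "partial"         # mix of successes and failures


# The TEST-RR-P01 scenario: 19 items succeed, item 7 fails.
statuses = ["completed"] * 6 + ["failed"] + ["completed"] * 13
summary = f"{statuses.count('completed')}/20 succeeded, {statuses.count('failed')}/20 failed"
```

Deriving the status this way makes it impossible for the batch to read "completed" while one of its items is still marked failed.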

TEST-RR-P02: Batch Fails at Item 7 Due to Systemic Issue

Steps:

  1. Start a batch of 20 items.
  2. After item 6, kill the external API (all remaining items will fail).
  3. Observe: does the batch keep trying all 14 remaining items?

Pass Criteria:

  • Circuit breaker trips after 3-5 consecutive failures.
  • Remaining items are NOT attempted (fast-fail).
  • Items 1-6: completed, preserved.
  • Items 7-9: failed (attempted before circuit break).
  • Items 10-20: skipped / queued for retry (not attempted during outage).
  • User notification: "Generation paused: AI provider is unavailable. 6/20 complete. Will retry automatically."
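
The expected outcome of this test can be reproduced with a very small circuit breaker. The following is a sketch under stated assumptions: `run_batch` and `flaky_provider` are hypothetical names, the threshold of 3 is one value from the 3-5 range above, and a production breaker would also need a half-open state to probe for recovery.

```python
def run_batch(items, call, failure_threshold=3):
    """Process items in order; stop calling the provider after repeated failures."""
    results = {}
    consecutive_failures = 0
    tripped = False
    for item in items:
        if tripped:
            # Fast-fail: do not attempt the call during the outage.
            results[item] = "skipped"
            continue
        try:
            call(item)
            results[item] = "completed"
            consecutive_failures = 0
        except ConnectionError:
            results[item] = "failed"
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                tripped = True
    return results


def flaky_provider(item):
    """Mock provider that goes down after item 6."""
    if item >= 7:
        raise ConnectionError("provider unavailable")


results = run_batch(range(1, 21), flaky_provider)
completed = [i for i, s in results.items() if s == "completed"]
failed = [i for i, s in results.items() if s == "failed"]
skipped = [i for i, s in results.items() if s == "skipped"]
```

With a threshold of 3, items 7-9 are attempted and fail, and items 10-20 are never sent to the provider, which matches the pass criteria above.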

Task 5: Test Retry Safety

Objective: Verify that retrying failed operations is safe and produces correct results.

Concrete Tests

TEST-RR-R01: Retry Single Failed Item

Steps:

  1. Complete a batch with 1 failed item.
  2. Click "Retry" on the failed item.
  3. Verify: only the failed item is reprocessed.

Pass Criteria:

  • Only the failed item is sent to the external API.
  • Successfully completed items are NOT reprocessed.
  • Retry uses the same parameters as the original attempt.
  • On success: item moves to "completed", batch to "completed".
  • On failure: retry count incremented, error updated.
  • Max retry limit enforced (no infinite retry).

TEST-RR-R02: Retry Entire Batch

Steps:

  1. Complete a batch with 5 failed items.
  2. Click "Retry All Failed."
  3. Verify: only 5 items are reprocessed.

Pass Criteria:

  • Exactly 5 API calls made (not 20).
  • Completed items untouched.
  • Each retried item has retry_count incremented.
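
Selective retry comes down to filtering on item status before any external call is made. A minimal sketch, assuming each item is a dict with `status` and `retry_count` fields; `MAX_RETRIES`, `select_for_retry`, and `retry_failed` are hypothetical names.

```python
MAX_RETRIES = 3  # assumed limit; enforce whatever your system configures


def select_for_retry(items):
    """Only failed items that still have retry budget are eligible."""
    return [
        item for item in items
        if item["status"] == "failed" and item["retry_count"] < MAX_RETRIES
    ]


def retry_failed(items, call):
    """Retry eligible items and return the number of external calls made."""
    calls_made = 0
    for item in select_for_retry(items):
        item["retry_count"] += 1
        calls_made += 1
        try:
            call(item)
            item["status"] = "completed"
        except Exception as exc:
            item["status"] = "failed"
            item["error"] = str(exc)
    return calls_made


# 20-item batch with 5 failures, as in TEST-RR-R02.
batch = [{"id": i, "status": "completed", "retry_count": 0} for i in range(1, 16)]
batch += [{"id": i, "status": "failed", "retry_count": 0} for i in range(16, 21)]
api_calls = retry_failed(batch, call=lambda item: None)  # mock call that succeeds
```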

TEST-RR-R03: Retry Idempotency

Steps:

  1. Retry a failed item.
  2. Retry the same item again before the first retry completes.

Pass Criteria:

  • Second retry is blocked ("Retry already in progress").
  • Only one retry executes.
  • No duplicate API calls.
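
Blocking a second concurrent retry requires an atomic "claim" step. The sketch below shows the idea with an in-process lock; `RetryGuard` is a hypothetical helper, and in a multi-worker deployment the same claim would have to live in shared storage (for example a conditional database update), which is not shown here.

```python
import threading


class RetryGuard:
    """Tracks which items currently have a retry in flight."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_progress = set()

    def try_begin(self, item_id) -> bool:
        """Atomically claim the item; False means a retry is already running."""
        with self._lock:
            if item_id in self._in_progress:
                return False
            self._in_progress.add(item_id)
            return True

    def finish(self, item_id) -> None:
        """Release the claim once the retry has completed or failed."""
        with self._lock:
            self._in_progress.discard(item_id)


guard = RetryGuard()
first_attempt = guard.try_begin("item-7")    # claims the item
second_attempt = guard.try_begin("item-7")   # blocked: "Retry already in progress"
guard.finish("item-7")
after_finish = guard.try_begin("item-7")     # allowed again once the first retry ended
```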

Task 6: Test Resume Behavior

Objective: Verify that interrupted workflows resume from the correct point.

Concrete Tests

TEST-RR-RE01: Worker Crash Mid-Batch

Steps:

  1. Start a 20-item batch.
  2. Wait until 7 items complete.
  3. Kill the worker process.
  4. Start a new worker.
  5. Observe: does the batch resume from item 8?

Pass Criteria:

  • Items 1-7 preserved (not reprocessed).
  • Processing resumes from item 8.
  • Total external API calls = 20 (not 20 + 7 duplicates).
  • Final result: all 20 items completed.
  • No user intervention required.
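
Resume behavior depends on recording each item's completion somewhere that survives the worker. The following sketch simulates the test with an in-memory set standing in for a durable checkpoint store; `process_batch` and its `crash_after` parameter are hypothetical and exist only to make the crash reproducible.

```python
def process_batch(item_ids, done, call, crash_after=None):
    """Process items not yet checkpointed in `done`; return calls made this run."""
    calls = 0
    for item_id in item_ids:
        if item_id in done:
            continue  # already completed in an earlier run
        if crash_after is not None and calls >= crash_after:
            raise RuntimeError("simulated worker crash")
        call(item_id)
        done.add(item_id)  # in a real system this is a durable write
        calls += 1
    return calls


api_calls = []   # records every external call across both runs
checkpoint = set()

try:
    process_batch(range(1, 21), checkpoint, api_calls.append, crash_after=7)
except RuntimeError:
    pass  # the first worker died after 7 items

resumed_calls = process_batch(range(1, 21), checkpoint, api_calls.append)
```

Because the checkpoint is written after the call, a crash between the two can still cause one duplicate call on resume, so the per-item operation should itself be idempotent.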

TEST-RR-RE02: Server Restart Mid-Request

Steps:

  1. Send an API request for a long operation.
  2. Restart the API server.
  3. Client receives connection error.
  4. Client retries the request.

Pass Criteria:

  • If operation was committed: retry returns existing result.
  • If operation was not committed: retry re-executes.
  • No duplicate side effects.
  • Client can detect server restart (health check endpoint).
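
The first three criteria are commonly met with a client-supplied idempotency key. This is a minimal in-memory sketch: `OperationStore` and the `req-123` key are illustrative, and a real store would persist results in the same transaction that commits the operation.

```python
class OperationStore:
    """Remembers the result of each operation by its idempotency key."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def execute(self, key, operation):
        if key in self._results:
            # Operation was already committed: return the existing result.
            return self._results[key]
        result = operation()
        self.executions += 1
        self._results[key] = result
        return result


store = OperationStore()
first = store.execute("req-123", lambda: {"export_id": 1})
# The client saw a connection error and retries with the same key.
second = store.execute("req-123", lambda: {"export_id": 2})
```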

Task 7: Test Sequential Loop Safeguards

Objective: Verify that operations processing items in a loop have safeguards against runaway behavior.

Concrete Tests

TEST-RR-SL01: Loop Does Not Run Forever

Steps:

  1. Start a processing loop with a known count of items.
  2. Verify: loop terminates after processing all items.
  3. Test: what if the item source keeps returning items? (Pagination bug, always has_more=true)

Pass Criteria:

  • Loop has a maximum iteration limit (e.g., 10,000).
  • Loop has a maximum duration limit (e.g., 30 minutes).
  • Either limit triggers: stop processing, log warning, alert.
  • Pagination has a well-defined termination condition (empty page = stop).

TEST-RR-SL02: Per-Item Error Isolation

Steps:

  1. Process a loop of 20 items where item 5 throws an unexpected exception.
  2. Observe: does item 5's error crash the entire loop?

Pass Criteria:

  • Item 5 is caught, logged, and marked as failed.
  • Items 6-20 continue processing.
  • The loop does not exit on the first error.
  • Error count is tracked; if > threshold (e.g., 50%), loop stops early.

Loop Safety Template

import logging
from datetime import timedelta
from time import monotonic

log = logging.getLogger(__name__)

MAX_ITERATIONS = 10000
MAX_DURATION = timedelta(minutes=30)
MAX_CONSECUTIVE_ERRORS = 5

# get_items, process_item, and record_item_failure are application-specific
# placeholders; supply your own implementations.
start_time = monotonic()  # monotonic clock is unaffected by wall-clock changes
consecutive_errors = 0
processed = 0

for item in get_items():
    # Safeguard: max iterations (checked before processing so the count is exact)
    if processed >= MAX_ITERATIONS:
        log.error(f"Loop exceeded max iterations: {MAX_ITERATIONS}")
        break

    # Safeguard: max duration
    if monotonic() - start_time > MAX_DURATION.total_seconds():
        log.error(f"Loop exceeded max duration: {MAX_DURATION}")
        break

    processed += 1

    # Safeguard: per-item error isolation with a consecutive-error circuit break
    try:
        process_item(item)
        consecutive_errors = 0  # Reset on success
    except Exception as e:
        consecutive_errors += 1
        log.error(f"Item {item.id} failed: {e}")
        record_item_failure(item, e)
        if consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
            log.error(f"Circuit break: {MAX_CONSECUTIVE_ERRORS} consecutive errors")
            break

Task 8: Test Queue Resilience

Objective: Verify that the job queue handles failure modes gracefully.

Concrete Tests

TEST-RR-Q01: Queue Message Loss

Steps:

  1. Enqueue a job.
  2. Simulate queue restart/failure before worker picks up the job.
  3. After queue recovers, check: is the job still in the queue?

Pass Criteria:

  • Queue messages are durable (persisted to disk, not just memory).
  • Queue recovery preserves unacknowledged messages.
  • No message loss during queue restart.

TEST-RR-Q02: Worker Failure to Acknowledge

Steps:

  1. Worker picks up a job.
  2. Worker crashes before acknowledging (completing) the job.
  3. Observe: is the job redelivered?

Pass Criteria:

  • Job is redelivered after visibility timeout.
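  • Visibility timeout is longer than the time a worker needs to process the job, or is extended by worker heartbeats, so a job that is still running is not redelivered.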
  • Redelivered job is processed idempotently (no duplicate work).
  • Dead letter queue catches jobs that fail repeatedly (after N redeliveries).
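
The redelivery and dead-letter behavior can be modelled in a few lines to make the expected sequence concrete. This is a toy in-memory model, not any real broker's API: `TinyQueue`, `max_receives`, and `expire_visibility` are hypothetical names, and visibility-timeout expiry is triggered manually instead of by a clock.

```python
from collections import deque


class TinyQueue:
    """Toy at-least-once queue with a receive limit and a dead letter list."""

    def __init__(self, max_receives=3):
        self.ready = deque()
        self.dead_letter = []
        self.receive_counts = {}
        self.max_receives = max_receives

    def send(self, message):
        self.ready.append(message)

    def receive(self):
        """Return the next deliverable message, or None if there is none."""
        while self.ready:
            message = self.ready.popleft()
            count = self.receive_counts.get(message, 0) + 1
            self.receive_counts[message] = count
            if count > self.max_receives:
                # Too many failed deliveries: park it for inspection.
                self.dead_letter.append(message)
                continue
            return message
        return None

    def expire_visibility(self, message):
        """Simulate the visibility timeout elapsing without an acknowledgement."""
        self.ready.append(message)


q = TinyQueue(max_receives=3)
q.send("job-1")
deliveries = []
for _ in range(4):
    message = q.receive()
    deliveries.append(message)
    if message is not None:
        q.expire_visibility(message)  # worker crashed before acknowledging
```

After three unacknowledged deliveries the fourth receive returns nothing and the job sits in the dead letter list, which is the point at which an alert should fire.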

TEST-RR-Q03: Queue Backpressure

Steps:

  1. Flood the queue with 1000 jobs.
  2. Observe worker behavior.
  3. Add more jobs while the 1000 are processing.

Pass Criteria:

  • Workers process at sustainable rate (not crashing from overload).
  • New jobs are accepted and queued (not rejected).
  • Queue depth is monitored and alerted.
  • Priority jobs are processed before bulk jobs.
  • Memory does not grow unbounded on workers.
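
The priority criterion can be checked with a simple ordering test. The sketch below uses Python's standard `queue.PriorityQueue`; the two priority levels and the `enqueue`/`drain` helpers are assumptions for illustration only.

```python
import itertools
import queue

INTERACTIVE, BULK = 0, 1  # lower number = higher priority (assumed levels)

jobs = queue.PriorityQueue()
_sequence = itertools.count()  # tie-breaker keeps FIFO order within a priority


def enqueue(priority, job_id):
    jobs.put((priority, next(_sequence), job_id))


def drain():
    """Pull jobs in the order a single worker would process them."""
    order = []
    while not jobs.empty():
        _, _, job_id = jobs.get()
        order.append(job_id)
    return order


enqueue(BULK, "bulk-1")
enqueue(BULK, "bulk-2")
enqueue(INTERACTIVE, "user-1")  # arrives while bulk jobs are waiting
enqueue(BULK, "bulk-3")
processing_order = drain()
```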

Queue Architecture Audit

| Aspect | Configuration | Status |
|--------|-------------|--------|
| Persistence | [ ] Durable [ ] In-memory | |
| Delivery guarantee | [ ] At-least-once [ ] Exactly-once | |
| Visibility timeout | ___ seconds | |
| Max redeliveries | ___ attempts | |
| Dead letter queue | [ ] Configured [ ] Not configured | |
| Priority levels | [ ] Yes (___ levels) [ ] No | |
| Backpressure | [ ] Max queue depth [ ] Max concurrent | |
| Monitoring | [ ] Queue depth [ ] Processing rate [ ] Error rate | |
| Alerting | [ ] Queue growing [ ] DLQ items [ ] No workers | |

Task 9: Test State Accuracy

Objective: Verify that system state accurately reflects reality at all times, especially after failures.

Concrete Tests

TEST-RR-S01: State After Each Failure Mode

For each failure mode, verify the resulting state is accurate:

| Failure Mode | Expected State | Actual State | Accurate? |
|-------------|---------------|-------------|-----------|
| Worker crash during processing | timeout/failed (after timeout) | | [ ] |
| API call timeout | failed/retrying | | [ ] |
| Storage upload failure | failed (asset), processing (job) | | [ ] |
| Queue message lost | stuck detection catches it | | [ ] |
| Partial batch completion | partial (job), mix of complete/failed (items) | | [ ] |
| User cancellation during processing | cancelled (job + remaining items) | | [ ] |
| Server restart during request | queued/failed (depends on commit point) | | [ ] |

TEST-RR-S02: No "Impossible" States After Failure

Steps:

  1. After each failure test above, check for impossible states.
  2. Query the database for anomalies.

Impossible State Queries:

-- Jobs stuck in processing (should have timed out)
SELECT * FROM jobs
WHERE status = 'processing'
AND updated_at < NOW() - INTERVAL '2 hours';

-- Completed jobs with no output
SELECT * FROM jobs
WHERE status = 'completed'
AND NOT EXISTS (SELECT 1 FROM assets WHERE job_id = jobs.id AND status = 'ready');

-- Items processing with no active worker (NOT EXISTS also catches a NULL worker_id)
SELECT * FROM job_items ji
WHERE ji.status = 'processing'
AND NOT EXISTS (
  SELECT 1 FROM workers w
  WHERE w.id = ji.worker_id
  AND w.last_heartbeat > NOW() - INTERVAL '5 minutes'
);

-- Batch "completed" but has failed items
SELECT j.id, j.status,
  COUNT(*) FILTER (WHERE ji.status = 'failed') as failed_count
FROM jobs j JOIN job_items ji ON ji.job_id = j.id
WHERE j.status = 'completed'
GROUP BY j.id, j.status
HAVING COUNT(*) FILTER (WHERE ji.status = 'failed') > 0;
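
These checks are easy to run automatically after each failure test. The following sketch runs the last query against an in-memory SQLite database; the two-table schema is a simplified assumption, and the query is rewritten with `SUM(CASE ...)` so it does not depend on PostgreSQL-specific syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE job_items (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL REFERENCES jobs(id),
    status TEXT NOT NULL
);
INSERT INTO jobs VALUES (1, 'completed'), (2, 'completed'), (3, 'partial');
INSERT INTO job_items (job_id, status) VALUES
    (1, 'completed'), (1, 'completed'),
    (2, 'completed'), (2, 'failed'),
    (3, 'completed'), (3, 'failed');
""")

# Batch "completed" but has failed items (portable form of the query above).
COMPLETED_WITH_FAILURES = """
SELECT j.id,
       SUM(CASE WHEN ji.status = 'failed' THEN 1 ELSE 0 END) AS failed_count
FROM jobs j JOIN job_items ji ON ji.job_id = j.id
WHERE j.status = 'completed'
GROUP BY j.id
HAVING SUM(CASE WHEN ji.status = 'failed' THEN 1 ELSE 0 END) > 0
"""


def find_impossible_states(connection):
    """Return (job_id, failed_count) rows for jobs in an impossible state."""
    return connection.execute(COMPLETED_WITH_FAILURES).fetchall()


anomalies = find_impossible_states(conn)
```

Job 2 is flagged because it is marked completed while one of its items failed; job 3 is not, because its "partial" status already reflects the failure. An empty result is the pass condition.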

Task 10: Test Observability

Objective: Verify that every failure mode produces sufficient diagnostic information.

Concrete Tests

TEST-RR-O01: Failure Diagnosis Time

For each failure mode tested above, measure: how long would it take an engineer to diagnose the root cause using only production logs and metrics?

Target: < 15 minutes from alert to root cause identification.

| Failure Mode | Alert Fired? | Time to Find in Logs | Root Cause Identifiable? | Diagnosis Time |
|-------------|-------------|---------------------|------------------------|----------------|
| Worker crash | [ ] | ___ min | [ ] | ___ min |
| API timeout | [ ] | ___ min | [ ] | ___ min |
| Storage failure | [ ] | ___ min | [ ] | ___ min |
| Queue issue | [ ] | ___ min | [ ] | ___ min |
| State corruption | [ ] | ___ min | [ ] | ___ min |

TEST-RR-O02: End-to-End Trace Completeness

Steps:

  1. Trigger a workflow that touches every component.
  2. Using only the correlation ID, reconstruct the entire flow from logs.

Pass Criteria:

  • Every step is visible in logs.
  • Correlation ID links all entries.
  • External call details (request, response, duration) are logged.
  • Failure point is unambiguous.
  • Duration breakdown is available per step.
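
A trace like this is only reconstructable if every log entry carries the correlation ID and a step duration. The sketch below shows one way to attach both using the standard `logging` module; `traced_step`, `ListHandler`, and the field names are hypothetical, and the in-memory handler stands in for a real log pipeline.

```python
import logging
import time
import uuid
from contextlib import contextmanager


class ListHandler(logging.Handler):
    """Collects log records in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("workflow_trace_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


@contextmanager
def traced_step(correlation_id, step):
    """Log one entry per step with its correlation ID, outcome, and duration."""
    start = time.monotonic()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise
    finally:
        logger.info("step finished", extra={
            "correlation_id": correlation_id,
            "step": step,
            "outcome": outcome,
            "duration_ms": (time.monotonic() - start) * 1000,
        })


cid = str(uuid.uuid4())
with traced_step(cid, "validate_input"):
    pass
try:
    with traced_step(cid, "call_provider"):
        raise TimeoutError("provider timed out")
except TimeoutError:
    pass

trace = [(r.step, r.outcome) for r in handler.records if r.correlation_id == cid]
```

Filtering the collected records by the correlation ID yields the ordered step list with the failure point marked, which is the reconstruction the test asks for.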

Full Risk Patterns Table

| # | Pattern | Category | Risk | Detection | Mitigation |
|---|---------|----------|------|-----------|------------|
| 1 | No timeout on external calls | Timeout | CRITICAL | Stuck jobs | Configure per-call timeout |
| 2 | No job-level timeout | Timeout | HIGH | Stuck jobs | Max duration per job type |
| 3 | Entire batch fails on single item error | Partial | HIGH | Lost work | Per-item error isolation |
| 4 | Retry reprocesses all items | Retry | HIGH | Wasted cost | Checkpoint + selective retry |
| 5 | Retry without idempotency | Retry | HIGH | Duplicates | Idempotency keys |
| 6 | Infinite retry loop | Retry | CRITICAL | Cost explosion | Max retry + circuit breaker |
| 7 | No checkpoint in batch processing | Resume | HIGH | Lost progress | Per-item completion tracking |
| 8 | Worker crash loses progress | Resume | HIGH | Wasted work | Durable checkpoints |
| 9 | Sequential loop without bounds | Loop | HIGH | Runaway process | Max iterations + duration |
| 10 | No error isolation in loops | Loop | HIGH | Cascading failure | Try-catch per item |
| 11 | Queue message loss | Queue | HIGH | Lost jobs | Durable queue + monitoring |
| 12 | No dead letter queue | Queue | MEDIUM | Silent failure | DLQ + alerting |
| 13 | No visibility timeout | Queue | HIGH | Duplicate processing | Appropriate timeout config |
| 14 | Stuck jobs undetected | State | HIGH | Manual DB fixes | Stuck detection + auto-timeout |
| 15 | Impossible states possible | State | HIGH | Data corruption | Transition validation + constraints |
| 16 | No correlation ID in logs | Observability | HIGH | Slow debugging | Middleware-generated trace ID |
| 17 | Generic error messages | Observability | MEDIUM | Slow debugging | Specific errors with context |
| 18 | No per-step timing | Observability | MEDIUM | Can't find bottleneck | Duration logging per step |
| 19 | Missing external call logging | Observability | HIGH | Blind to provider issues | Request/response/duration logging |
| 20 | No alerting on critical failures | Observability | HIGH | Silent incidents | Alert rules on error rate + stuck jobs |

Pass Criteria Summary

CRITICAL (Must pass for production):
[ ] Every external call has a timeout
[ ] Every job has a maximum duration
[ ] Batch processing isolates per-item errors
[ ] Retry is safe (idempotent, bounded, checkpointed)
[ ] Queue messages are durable
[ ] Stuck jobs are detected and auto-resolved
[ ] Correlation IDs trace requests end-to-end

HIGH (Should pass for reliability):
[ ] Resume from checkpoint after worker crash
[ ] Circuit breaker on cascading external failures
[ ] Sequential loops have bounds (max iterations, max duration)
[ ] Dead letter queue configured with alerting
[ ] Impossible states prevented by DB constraints
[ ] Per-step timing logged for performance diagnosis
[ ] Alerting on all critical failure modes

MEDIUM (Recommended for operational maturity):
[ ] Partial completion state exists (not binary success/fail)
[ ] Priority queue levels for interactive vs bulk jobs
[ ] Health check endpoint tests real dependencies
[ ] Runbook exists for each alert
[ ] Error messages include entity IDs and provider responses

Priority Targeting Methodology

Assess Your System

Answer these questions to prioritize which tasks to run first:

1. Do you have long-running workflows (> 30s)?
   YES -> Start with Task 1 (Map Workflows) + Task 3 (Timeouts)

2. Do workflows call external paid APIs?
   YES -> Prioritize Task 5 (Retry Safety) + Task 7 (Loop Safeguards)

3. Have users reported stuck/lost jobs?
   YES -> Prioritize Task 9 (State Accuracy) + Task 6 (Resume)

4. Is debugging production issues slow (> 30 min)?
   YES -> Prioritize Task 10 (Observability)

5. Do batch operations fail entirely on single-item errors?
   YES -> Prioritize Task 4 (Partial Completion) + Task 7 (Loop Safeguards)

6. Are jobs processed by a queue system?
   YES -> Prioritize Task 8 (Queue Resilience)

System Type Prioritization

AI/Media Pipeline:
  1. Timeout behavior (Task 3) -- expensive external calls can hang
  2. Retry safety (Task 5) -- retries cost real money
  3. Partial completion (Task 4) -- large batches must not restart from zero
  4. Observability (Task 10) -- must track provider call details

SaaS Platform:
  1. State accuracy (Task 9) -- user-facing state must be correct
  2. Queue resilience (Task 8) -- job processing is core business logic
  3. Resume behavior (Task 6) -- users expect progress to survive failures
  4. Observability (Task 10) -- multi-tenant debugging requires correlation

E-Commerce:
  1. Retry safety (Task 5) -- payment and inventory operations must be idempotent
  2. State accuracy (Task 9) -- order state drives fulfillment
  3. Timeout behavior (Task 3) -- payment gateway timeouts are common
  4. Queue resilience (Task 8) -- order processing queue is critical path

Execution Methodology

Phase 1: Discovery (1-2 days)

  • Complete Task 1 (Map Workflows)
  • Complete Task 2 (Identify High-Risk Steps)
  • Create risk-prioritized test plan

Phase 2: Core Testing (2-3 days)

  • Task 3: Timeout Behavior
  • Task 5: Retry Safety
  • Task 9: State Accuracy
  • Task 10: Observability

Phase 3: Deep Testing (2-3 days)

  • Task 4: Partial Completion
  • Task 6: Resume Behavior
  • Task 7: Sequential Loop Safeguards
  • Task 8: Queue Resilience

Phase 4: Remediation (ongoing)

  • Fix CRITICAL findings immediately
  • Schedule HIGH findings for next sprint
  • Track MEDIUM findings in backlog

What Earlier Audits Miss

Standard reliability testing verifies that the system handles known error cases. This audit matters because:

  • Unit tests exercise individual components in isolation. They rarely cover what happens when one component fails while another depends on it.
  • Integration tests run in controlled environments. They rarely simulate network partitions, worker crashes, or queue message loss.
  • Load tests verify performance under sustained throughput. They typically do not inject faults during load.
  • The other 10 audits in this pack each cover a specific concern. This audit ties them together and tests the interactions between them.
  • Chaos engineering is often skipped because teams consider it "too risky for staging." This audit provides structured, safe fault injection tests.

In short, a Reliability & Resilience Audit tests whether the system produces correct results and recovers automatically under component failures, network issues, timeout conditions, and queue problems.


Automation Opportunities

| Test | Automatable? | Method |
|------|-------------|--------|
| TEST-RR-T01: API timeout | YES | Mock slow external service; assert timeout and cleanup |
| TEST-RR-T02: Job timeout | YES | Inject hanging dependency; assert job-level timeout fires |
| TEST-RR-P01: Batch partial failure | YES | Inject failure at item N; assert partial success state |
| TEST-RR-P02: Systemic failure | YES | Kill mock API mid-batch; assert circuit breaker |
| TEST-RR-R01: Retry single | YES | Fail one item, retry, assert only that item reprocessed |
| TEST-RR-R03: Retry idempotency | YES | Double-retry; assert single execution |
| TEST-RR-RE01: Worker crash | YES | Kill worker process; restart; assert resume from checkpoint |
| TEST-RR-SL01: Loop bounds | YES | Feed infinite pagination; assert loop terminates |
| TEST-RR-Q01: Queue durability | YES | Restart queue; assert messages preserved |
| TEST-RR-S01: State accuracy | YES | Run failure scenarios; query DB for impossible states |
| TEST-RR-O01: Diagnosis time | MANUAL | Simulate incident; measure time to root cause |

Reusable Audit Report Template

# Reliability & Resilience Audit Report

## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________

## Workflow Inventory (Task 1)
| Workflow | Steps | Duration | External Dependencies | Risk |
|----------|-------|---------|----------------------|------|
| ___ | ___ | ___s | ___ | HIGH/MEDIUM/LOW |

## Test Results by Task
| Task | Tests Run | Passed | Failed | Critical Findings |
|------|----------|--------|--------|-------------------|
| 3. Timeouts | ___ | ___ | ___ | ___ |
| 4. Partial completion | ___ | ___ | ___ | ___ |
| 5. Retry safety | ___ | ___ | ___ | ___ |
| 6. Resume | ___ | ___ | ___ | ___ |
| 7. Loop safeguards | ___ | ___ | ___ | ___ |
| 8. Queue resilience | ___ | ___ | ___ | ___ |
| 9. State accuracy | ___ | ___ | ___ | ___ |
| 10. Observability | ___ | ___ | ___ | ___ |

## Overall Score: PASS / PARTIAL / FAIL

Post-Audit Deliverables

1. Workflow inventory (Task 1 output)
2. Risk assessment matrix (Task 2 output)
3. Test results per task (pass/partial/fail with evidence)
4. Finding severity classification (CRITICAL/HIGH/MEDIUM)
5. Remediation recommendations with effort estimates
6. Retest plan (verify fixes)
