Reliability & Resilience Audit
Purpose
This is the most comprehensive audit in the production audit pack. It tests long-running, distributed workflows under real-world conditions: timeouts, partial failures, retries, queue issues, state corruption, and observability gaps. This audit encompasses and extends all other audits in the pack, providing a unified methodology for verifying production reliability.
A system is reliable when it produces correct results even when components fail. A system is resilient when it recovers from failures without human intervention. This audit verifies both.
The 10 Audit Tasks
This audit follows a structured 10-task methodology that covers every reliability concern in a distributed system:
1. Map every long-running workflow
2. Identify high-risk steps
3. Test timeout behavior
4. Test partial completion
5. Test retry safety
6. Test resume behavior
7. Test sequential loop safeguards
8. Test queue resilience
9. Test state accuracy
10. Test observability
Task 1: Map Every Long-Running Workflow
Objective: Create a complete inventory of every workflow that takes more than 5 seconds or involves multiple steps.
Methodology
- List every user-triggerable operation that involves background processing.
- For each, document the complete step sequence.
- Identify external dependencies at each step.
Workflow Inventory Template
WORKFLOW: Asset Generation Pipeline
Trigger: User clicks "Generate" on project
Expected Duration: 30s - 5min (depending on asset count)
Steps:
1. Validate input (< 1s) [Internal]
2. Check user quota (< 1s) [Database]
3. Create job record (< 1s) [Database]
4. Enqueue generation tasks (< 1s) [Queue]
5. For each asset:
   5a. Call AI provider (5-30s) [External: OpenAI/Anthropic]
   5b. Upload result to storage (1-5s) [External: GCS/S3]
   5c. Generate thumbnail (1-3s) [Internal/External]
   5d. Update asset record (< 1s) [Database]
6. Update job status (< 1s) [Database]
7. Notify user (< 1s) [External: Email/Push]
Total steps: 7 (+ N * 4 for N assets)
External dependencies: AI Provider, Cloud Storage, Notification Service
Failure modes: Provider timeout, storage failure, quota exceeded, queue down
WORKFLOW: Project Export
Trigger: User clicks "Export Project"
Expected Duration: 10s - 2min
Steps:
1. Validate permissions (< 1s) [Internal]
2. Gather project metadata (< 1s) [Database]
3. Download all asset files (1-60s) [External: Storage]
4. Package into archive (1-30s) [Internal: Compute]
5. Upload archive to temp storage (1-10s) [External: Storage]
6. Generate download URL (< 1s) [Internal]
7. Notify user (< 1s) [External: Email/Push]
WORKFLOW: [Document each workflow in your system]
Completeness Check
[ ] All user-triggered background operations listed
[ ] All scheduled/cron jobs listed
[ ] All webhook-triggered workflows listed
[ ] All admin-triggered operations listed
[ ] Each workflow has: step sequence, timing, dependencies
[ ] External dependencies highlighted for each step
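The inventory template can also be kept as data, which makes it easier to list external dependencies and estimate worst-case duration automatically. The following is a minimal sketch using the Project Export workflow from above; the `Step` and `Workflow` classes and their field names are illustrative assumptions, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    max_seconds: float      # upper bound of the expected duration
    dependency: str         # e.g. "Internal", "Database", "External: Storage"
    per_item: bool = False  # True for steps repeated once per item

    @property
    def is_external(self) -> bool:
        return self.dependency.startswith("External")


@dataclass
class Workflow:
    name: str
    trigger: str
    steps: list[Step] = field(default_factory=list)

    def external_dependencies(self) -> set[str]:
        """Dependencies of every step marked as external."""
        return {s.dependency for s in self.steps if s.is_external}

    def worst_case_seconds(self, item_count: int = 1) -> float:
        """Sum of per-step upper bounds, repeating per-item steps."""
        return sum(
            s.max_seconds * (item_count if s.per_item else 1) for s in self.steps
        )


export = Workflow(
    name="Project Export",
    trigger='User clicks "Export Project"',
    steps=[
        Step("Validate permissions", 1, "Internal"),
        Step("Gather project metadata", 1, "Database"),
        Step("Download all asset files", 60, "External: Storage"),
        Step("Package into archive", 30, "Internal: Compute"),
        Step("Upload archive to temp storage", 10, "External: Storage"),
        Step("Generate download URL", 1, "Internal"),
        Step("Notify user", 1, "External: Email/Push"),
    ],
)
```

Calling `export.external_dependencies()` gives the set of services to highlight for Task 2, and `export.worst_case_seconds()` gives a starting point for the job-level timeout discussed in Task 3.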
Task 2: Identify High-Risk Steps
Objective: For each workflow, identify the steps that are most likely to fail and whose failure has the highest impact.
Risk Assessment Matrix
| Step | Failure Probability | Impact if Fails | Detection Difficulty | Risk Score |
|---|---|---|---|---|
| AI provider call | HIGH (timeout, rate limit, content filter) | HIGH (no output) | LOW (error returned) | CRITICAL |
| Storage upload | MEDIUM (transient 503) | HIGH (output lost) | LOW (error returned) | HIGH |
| Database write | LOW (connection issues) | HIGH (state lost) | LOW (error returned) | MEDIUM |
| Queue enqueue | LOW (queue down) | HIGH (job lost) | MEDIUM (may not error) | HIGH |
| Notification send | MEDIUM (rate limit, bounce) | LOW (not blocking) | HIGH (fire-and-forget) | MEDIUM |
| Thumbnail generation | MEDIUM (OOM on large files) | LOW (degraded UX) | MEDIUM (may crash silently) | MEDIUM |
High-Risk Step Indicators
A step is HIGH-RISK if:
[ ] It calls an external service (network dependency)
[ ] It takes more than 10 seconds (timeout vulnerability)
[ ] It modifies state that is hard to undo (irreversible)
[ ] It processes user-provided input (unpredictable size/format)
[ ] It runs in a loop (N items = N chances to fail)
[ ] Its failure is hard to detect (fire-and-forget, no response check)
[ ] It has no retry mechanism
[ ] It has no timeout configured
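The indicator checklist above can be applied mechanically once each step is described as data. This is a rough sketch; the `StepProfile` fields are hypothetical names chosen to mirror the checklist, and the 10-second threshold is the one stated above.

```python
from dataclasses import dataclass


@dataclass
class StepProfile:
    name: str
    calls_external_service: bool = False
    max_seconds: float = 0.0
    irreversible: bool = False
    handles_user_input: bool = False
    runs_in_loop: bool = False
    fire_and_forget: bool = False
    has_retry: bool = True
    has_timeout: bool = True


def risk_indicators(step: StepProfile) -> list[str]:
    """Return the checklist indicators that apply to this step."""
    checks = [
        (step.calls_external_service, "calls an external service"),
        (step.max_seconds > 10, "takes more than 10 seconds"),
        (step.irreversible, "modifies state that is hard to undo"),
        (step.handles_user_input, "processes user-provided input"),
        (step.runs_in_loop, "runs in a loop"),
        (step.fire_and_forget, "failure is hard to detect"),
        (not step.has_retry, "has no retry mechanism"),
        (not step.has_timeout, "has no timeout configured"),
    ]
    return [label for hit, label in checks if hit]


ai_call = StepProfile(
    name="Call AI provider",
    calls_external_service=True,
    max_seconds=30,
    runs_in_loop=True,
    has_retry=False,
)
indicators = risk_indicators(ai_call)
```

Any step with one or more indicators qualifies as HIGH-RISK under the checklist; the number of indicators is a simple way to order the test plan.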
Task 3: Test Timeout Behavior
Objective: Verify that every long-running step has appropriate timeout configuration and handles timeout gracefully.
Concrete Tests
TEST-RR-T01: External API Call Timeout
Steps:
- Configure a mock external service that delays 120 seconds.
- Trigger the workflow.
- Observe: does the call timeout? After how long?
- What happens after timeout?
Pass Criteria:
- Timeout is configured (not waiting indefinitely).
- Timeout is appropriate for the call type (e.g., 30s for API calls, 300s for heavy processing).
- On timeout: error is logged with specific "timeout" classification.
- On timeout: job moves to retry or failed state (not stuck in "processing").
- On timeout: resources are cleaned up (connections released, temp files removed).
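One way to meet these criteria is to wrap each external call in a hard timeout and update the job state in the same place. The sketch below uses `asyncio.wait_for` from the standard library; the job dictionary, the `error_class` field, and `slow_provider` are hypothetical stand-ins for your own job model and mock service.

```python
import asyncio
import logging

log = logging.getLogger("audit.timeout")

API_CALL_TIMEOUT_SECONDS = 30  # example value from the pass criteria


async def call_with_timeout(job, call, timeout=API_CALL_TIMEOUT_SECONDS):
    """Run an external call with a hard timeout and record the outcome on the job."""
    job["status"] = "processing"
    try:
        # wait_for cancels the pending call when the timeout expires
        result = await asyncio.wait_for(call(), timeout=timeout)
    except asyncio.TimeoutError:
        # Log with a specific classification so timeouts are searchable
        log.error("job %s: error_class=timeout after %.2fs", job["id"], timeout)
        job["status"] = "retrying"  # never left stuck in "processing"
        job["error_class"] = "timeout"
        return None
    job["status"] = "completed"
    return result


async def slow_provider():
    """Hypothetical mock of an external service that responds too slowly."""
    await asyncio.sleep(0.5)
    return "response that arrives too late"


job = {"id": "job-1", "status": "queued"}
result = asyncio.run(call_with_timeout(job, slow_provider, timeout=0.05))
```

In a real test the mock would delay 120 seconds and the timeout would be the production value; the short durations here only keep the example fast.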
TEST-RR-T02: Job-Level Timeout
Steps:
- Start a job that will process 20 items.
- Make one item hang indefinitely (mock a non-responding dependency).
- Observe: does the overall job timeout?
Pass Criteria:
- Job-level timeout exists (e.g., 30 minutes maximum).
- Timeout is independent of individual step timeouts.
- On job timeout: completed items are preserved.
- On job timeout: job state changes to "timeout" or "failed".
- On job timeout: alert fires.
Timeout Configuration Audit
| Component | Timeout | Default | Configured | Appropriate? |
|-----------|---------|---------|-----------|-------------|
| HTTP client (outgoing) | Connect | 5s | ? | |
| HTTP client (outgoing) | Read | 30s | ? | |
| HTTP server (incoming) | Request | 60s | ? | |
| Database query | Statement | 30s | ? | |
| Database connection | Acquire | 5s | ? | |
| Queue job | Processing | 300s | ? | |
| Queue job | Visibility | 60s | ? | |
| Worker | Heartbeat | 30s | ? | |
| Storage upload | Per-file | 120s | ? | |
| Webhook delivery | Per-attempt | 10s | ? | |
Task 4: Test Partial Completion
Objective: Verify that partial completion is handled correctly when a batch fails partway through.
Concrete Tests
TEST-RR-P01: Batch Fails at Item 7 of 20
Steps:
- Start a batch of 20 items.
- Cause item 7 to fail (bad input for that specific item).
- Observe: do items 8-20 still process?
Pass Criteria:
- Items 1-6: completed successfully, results preserved.
- Item 7: marked as failed with specific error.
- Items 8-20: continue processing (not blocked by item 7).
- Batch status: "partial" (not "failed" for entire batch).
- User can see: 19/20 succeeded, 1/20 failed.
- User can retry just item 7.
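The behavior this test expects can be sketched as a batch loop that catches errors per item and derives a three-way batch status. The function and field names below are illustrative assumptions.

```python
def process_batch(items, process_item):
    """Process every item, isolating failures so one bad item cannot block the rest."""
    results = {}
    for item in items:
        try:
            process_item(item)
            results[item] = {"status": "completed", "error": None}
        except Exception as exc:
            # Record the specific error and keep going
            results[item] = {"status": "failed", "error": str(exc)}

    failed = [i for i, r in results.items() if r["status"] == "failed"]
    if not failed:
        status = "completed"
    elif len(failed) == len(results):
        status = "failed"
    else:
        status = "partial"
    return {"status": status, "items": results, "failed": failed}


def fake_process(item):
    """Hypothetical processor that rejects item 7 only."""
    if item == 7:
        raise ValueError("bad input for item 7")


batch = process_batch(range(1, 21), fake_process)
```

The `failed` list is what a "retry just item 7" action would consume, so completed items never need to be touched again.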
TEST-RR-P02: Batch Fails at Item 7 Due to Systemic Issue
Steps:
- Start a batch of 20 items.
- After item 6, kill the external API (all remaining items will fail).
- Observe: does the batch keep trying all 14 remaining items?
Pass Criteria:
- Circuit breaker trips after 3-5 consecutive failures.
- Remaining items are NOT attempted (fast-fail).
- Items 1-6: completed, preserved.
- Items 7-9: failed (attempted before circuit break).
- Items 10-20: skipped / queued for retry (not attempted during outage).
- User notification: "Generation paused: AI provider is unavailable. 6/20 complete. Will retry automatically."
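A simple consecutive-failure counter is enough to reproduce the expected outcome of this test. The sketch below assumes a threshold of 3 (the low end of the 3-5 range above); `ProviderDown` and `flaky_provider` are hypothetical names for the simulated outage.

```python
class ProviderDown(Exception):
    """Raised by the simulated provider once the outage starts."""


def process_with_circuit_breaker(items, process_item, max_consecutive_failures=3):
    """Stop calling the provider after repeated consecutive failures."""
    outcome = {"completed": [], "failed": [], "skipped": []}
    consecutive = 0
    for item in items:
        if consecutive >= max_consecutive_failures:
            # Circuit is open: fast-fail without attempting the call
            outcome["skipped"].append(item)
            continue
        try:
            process_item(item)
            outcome["completed"].append(item)
            consecutive = 0  # any success closes the streak
        except Exception:
            outcome["failed"].append(item)
            consecutive += 1
    outcome["circuit_open"] = consecutive >= max_consecutive_failures
    return outcome


def flaky_provider(item):
    """Simulated external API that goes down after item 6."""
    if item > 6:
        raise ProviderDown("provider unavailable")


outcome = process_with_circuit_breaker(range(1, 21), flaky_provider)
```

Skipped items are the ones to re-queue for automatic retry once the provider recovers; a production breaker would usually also add a cool-down before probing the provider again.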
Task 5: Test Retry Safety
Objective: Verify that retrying failed operations is safe and produces correct results.
Concrete Tests
TEST-RR-R01: Retry Single Failed Item
Steps:
- Complete a batch with 1 failed item.
- Click "Retry" on the failed item.
- Verify: only the failed item is reprocessed.
Pass Criteria:
- Only the failed item is sent to the external API.
- Successfully completed items are NOT reprocessed.
- Retry uses the same parameters as the original attempt.
- On success: item moves to "completed", batch to "completed".
- On failure: retry count incremented, error updated.
- Max retry limit enforced (no infinite retry).
TEST-RR-R02: Retry Entire Batch
Steps:
- Complete a batch with 5 failed items.
- Click "Retry All Failed."
- Verify: only 5 items are reprocessed.
Pass Criteria:
- Exactly 5 API calls made (not 20).
- Completed items untouched.
- Each retried item has retry_count incremented.
TEST-RR-R03: Retry Idempotency
Steps:
- Retry a failed item.
- Retry the same item again before the first retry completes.
Pass Criteria:
- Second retry is blocked ("Retry already in progress").
- Only one retry executes.
- No duplicate API calls.
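Blocking a concurrent second retry comes down to an atomic "claim" on the item before any work starts. The sketch below does this in-process with a lock; in a multi-worker system the same check would need to live in shared storage (for example a conditional database update), which is an assumption beyond what this example shows. `RetryGuard` is a hypothetical name.

```python
import threading


class RetryGuard:
    """Allows at most one in-flight retry per item and enforces a retry limit."""

    def __init__(self, max_retries: int = 3):
        self.max_retries = max_retries
        self._lock = threading.Lock()
        self._in_progress: set[str] = set()
        self._retry_counts: dict[str, int] = {}

    def try_start(self, item_id: str) -> tuple[bool, str]:
        """Atomically claim the item for retry, or explain why it was refused."""
        with self._lock:
            if item_id in self._in_progress:
                return False, "Retry already in progress"
            if self._retry_counts.get(item_id, 0) >= self.max_retries:
                return False, "Max retries exceeded"
            self._in_progress.add(item_id)
            self._retry_counts[item_id] = self._retry_counts.get(item_id, 0) + 1
            return True, "Retry started"

    def finish(self, item_id: str) -> None:
        """Release the claim once the retry has succeeded or failed."""
        with self._lock:
            self._in_progress.discard(item_id)


guard = RetryGuard(max_retries=2)
first = guard.try_start("item-7")
second = guard.try_start("item-7")  # arrives before the first retry completes
guard.finish("item-7")
third = guard.try_start("item-7")
guard.finish("item-7")
fourth = guard.try_start("item-7")
```

The same guard covers two pass criteria: the duplicate request is refused while a retry is running, and the retry limit prevents unbounded attempts.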
Task 6: Test Resume Behavior
Objective: Verify that interrupted workflows resume from the correct point.
Concrete Tests
TEST-RR-RE01: Worker Crash Mid-Batch
Steps:
- Start a 20-item batch.
- Wait until 7 items complete.
- Kill the worker process.
- Start a new worker.
- Observe: does the batch resume from item 8?
Pass Criteria:
- Items 1-7 preserved (not reprocessed).
- Processing resumes from item 8.
- Total external API calls = 20 (not 20 + 7 duplicates).
- Final result: all 20 items completed.
- No user intervention required.
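Resuming from item 8 requires that completion is recorded per item, immediately after each item finishes. The following sketch simulates the crash and restart; the `completed` set stands in for a durable checkpoint store, and `WorkerCrash` and `crash_after` are hypothetical test hooks.

```python
class WorkerCrash(Exception):
    """Simulates the worker process being killed."""


def run_batch(items, completed, call_api, crash_after=None):
    """Process items that are not yet checkpointed in `completed`."""
    processed_this_run = 0
    for item in items:
        if item in completed:
            continue  # already done by a previous worker; do not reprocess
        if crash_after is not None and processed_this_run == crash_after:
            raise WorkerCrash("worker killed mid-batch")
        call_api(item)
        completed.add(item)  # checkpoint right after each item
        processed_this_run += 1


api_calls = []
completed = set()
items = list(range(1, 21))

try:
    run_batch(items, completed, api_calls.append, crash_after=7)
except WorkerCrash:
    pass  # the first worker dies after finishing 7 items

run_batch(items, completed, api_calls.append)  # a new worker picks up the batch
```

Note that a crash between the API call and the checkpoint would still repeat that one item on resume, which is why the external call itself should be idempotent.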
TEST-RR-RE02: Server Restart Mid-Request
Steps:
- Send an API request for a long operation.
- Restart the API server.
- Client receives connection error.
- Client retries the request.
Pass Criteria:
- If operation was committed: retry returns existing result.
- If operation was not committed: retry re-executes.
- No duplicate side effects.
- Client can detect server restart (health check endpoint).
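The first three criteria are commonly met with a client-supplied idempotency key: the server stores the result under the key when the operation commits and returns the stored result on any retry. This is a minimal in-memory sketch; `IdempotentExecutor` is a hypothetical name, and the dictionary stands in for a durable table.

```python
class IdempotentExecutor:
    """Runs an operation at most once per idempotency key."""

    def __init__(self):
        self._results = {}   # stand-in for a durable table keyed by idempotency key
        self.executions = 0  # counts real executions, for verification

    def execute(self, idempotency_key, operation):
        if idempotency_key in self._results:
            # Operation was already committed: return the existing result
            return self._results[idempotency_key]
        result = operation()
        self.executions += 1
        self._results[idempotency_key] = result
        return result


executor = IdempotentExecutor()
first_response = executor.execute("export-42", lambda: {"export_id": 42})
retry_response = executor.execute("export-42", lambda: {"export_id": 42})
```

In a real service the result write and the operation's side effects would need to commit together, otherwise a restart between them reintroduces the duplicate.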
Task 7: Test Sequential Loop Safeguards
Objective: Verify that operations processing items in a loop have safeguards against runaway behavior.
Concrete Tests
TEST-RR-SL01: Loop Does Not Run Forever
Steps:
- Start a processing loop with a known count of items.
- Verify: loop terminates after processing all items.
- Test: what if the item source keeps returning items? (Pagination bug, always has_more=true)
Pass Criteria:
- Loop has a maximum iteration limit (e.g., 10,000).
- Loop has a maximum duration limit (e.g., 30 minutes).
- Either limit triggers: stop processing, log warning, alert.
- Pagination has a known endpoint (empty page = stop).
TEST-RR-SL02: Per-Item Error Isolation
Steps:
- Process a loop of 20 items where item 5 throws an unexpected exception.
- Observe: does item 5's error crash the entire loop?
Pass Criteria:
- Item 5 is caught, logged, and marked as failed.
- Items 6-20 continue processing.
- The loop does not exit on the first error.
- Error count is tracked; if > threshold (e.g., 50%), loop stops early.
Loop Safety Template
from datetime import datetime, timedelta
import logging

log = logging.getLogger(__name__)

MAX_ITERATIONS = 10000
MAX_DURATION = timedelta(minutes=30)
MAX_CONSECUTIVE_ERRORS = 5

# get_items, process_item, and record_item_failure are supplied by your application.
start_time = datetime.now()
consecutive_errors = 0
processed = 0

for item in get_items():
    # Safeguard: max iterations
    processed += 1
    if processed > MAX_ITERATIONS:
        log.error(f"Loop exceeded max iterations: {MAX_ITERATIONS}")
        break

    # Safeguard: max duration
    if datetime.now() - start_time > MAX_DURATION:
        log.error(f"Loop exceeded max duration: {MAX_DURATION}")
        break

    # Safeguard: consecutive errors (per-item isolation plus circuit break)
    try:
        process_item(item)
        consecutive_errors = 0  # Reset on success
    except Exception as e:
        consecutive_errors += 1
        log.error(f"Item {item.id} failed: {e}")
        record_item_failure(item, e)
        if consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
            log.error(f"Circuit break: {MAX_CONSECUTIVE_ERRORS} consecutive errors")
            break
Task 8: Test Queue Resilience
Objective: Verify that the job queue handles failure modes gracefully.
Concrete Tests
TEST-RR-Q01: Queue Message Loss
Steps:
- Enqueue a job.
- Simulate queue restart/failure before worker picks up the job.
- After queue recovers, check: is the job still in the queue?
Pass Criteria:
- Queue messages are durable (persisted to disk, not just memory).
- Queue recovery preserves unacknowledged messages.
- No message loss during queue restart.
TEST-RR-Q02: Worker Failure to Acknowledge
Steps:
- Worker picks up a job.
- Worker crashes before acknowledging (completing) the job.
- Observe: is the job redelivered?
Pass Criteria:
- Job is redelivered after visibility timeout.
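- Visibility timeout is longer than the maximum job processing time (or the worker extends it while the job is running), so a job is not redelivered while it is still being processed.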
- Redelivered job is processed idempotently (no duplicate work).
- Dead letter queue catches jobs that fail repeatedly (after N redeliveries).
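The redelivery and dead-letter behavior can be reasoned about with a small in-memory model before testing against the real queue. The sketch below is a simulation only; `SimulatedQueue` and its methods are hypothetical and do not correspond to any specific queue product's API.

```python
from collections import deque


class SimulatedQueue:
    """In-memory model of an at-least-once queue with a dead letter queue."""

    def __init__(self, max_deliveries: int = 3):
        self.max_deliveries = max_deliveries
        self.ready = deque()
        self.in_flight = set()
        self.dead_letter = []
        self.deliveries = {}  # job_id -> number of times delivered

    def enqueue(self, job_id):
        self.ready.append(job_id)

    def receive(self):
        """Hand the next job to a worker and start its visibility window."""
        job_id = self.ready.popleft()
        self.deliveries[job_id] = self.deliveries.get(job_id, 0) + 1
        self.in_flight.add(job_id)
        return job_id

    def ack(self, job_id):
        """Worker finished the job; remove it permanently."""
        self.in_flight.discard(job_id)

    def visibility_timeout_expired(self, job_id):
        """Worker never acknowledged: redeliver, or dead-letter after N deliveries."""
        self.in_flight.discard(job_id)
        if self.deliveries[job_id] >= self.max_deliveries:
            self.dead_letter.append(job_id)
        else:
            self.ready.append(job_id)


queue = SimulatedQueue(max_deliveries=3)
queue.enqueue("job-1")
while queue.ready:
    job_id = queue.receive()
    # The worker crashes every time, so no ack is ever sent
    queue.visibility_timeout_expired(job_id)
```

After three failed deliveries the job lands in the dead letter queue instead of cycling forever, which is the condition the DLQ alert should fire on.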
TEST-RR-Q03: Queue Backpressure
Steps:
- Flood the queue with 1000 jobs.
- Observe worker behavior.
- Add more jobs while the 1000 are processing.
Pass Criteria:
- Workers process at sustainable rate (not crashing from overload).
- New jobs are accepted and queued (not rejected).
- Queue depth is monitored and alerted.
- Priority jobs are processed before bulk jobs.
- Memory does not grow unbounded on workers.
Queue Architecture Audit
| Aspect | Configuration | Status |
|--------|-------------|--------|
| Persistence | [ ] Durable [ ] In-memory | |
| Delivery guarantee | [ ] At-least-once [ ] Exactly-once | |
| Visibility timeout | ___ seconds | |
| Max redeliveries | ___ attempts | |
| Dead letter queue | [ ] Configured [ ] Not configured | |
| Priority levels | [ ] Yes (___ levels) [ ] No | |
| Backpressure | [ ] Max queue depth [ ] Max concurrent | |
| Monitoring | [ ] Queue depth [ ] Processing rate [ ] Error rate | |
| Alerting | [ ] Queue growing [ ] DLQ items [ ] No workers | |
Task 9: Test State Accuracy
Objective: Verify that system state accurately reflects reality at all times, especially after failures.
Concrete Tests
TEST-RR-S01: State After Each Failure Mode
For each failure mode, verify the resulting state is accurate:
| Failure Mode | Expected State | Actual State | Accurate? |
|-------------|---------------|-------------|-----------|
| Worker crash during processing | timeout/failed (after timeout) | | [ ] |
| API call timeout | failed/retrying | | [ ] |
| Storage upload failure | failed (asset), processing (job) | | [ ] |
| Queue message lost | stuck detection catches it | | [ ] |
| Partial batch completion | partial (job), mix of complete/failed (items) | | [ ] |
| User cancellation during processing | cancelled (job + remaining items) | | [ ] |
| Server restart during request | queued/failed (depends on commit point) | | [ ] |
TEST-RR-S02: No "Impossible" States After Failure
Steps:
- After each failure test above, check for impossible states.
- Query the database for anomalies.
Impossible State Queries:
-- Jobs stuck in processing (should have timed out)
SELECT * FROM jobs
WHERE status = 'processing'
  AND updated_at < NOW() - INTERVAL '2 hours';

-- Completed jobs with no output
SELECT * FROM jobs
WHERE status = 'completed'
  AND NOT EXISTS (
    SELECT 1 FROM assets
    WHERE job_id = jobs.id AND status = 'ready'
  );

-- Items processing with no active worker
-- (NOT EXISTS also catches items whose worker_id is NULL, which NOT IN would miss)
SELECT * FROM job_items ji
WHERE ji.status = 'processing'
  AND NOT EXISTS (
    SELECT 1 FROM workers w
    WHERE w.id = ji.worker_id
      AND w.last_heartbeat > NOW() - INTERVAL '5 minutes'
  );

-- Batch "completed" but has failed items
SELECT j.id, j.status,
       COUNT(*) FILTER (WHERE ji.status = 'failed') AS failed_count
FROM jobs j JOIN job_items ji ON ji.job_id = j.id
WHERE j.status = 'completed'
GROUP BY j.id, j.status
HAVING COUNT(*) FILTER (WHERE ji.status = 'failed') > 0;
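Queries like these detect impossible states after the fact; validating transitions at write time can prevent some of them from being stored at all. The sketch below assumes a state machine built from the status names used in this audit; the exact transition map is a hypothetical example and should be replaced with your system's real rules.

```python
# Hypothetical transition map; adjust to match your job model.
ALLOWED_TRANSITIONS = {
    "queued": {"processing", "cancelled"},
    "processing": {"completed", "partial", "failed", "timeout", "cancelled"},
    "partial": {"queued"},  # retry of failed items
    "failed": {"queued"},   # retry
    "timeout": {"queued"},  # retry
    "completed": set(),     # terminal
    "cancelled": set(),     # terminal
}


class InvalidTransition(Exception):
    """Raised when a status change would create an impossible state."""


def transition(job, new_status, failed_items=0):
    """Apply a status change only if the state machine allows it."""
    current = job["status"]
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current} -> {new_status} is not allowed")
    if new_status == "completed" and failed_items > 0:
        raise InvalidTransition("a job with failed items must be 'partial'")
    job["status"] = new_status
    return job


job = {"id": "job-1", "status": "queued"}
transition(job, "processing")

try:
    transition(job, "completed", failed_items=1)
    rejected = False
except InvalidTransition:
    rejected = True

transition(job, "partial", failed_items=1)
```

The second rule mirrors the last query above: a batch with failed items can never be written as "completed", so that query should always return zero rows.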
Task 10: Test Observability
Objective: Verify that every failure mode produces sufficient diagnostic information.
Concrete Tests
TEST-RR-O01: Failure Diagnosis Time
For each failure mode tested above, measure: how long would it take an engineer to diagnose the root cause using only production logs and metrics?
Target: < 15 minutes from alert to root cause identification.
| Failure Mode | Alert Fired? | Time to Find in Logs | Root Cause Identifiable? | Diagnosis Time |
|-------------|-------------|---------------------|------------------------|----------------|
| Worker crash | [ ] | ___ min | [ ] | ___ min |
| API timeout | [ ] | ___ min | [ ] | ___ min |
| Storage failure | [ ] | ___ min | [ ] | ___ min |
| Queue issue | [ ] | ___ min | [ ] | ___ min |
| State corruption | [ ] | ___ min | [ ] | ___ min |
TEST-RR-O02: End-to-End Trace Completeness
Steps:
- Trigger a workflow that touches every component.
- Using only the correlation ID, reconstruct the entire flow from logs.
Pass Criteria:
- Every step is visible in logs.
- Correlation ID links all entries.
- External call details (request, response, duration) are logged.
- Failure point is unambiguous.
- Duration breakdown is available per step.
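A common way to satisfy the correlation ID criteria is to set the ID once at the start of the workflow and have the logging layer attach it to every record. The sketch below uses only the standard library (`logging` and `contextvars`); the logger name, log line format, and `ListHandler` are illustrative assumptions, and the logged durations are made-up sample values.

```python
import contextvars
import logging
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


class ListHandler(logging.Handler):
    """Collects formatted log lines in memory so the trace can be inspected."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


handler = ListHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))

log = logging.getLogger("workflow")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False


def run_workflow():
    """Hypothetical workflow that logs each step with its duration."""
    cid = uuid.uuid4().hex
    correlation_id.set(cid)
    log.info("step=validate duration_ms=3")
    log.info("step=provider_call duration_ms=8123 status=200")
    log.info("step=upload duration_ms=950")
    return cid


cid = run_workflow()
trace = [line for line in handler.lines if line.startswith(cid)]
```

Filtering the collected lines by `cid` is the in-memory equivalent of the test step above: if any step is missing from `trace`, that step is not linked to the request.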
Full Risk Patterns Table
| # | Pattern | Category | Risk | Detection | Mitigation |
|---|---|---|---|---|---|
| 1 | No timeout on external calls | Timeout | CRITICAL | Stuck jobs | Configure per-call timeout |
| 2 | No job-level timeout | Timeout | HIGH | Stuck jobs | Max duration per job type |
| 3 | Entire batch fails on single item error | Partial | HIGH | Lost work | Per-item error isolation |
| 4 | Retry reprocesses all items | Retry | HIGH | Wasted cost | Checkpoint + selective retry |
| 5 | Retry without idempotency | Retry | HIGH | Duplicates | Idempotency keys |
| 6 | Infinite retry loop | Retry | CRITICAL | Cost explosion | Max retry + circuit breaker |
| 7 | No checkpoint in batch processing | Resume | HIGH | Lost progress | Per-item completion tracking |
| 8 | Worker crash loses progress | Resume | HIGH | Wasted work | Durable checkpoints |
| 9 | Sequential loop without bounds | Loop | HIGH | Runaway process | Max iterations + duration |
| 10 | No error isolation in loops | Loop | HIGH | Cascading failure | Try-catch per item |
| 11 | Queue message loss | Queue | HIGH | Lost jobs | Durable queue + monitoring |
| 12 | No dead letter queue | Queue | MEDIUM | Silent failure | DLQ + alerting |
| 13 | No visibility timeout | Queue | HIGH | Duplicate processing | Appropriate timeout config |
| 14 | Stuck jobs undetected | State | HIGH | Manual DB fixes | Stuck detection + auto-timeout |
| 15 | Impossible states possible | State | HIGH | Data corruption | Transition validation + constraints |
| 16 | No correlation ID in logs | Observability | HIGH | Slow debugging | Middleware-generated trace ID |
| 17 | Generic error messages | Observability | MEDIUM | Slow debugging | Specific errors with context |
| 18 | No per-step timing | Observability | MEDIUM | Can't find bottleneck | Duration logging per step |
| 19 | Missing external call logging | Observability | HIGH | Blind to provider issues | Request/response/duration logging |
| 20 | No alerting on critical failures | Observability | HIGH | Silent incidents | Alert rules on error rate + stuck jobs |
Pass Criteria Summary
CRITICAL (Must pass for production):
[ ] Every external call has a timeout
[ ] Every job has a maximum duration
[ ] Batch processing isolates per-item errors
[ ] Retry is safe (idempotent, bounded, checkpointed)
[ ] Queue messages are durable
[ ] Stuck jobs are detected and auto-resolved
[ ] Correlation IDs trace requests end-to-end
HIGH (Should pass for reliability):
[ ] Resume from checkpoint after worker crash
[ ] Circuit breaker on cascading external failures
[ ] Sequential loops have bounds (max iterations, max duration)
[ ] Dead letter queue configured with alerting
[ ] Impossible states prevented by DB constraints
[ ] Per-step timing logged for performance diagnosis
[ ] Alerting on all critical failure modes
MEDIUM (Recommended for operational maturity):
[ ] Partial completion state exists (not binary success/fail)
[ ] Priority queue levels for interactive vs bulk jobs
[ ] Health check endpoint tests real dependencies
[ ] Runbook exists for each alert
[ ] Error messages include entity IDs and provider responses
Priority Targeting Methodology
Assess Your System
Answer these questions to prioritize which tasks to run first:
1. Do you have long-running workflows (> 30s)?
YES -> Start with Task 1 (Map Workflows) + Task 3 (Timeouts)
2. Do workflows call external paid APIs?
YES -> Prioritize Task 5 (Retry Safety) + Task 7 (Loop Safeguards)
3. Have users reported stuck/lost jobs?
YES -> Prioritize Task 9 (State Accuracy) + Task 6 (Resume)
4. Is debugging production issues slow (> 30 min)?
YES -> Prioritize Task 10 (Observability)
5. Do batch operations fail entirely on single-item errors?
YES -> Prioritize Task 4 (Partial Completion) + Task 7 (Loop Safeguards)
6. Are jobs processed by a queue system?
YES -> Prioritize Task 8 (Queue Resilience)
System Type Prioritization
AI/Media Pipeline:
1. Timeout behavior (Task 3) -- expensive external calls can hang
2. Retry safety (Task 5) -- retries cost real money
3. Partial completion (Task 4) -- large batches must not restart from zero
4. Observability (Task 10) -- must track provider call details
SaaS Platform:
1. State accuracy (Task 9) -- user-facing state must be correct
2. Queue resilience (Task 8) -- job processing is core business logic
3. Resume behavior (Task 6) -- users expect progress to survive failures
4. Observability (Task 10) -- multi-tenant debugging requires correlation
E-Commerce:
1. Retry safety (Task 5) -- payment and inventory operations must be idempotent
2. State accuracy (Task 9) -- order state drives fulfillment
3. Timeout behavior (Task 3) -- payment gateway timeouts are common
4. Queue resilience (Task 8) -- order processing queue is critical path
Execution Methodology
Phase 1: Discovery (1-2 days)
- Complete Task 1 (Map Workflows)
- Complete Task 2 (Identify High-Risk Steps)
- Create risk-prioritized test plan
Phase 2: Core Testing (2-3 days)
- Task 3: Timeout Behavior
- Task 5: Retry Safety
- Task 9: State Accuracy
- Task 10: Observability
Phase 3: Deep Testing (2-3 days)
- Task 4: Partial Completion
- Task 6: Resume Behavior
- Task 7: Sequential Loop Safeguards
- Task 8: Queue Resilience
Phase 4: Remediation (ongoing)
- Fix CRITICAL findings immediately
- Schedule HIGH findings for next sprint
- Track MEDIUM findings in backlog
What Earlier Audits Miss
Standard reliability testing verifies that the system handles known error cases. This audit covers the gaps those approaches leave:
- Unit tests exercise individual components in isolation. They do not show what happens when one component fails while another depends on it.
- Integration tests run in controlled environments. They rarely simulate network partitions, worker crashes, or queue message loss.
- Load tests verify performance under sustained throughput. They do not inject faults during load.
- The other 10 audits in this pack each cover a specific concern. This audit ties them together and tests the interactions between them.
- Chaos engineering is often skipped because teams consider it "too risky for staging." This audit provides structured, safe fault injection tests.
In short, the Reliability & Resilience Audit tests whether the system produces correct results and recovers automatically under component failures, network issues, timeout conditions, and queue problems.
Automation Opportunities
| Test | Automatable? | Method |
|---|---|---|
| TEST-RR-T01: API timeout | YES | Mock slow external service; assert timeout and cleanup |
| TEST-RR-T02: Job timeout | YES | Inject hanging dependency; assert job-level timeout fires |
| TEST-RR-P01: Batch partial failure | YES | Inject failure at item N; assert partial success state |
| TEST-RR-P02: Systemic failure | YES | Kill mock API mid-batch; assert circuit breaker |
| TEST-RR-R01: Retry single | YES | Fail one item, retry, assert only that item reprocessed |
| TEST-RR-R03: Retry idempotency | YES | Double-retry; assert single execution |
| TEST-RR-RE01: Worker crash | YES | Kill worker process; restart; assert resume from checkpoint |
| TEST-RR-SL01: Loop bounds | YES | Feed infinite pagination; assert loop terminates |
| TEST-RR-Q01: Queue durability | YES | Restart queue; assert messages preserved |
| TEST-RR-S01: State accuracy | YES | Run failure scenarios; query DB for impossible states |
| TEST-RR-O01: Diagnosis time | MANUAL | Simulate incident; measure time to root cause |
Reusable Audit Report Template
# Reliability & Resilience Audit Report
## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________
## Workflow Inventory (Task 1)
| Workflow | Steps | Duration | External Dependencies | Risk |
|----------|-------|---------|----------------------|------|
| ___ | ___ | ___s | ___ | HIGH/MEDIUM/LOW |
## Test Results by Task
| Task | Tests Run | Passed | Failed | Critical Findings |
|------|----------|--------|--------|-------------------|
| 3. Timeouts | ___ | ___ | ___ | ___ |
| 4. Partial completion | ___ | ___ | ___ | ___ |
| 5. Retry safety | ___ | ___ | ___ | ___ |
| 6. Resume | ___ | ___ | ___ | ___ |
| 7. Loop safeguards | ___ | ___ | ___ | ___ |
| 8. Queue resilience | ___ | ___ | ___ | ___ |
| 9. State accuracy | ___ | ___ | ___ | ___ |
| 10. Observability | ___ | ___ | ___ | ___ |
## Overall Score: PASS / PARTIAL / FAIL
Post-Audit Deliverables
1. Workflow inventory (Task 1 output)
2. Risk assessment matrix (Task 2 output)
3. Test results per task (pass/partial/fail with evidence)
4. Finding severity classification (CRITICAL/HIGH/MEDIUM)
5. Remediation recommendations with effort estimates
6. Retest plan (verify fixes)