
# Reliability & Resilience Audit

## Quick Summary
This is the most comprehensive audit in the production audit pack. It tests long-running, distributed workflows under real-world conditions: timeouts, partial failures, retries, queue issues, state corruption, and observability gaps. This audit encompasses and extends all other audits in the pack, providing a unified methodology for verifying production reliability.

## Key Points

1. Map every long-running workflow
2. Identify high-risk steps
3. Test timeout behavior
4. Test partial completion
5. Test retry safety
6. Test resume behavior
7. Test sequential loop safeguards
8. Test queue resilience
9. Test state accuracy
10. Test observability

For step 1, map each workflow as follows:

1. List every user-triggerable operation that involves background processing.
2. For each, document the complete step sequence.
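
The workflow map and the high-risk pass over it can be kept as plain data. The following is a minimal sketch, not part of the skill itself: the `Step` and `Workflow` types, the example "Export report" workflow, and the risk rule (an external call with no explicit timeout, or a step that is not idempotent) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    name: str
    calls_external_service: bool = False     # hypothetical flag: step leaves the process
    idempotent: bool = True                  # safe to run twice with the same input
    timeout_seconds: Optional[float] = None  # None means no explicit timeout is set


@dataclass
class Workflow:
    trigger: str                             # the user action that starts the workflow
    steps: List[Step] = field(default_factory=list)


def high_risk_steps(workflow: Workflow) -> List[str]:
    """Return the names of steps to test first (assumed rule, see lead-in)."""
    risky = []
    for step in workflow.steps:
        no_timeout = step.calls_external_service and step.timeout_seconds is None
        if no_timeout or not step.idempotent:
            risky.append(step.name)
    return risky


# Hypothetical workflow: one user-triggerable operation with its full step sequence.
export_report = Workflow(
    trigger="user clicks 'Export report'",
    steps=[
        Step("enqueue_job"),
        Step("fetch_rows", calls_external_service=True, timeout_seconds=30),
        Step("render_pdf", calls_external_service=True),
        Step("send_email", calls_external_service=True,
             idempotent=False, timeout_seconds=10),
    ],
)

print(high_risk_steps(export_report))  # ['render_pdf', 'send_email']
```

Writing the map down as data means the later timeout, retry, and resume tests can iterate over the same inventory instead of relying on memory of which steps exist.
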
skilldb get production-audit-skills/reliability-resilience-audit

Full skill: 736 lines

Install this skill directly: skilldb add production-audit-skills
