# Reliability & Resilience Audit
## Quick Summary
This is the most comprehensive audit in the production audit pack. It tests long-running, distributed workflows under real-world failure conditions: timeouts, partial failures, retries, queue issues, state corruption, and observability gaps. It encompasses and extends the other audits in the pack, giving a single methodology for verifying production reliability.

## Key Points

1. Map every long-running workflow
2. Identify high-risk steps
3. Test timeout behavior
4. Test partial completion
5. Test retry safety
6. Test resume behavior
7. Test sequential loop safeguards
8. Test queue resilience
9. Test state accuracy
10. Test observability

To map the workflows (point 1):

1. List every user-triggerable operation that involves background processing.
2. For each, document the complete step sequence.
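The first two key points (mapping workflows and flagging high-risk steps) and the retry-safety check can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the `Step`, `Workflow`, and `IdempotentExecutor` names, the "export report" workflow, and the rule used to decide what counts as high risk are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One step of a background workflow (hypothetical model)."""
    name: str
    has_side_effect: bool = False      # writes data, sends email, charges money
    calls_external: bool = False       # depends on a third-party service
    timeout_s: Optional[float] = None  # None means no timeout is configured


@dataclass
class Workflow:
    trigger: str
    steps: List[Step] = field(default_factory=list)

    def high_risk_steps(self) -> List[str]:
        # Assumed rule: a step is high risk if it mutates state, or if it
        # calls an external service without any timeout.
        return [
            s.name
            for s in self.steps
            if s.has_side_effect or (s.calls_external and s.timeout_s is None)
        ]


class IdempotentExecutor:
    """Runs a step at most once per idempotency key (in-memory sketch)."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}
        self.executions = 0

    def run(self, key: str, fn: Callable[[], object]) -> object:
        if key in self._results:
            # A retry with the same key returns the stored result.
            return self._results[key]
        self.executions += 1
        result = fn()
        self._results[key] = result
        return result


# Map one user-triggerable operation and its complete step sequence.
export = Workflow(
    trigger="user clicks 'Export report'",
    steps=[
        Step("enqueue job", has_side_effect=True, timeout_s=5),
        Step("fetch rows", timeout_s=30),
        Step("render PDF", calls_external=True),
        Step("email link", has_side_effect=True, calls_external=True, timeout_s=10),
    ],
)

# Retry-safety check: three attempts with one key must send one email.
sent: List[str] = []


def send_email() -> str:
    sent.append("link")
    return "sent"


executor = IdempotentExecutor()
for _ in range(3):
    executor.run("export-42:email", send_email)

print(export.high_risk_steps())
print(executor.executions, sent)
```

In a real audit the executor's result store would need to be durable (a database row or a queue deduplication key, for instance) so that the check also holds across process restarts, which is where the resume-behavior tests come in.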