Production Audit Master Checklist
Purpose
This meta-skill ties together all 11 production audit skills into a prioritized execution plan. Use it as the starting point for any production audit engagement. It provides the full 15-category checklist, priority ordering for different system types, a reusable audit template, a scoring rubric, and timeline estimates.
The 15-Category Production Audit Framework
Category Overview
| # | Category | Audit Skill | Focus |
|---|---|---|---|
| 1 | Security | (external/standard) | Authentication, authorization, injection, encryption |
| 2 | UI/UX Safety | human-error-operator-audit | Destructive actions, error messaging, navigation safety |
| 3 | Code Quality | (external/standard) | Linting, type safety, test coverage, dead code |
| 4 | Data Integrity | data-lifecycle-audit | Referential integrity, orphans, cascade correctness |
| 5 | Reliability | reliability-resilience-audit | Timeouts, fault tolerance, chaos testing |
| 6 | Throughput | throughput-scale-audit | N+1 queries, pagination, load behavior |
| 7 | Recovery | recovery-resume-audit | Interrupted workflows, checkpoint, resume |
| 8 | Concurrency | concurrency-race-condition-audit | Race conditions, locking, duplicate prevention |
| 9 | State Machine | state-machine-audit | State definitions, transitions, impossible states |
| 10 | Idempotency | idempotency-audit | Retry safety, webhook dedupe, payment safety |
| 11 | Observability | observability-debuggability-audit | Logging, tracing, alerting, health checks |
| 12 | Cost Control | cost-explosion-audit | Runaway spend, retry storms, cache effectiveness |
| 13 | Permissions | permission-drift-audit | Access control, tenant isolation, role propagation |
| 14 | Data Lifecycle | data-lifecycle-audit | Duplication, deletion, export/import, versioning |
| 15 | Human Error | human-error-operator-audit | Accidental actions, undo, soft delete, recovery |
Priority Ordering by System Type
AI / Media Pipeline (e.g., image generation, video processing, content pipelines)
CRITICAL (Week 1):
1. Cost Control -- Paid API calls can bankrupt you overnight
2. Idempotency -- Retries must not double-charge
3. Reliability -- Long-running jobs must survive failures
4. Recovery -- Interrupted pipelines must resume, not restart
HIGH (Week 2):
5. State Machine -- Job states must be accurate and reachable
6. Observability -- Must trace individual generations through the stack
7. Throughput -- Batch operations must scale
MEDIUM (Week 3):
8. Concurrency -- Multiple users generating simultaneously
9. Data Lifecycle -- Asset versioning, cleanup, export
10. Permissions -- Project sharing, tenant isolation
STANDARD (Week 4):
11. Security -- Auth, input validation
12. Human Error -- Accidental re-generation, delete safety
13. UI/UX Safety -- Error messages, progress indicators
14. Code Quality -- Technical debt assessment
15. Data Integrity -- Referential integrity checks
SaaS Platform (e.g., project management, CRM, collaboration tools)
CRITICAL (Week 1):
1. Permissions -- Multi-tenant isolation is existential
2. Security -- Auth, session management, injection
3. Concurrency -- Multi-user editing, real-time collaboration
4. State Machine -- Workflow states drive business logic
HIGH (Week 2):
5. Data Integrity -- Referential integrity across entities
6. Recovery -- User work must survive interruptions
7. Observability -- Multi-tenant debugging requires correlation
MEDIUM (Week 3):
8. Throughput -- Dashboard and listing performance at scale
9. Human Error -- Undo, soft delete, navigation safety
10. Idempotency -- Form re-submission, webhook handling
STANDARD (Week 4):
11. Reliability -- Background job resilience
12. Data Lifecycle -- Export/import, archive/restore
13. Cost Control -- Resource quotas per tenant
14. UI/UX Safety -- Error messaging, destructive action guards
15. Code Quality -- Technical debt assessment
E-Commerce Platform (e.g., online store, marketplace, subscription service)
CRITICAL (Week 1):
1. Idempotency -- Double-charge prevention is existential
2. Security -- Payment data, PII protection
3. State Machine -- Order lifecycle must be bulletproof
4. Concurrency -- Inventory race conditions, flash sales
HIGH (Week 2):
5. Reliability -- Checkout flow must not fail silently
6. Recovery -- Cart and checkout must survive interruptions
7. Permissions -- Customer data isolation
MEDIUM (Week 3):
8. Throughput -- Scale for traffic spikes (sales events)
9. Observability -- Payment debugging, order tracing
10. Data Integrity -- Order-inventory-payment consistency
STANDARD (Week 4):
11. Cost Control -- Shipping API costs, email volume
12. Data Lifecycle -- Order history, GDPR deletion
13. Human Error -- Order cancellation, refund safety
14. UI/UX Safety -- Checkout error handling
15. Code Quality -- Technical debt assessment
API Platform (e.g., developer API, integration platform, data service)
CRITICAL (Week 1):
1. Idempotency -- Clients will retry; every endpoint must be safe
2. Security -- API key management, rate limiting, auth
3. Throughput -- Performance under concurrent client load
4. Observability -- Per-client request tracing, error debugging
HIGH (Week 2):
5. Concurrency -- Concurrent writes from multiple clients
6. Reliability -- Webhook delivery, async job resilience
7. Cost Control -- Per-client quotas, abuse prevention
MEDIUM (Week 3):
8. State Machine -- Async operation status lifecycle
9. Permissions -- API key scoping, resource ownership
10. Recovery -- Long-running async operations
STANDARD (Week 4):
11. Data Integrity -- Referential integrity, bulk operations
12. Data Lifecycle -- Data retention, export, deletion (GDPR)
13. Human Error -- API misuse protection, clear error responses
14. UI/UX Safety -- Dashboard and admin panel safety
15. Code Quality -- SDK quality, documentation accuracy
Reusable Audit Template
Use this template for each category audit:
# [Category] Audit Report
## System: [System Name]
## Date: [YYYY-MM-DD]
## Auditor: [Name]
## Scope: [What was included/excluded]
## Executive Summary
[2-3 sentences: overall assessment and critical findings]
## Findings
### Finding 1: [Title]
- **Severity:** CRITICAL / HIGH / MEDIUM / LOW
- **Category:** [Which of the 15 categories]
- **Description:** [What was found]
- **Evidence:** [How it was observed/reproduced]
- **Impact:** [What could happen in production]
- **Recommendation:** [How to fix]
- **Effort:** [Estimated hours/days to fix]
- **Status:** OPEN / IN PROGRESS / FIXED / ACCEPTED RISK
### Finding 2: [Title]
[Same structure]
## Test Results
| Test ID | Description | Result | Notes |
|---------|-------------|--------|-------|
| TEST-XX-001 | [What was tested] | PASS / FAIL / PARTIAL | [Details] |
| TEST-XX-002 | [What was tested] | PASS / FAIL / PARTIAL | [Details] |
## Category Score
- **Score:** PASS / PARTIAL / FAIL
- **Critical findings:** [count]
- **High findings:** [count]
- **Medium findings:** [count]
## Recommendations
1. [Prioritized list of remediation actions]
2. ...
## Retest Plan
- [When and how fixes will be verified]
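The finding structure in the template above can also be captured in code, so reports stay uniform across auditors and can be generated programmatically. A minimal sketch; the field names mirror the template, and everything beyond that is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One audit finding, mirroring the report template fields."""
    title: str
    severity: str         # CRITICAL / HIGH / MEDIUM / LOW
    category: str         # one of the 15 audit categories
    description: str
    evidence: str
    impact: str
    recommendation: str
    effort: str           # estimated hours/days to fix
    status: str = "OPEN"  # OPEN / IN PROGRESS / FIXED / ACCEPTED RISK

    def to_markdown(self) -> str:
        """Render the finding in the template's markdown shape."""
        return "\n".join([
            f"### Finding: {self.title}",
            f"- **Severity:** {self.severity}",
            f"- **Category:** {self.category}",
            f"- **Description:** {self.description}",
            f"- **Evidence:** {self.evidence}",
            f"- **Impact:** {self.impact}",
            f"- **Recommendation:** {self.recommendation}",
            f"- **Effort:** {self.effort}",
            f"- **Status:** {self.status}",
        ])
```

Keeping findings as structured data (rather than free-form prose) also makes the scoring matrix and remediation tracking later in this checklist trivial to automate.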
Scoring Rubric
Per-Category Scoring
| Score | Criteria | Action Required |
|---|---|---|
| PASS | No CRITICAL or HIGH findings. All core tests pass. Minor issues only. | No blocking action. Address MEDIUM findings in normal sprint cycle. |
| PARTIAL | No CRITICAL findings. 1-2 HIGH findings with known workarounds. Most tests pass. | Fix HIGH findings within 2 sprints. Workarounds documented. |
| FAIL | Any CRITICAL finding. OR 3+ HIGH findings. OR core functionality broken. | Stop and fix before next release. May require incident response. |
Finding Severity Definitions
| Severity | Definition | Response Time |
|---|---|---|
| CRITICAL | Data loss, security breach, financial impact, or total feature failure likely in production. | Fix within 24-48 hours. May warrant hotfix. |
| HIGH | Significant reliability, data integrity, or user experience issue. Will affect multiple users. | Fix within current sprint (1-2 weeks). |
| MEDIUM | Moderate issue. Workarounds exist. Affects edge cases or non-critical paths. | Fix within next 2 sprints (2-4 weeks). |
| LOW | Minor issue. Cosmetic, optimization, or best-practice deviation. | Add to backlog. Fix opportunistically. |
Overall System Score
| Rating | Criteria |
|--------|----------|
| PRODUCTION READY | All 15 categories PASS. No CRITICAL or HIGH findings. |
| CONDITIONALLY READY | No CRITICAL findings. <= 3 categories PARTIAL. Workarounds in place. |
| NOT READY | Any CRITICAL finding. OR > 3 categories FAIL. |
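The overall rubric can be applied mechanically once per-category scores exist. A sketch, under two assumptions: scores are the strings "PASS" / "PARTIAL" / "FAIL", and a category FAIL without any system-level CRITICAL finding is resolved conservatively as NOT READY (the rubric above does not spell out that edge case):

```python
def overall_rating(category_scores: dict[str, str],
                   critical_findings: int) -> str:
    """Map per-category scores + CRITICAL finding count to the
    overall system rating from the rubric table."""
    fails = sum(1 for s in category_scores.values() if s == "FAIL")
    partials = sum(1 for s in category_scores.values() if s == "PARTIAL")

    # NOT READY: any CRITICAL finding, or more than 3 category FAILs.
    if critical_findings > 0 or fails > 3:
        return "NOT READY"
    # PRODUCTION READY: every category PASSes (a PASS already implies
    # no CRITICAL or HIGH findings per the per-category rubric).
    if fails == 0 and partials == 0:
        return "PRODUCTION READY"
    # CONDITIONALLY READY: no CRITICALs, at most 3 PARTIALs, no FAILs.
    if fails == 0 and partials <= 3:
        return "CONDITIONALLY READY"
    # Conservative fallback for ambiguous combinations (e.g. 1-3 FAILs).
    return "NOT READY"
```

Note the rubric also requires documented workarounds for CONDITIONALLY READY; that judgment call stays manual.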
Scoring Matrix Template
| # | Category | Score | Critical | High | Medium | Low | Notes |
|---|----------|-------|----------|------|--------|-----|-------|
| 1 | Security | | | | | | |
| 2 | UI/UX Safety | | | | | | |
| 3 | Code Quality | | | | | | |
| 4 | Data Integrity | | | | | | |
| 5 | Reliability | | | | | | |
| 6 | Throughput | | | | | | |
| 7 | Recovery | | | | | | |
| 8 | Concurrency | | | | | | |
| 9 | State Machine | | | | | | |
| 10 | Idempotency | | | | | | |
| 11 | Observability | | | | | | |
| 12 | Cost Control | | | | | | |
| 13 | Permissions | | | | | | |
| 14 | Data Lifecycle | | | | | | |
| 15 | Human Error | | | | | | |
| | **OVERALL** | | | | | | |
Timeline Estimation
Per-Category Audit Duration
| Category | Preparation | Testing | Analysis & Report | Total |
|---|---|---|---|---|
| Security | 2h | 4-8h | 2h | 1-1.5 days |
| UI/UX Safety | 1h | 2-4h | 1h | 0.5-1 day |
| Code Quality | 1h | 2-4h | 2h | 0.5-1 day |
| Data Integrity | 2h | 4-6h | 2h | 1-1.5 days |
| Reliability | 4h | 8-16h | 4h | 2-3 days |
| Throughput | 4h (data seeding) | 4-8h | 2h | 1.5-2 days |
| Recovery | 2h | 4-8h | 2h | 1-1.5 days |
| Concurrency | 2h | 4-8h | 2h | 1-1.5 days |
| State Machine | 2h | 4-6h | 2h | 1-1.5 days |
| Idempotency | 2h | 4-8h | 2h | 1-1.5 days |
| Observability | 1h | 2-4h | 2h | 0.5-1 day |
| Cost Control | 2h | 4-6h | 2h | 1-1.5 days |
| Permissions | 2h | 4-8h | 2h | 1-1.5 days |
| Data Lifecycle | 2h | 4-8h | 2h | 1-1.5 days |
| Human Error | 1h | 2-4h | 1h | 0.5-1 day |
Full Audit Timeline
Minimum (critical categories only, 5 categories): 5-8 business days
Standard (all 15 categories, sequential): 15-20 business days
Accelerated (all 15, 2 auditors parallel): 8-12 business days
Focused (4 categories based on system type): 4-6 business days
Recommended Cadence
Full audit: Quarterly (every 3 months) or before major releases
Focused audit: Monthly (rotate through 4 categories per month)
Spot check: After any significant architecture change
Continuous: Automated tests for critical categories in CI/CD
Audit Execution Playbook
Phase 1: Planning (Day 1)
1. Identify system type (AI pipeline, SaaS, e-commerce, API platform)
2. Select priority ordering from the templates above
3. Determine scope (full 15 or focused subset)
4. Gather prerequisites:
- Architecture diagram
- Data model / schema
- Deployment topology
- External dependency list
- Incident history (last 6 months)
- Known technical debt list
5. Set up test environment (staging or dedicated audit environment)
6. Prepare test data (seed scripts from throughput-scale-audit)
Phase 2: Discovery (Days 2-3)
1. Complete reliability-resilience-audit Task 1 (Map Workflows)
2. Complete reliability-resilience-audit Task 2 (Identify High-Risk Steps)
3. Review codebase for patterns:
- State management approach
- Error handling patterns
- External API integration patterns
- Queue/job processing patterns
4. Output: Risk-prioritized test plan
Phase 3: Testing (Days 4-12)
Execute category audits in priority order.
For each category:
1. Run all concrete test cases from the skill file
2. Document findings with evidence
3. Classify severity
4. Note quick wins (< 2 hours to fix)
Phase 4: Reporting (Days 13-15)
1. Complete scoring matrix for all categories
2. Write executive summary
3. Prioritize findings into:
- Fix immediately (CRITICAL)
- Fix this sprint (HIGH)
- Fix next sprint (MEDIUM)
- Backlog (LOW)
4. Create remediation plan with effort estimates
5. Present to stakeholders
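Step 3's bucketing is mechanical and worth scripting so nothing is mis-filed. A minimal sketch, with findings modeled as (title, severity) pairs; the bucket labels come from the list above:

```python
from collections import defaultdict

# Severity -> remediation bucket, per the Phase 4 prioritization.
BUCKETS = {
    "CRITICAL": "Fix immediately",
    "HIGH": "Fix this sprint",
    "MEDIUM": "Fix next sprint",
    "LOW": "Backlog",
}

def triage(findings: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (title, severity) findings into remediation buckets."""
    plan: dict[str, list[str]] = defaultdict(list)
    for title, severity in findings:
        plan[BUCKETS[severity]].append(title)
    return dict(plan)
```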
Phase 5: Remediation Tracking
1. Create tickets for all findings
2. Track fix progress
3. Schedule retest for CRITICAL and HIGH findings
4. Update audit scorecard as fixes are verified
Quick-Start: One-Day Focused Audit
If you only have one day, run this focused assessment:
Morning (4 hours):
1. Map top 3 critical workflows (reliability-resilience Task 1)
2. Run idempotency spot checks on payment/billing endpoints
3. Run state machine audit on the primary job lifecycle
4. Check for N+1 queries on main listing pages
Afternoon (4 hours):
5. Test cross-tenant data access (permission-drift, 3 endpoints)
6. Verify retry safety on one external API workflow
7. Check error message quality on 5 failure scenarios
8. Verify delete safety (confirmation + soft delete) on main entities
Output:
- Top 5 critical findings with severity
- Quick-win recommendations (< 2h fixes)
- Full audit recommendation with priority areas
Skill Cross-Reference
Each audit skill references others. Use this map to understand dependencies:
reliability-resilience-audit (comprehensive)
├── references: recovery-resume-audit (Task 4, 6)
├── references: idempotency-audit (Task 5)
├── references: state-machine-audit (Task 9)
├── references: observability-debuggability-audit (Task 10)
└── references: cost-explosion-audit (Task 7 safeguards)
concurrency-race-condition-audit
├── references: idempotency-audit (dedupe patterns)
└── references: state-machine-audit (transition locking)
data-lifecycle-audit
├── references: permission-drift-audit (duplication ACLs)
└── references: state-machine-audit (archive states)
human-error-operator-audit
├── references: data-lifecycle-audit (soft delete, versioning)
└── references: recovery-resume-audit (retry safety from user perspective)
cost-explosion-audit
├── references: idempotency-audit (duplicate API calls)
└── references: reliability-resilience-audit (retry storms)
throughput-scale-audit
├── references: observability-debuggability-audit (performance logging)
└── references: concurrency-race-condition-audit (concurrent load)
Automation Opportunities
Categories that can be partially automated in CI/CD:
HIGH AUTOMATION POTENTIAL:
[ ] Throughput: Automated load tests in CI (k6, artillery)
[ ] Idempotency: Automated duplicate-request tests
[ ] State Machine: Automated transition validation tests
[ ] Data Integrity: Automated orphan detection queries (scheduled)
[ ] Code Quality: Static analysis, linting, type checking
MEDIUM AUTOMATION POTENTIAL:
[ ] Security: SAST/DAST scanning, dependency audit
[ ] Observability: Automated log format validation
[ ] Permissions: Automated cross-tenant access tests
LOW AUTOMATION POTENTIAL (manual testing required):
[ ] Human Error: UX safety review
[ ] Recovery: Failure injection testing (chaos engineering)
[ ] Cost Control: Cost estimation review
[ ] Concurrency: Race condition testing (hard to automate reliably)
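The scheduled orphan-detection item can be as simple as a LEFT JOIN that flags child rows whose parent no longer exists. A sketch against SQLite with assumed table and column names (`projects`, `assets`, `project_id`); adapt to your schema:

```python
import sqlite3

# Build a tiny in-memory schema with one deliberately orphaned row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY);
    CREATE TABLE assets (id INTEGER PRIMARY KEY, project_id INTEGER);
    INSERT INTO projects VALUES (1);
    INSERT INTO assets VALUES (10, 1), (11, 2);  -- project 2 is gone
""")

# The detection query: child rows whose parent row is missing.
orphans = conn.execute("""
    SELECT a.id FROM assets a
    LEFT JOIN projects p ON p.id = a.project_id
    WHERE p.id IS NULL
""").fetchall()
print(orphans)  # -> [(11,)]
```

Run one such query per parent-child relationship on a schedule and alert when the result set is non-empty; a growing orphan count usually points at a missing cascade or a buggy delete path.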
What Earlier Audits Miss
Standard checklists verify individual concerns in isolation. This master checklist matters because:
- Individual audit skills each cover one domain. Without this meta-skill, auditors miss cross-cutting concerns (e.g., a permission bug that causes a cost explosion via unauthorized API access).
- Ad-hoc audits focus on the area that last caused an incident. This framework ensures systematic coverage so silent risks are not ignored.
- One-time audits produce a report that goes stale. This framework provides a repeatable cadence with scoring to track improvement.
- Generic checklists apply the same priority to every system. This framework provides system-type-specific prioritization.
- Team-dependent knowledge means audits vary by who runs them. This framework standardizes methodology so results are comparable across auditors and time periods.
In short: this master checklist provides a unified framework for systematically evaluating production readiness across all 15 audit categories, with system-type-specific prioritization and scoring.
Risk Pattern Table
| Pattern | What It Hits | Risk | Symptom |
|---|---|---|---|
| No structured audit process | All categories | HIGH | Bugs found reactively (by users) instead of proactively |
| Audit focused only on last incident | Coverage gaps | HIGH | Same categories tested repeatedly; others never tested |
| No severity classification | Prioritization | MEDIUM | All findings treated equally; critical bugs not prioritized |
| Audit results not tracked over time | Improvement | MEDIUM | Same findings rediscovered each quarter |
| No system-type prioritization | Efficiency | MEDIUM | SaaS system audited with e-commerce priorities |
| Single auditor, no review | Quality | MEDIUM | Blind spots in auditor's expertise areas |
| No retest after remediation | Verification | HIGH | Fixes assumed correct; regressions not caught |
| Audit scope too narrow | Coverage | HIGH | Only API tested; background jobs, workers, and admin panels skipped |
Final Notes
This checklist is a living document. After each audit:
- Update the scoring matrix with current results.
- Track improvement over time (quarter-over-quarter).
- Add system-specific test cases discovered during testing.
- Document false positives to avoid re-investigation.
- Share findings across teams to prevent the same bugs in other systems.
The goal is not perfection on day one. The goal is systematic improvement with clear visibility into what is safe and what is not.
Install this skill directly: skilldb add production-audit-skills