
Production Audit Master Checklist

## Quick Summary
This meta-skill ties together all 11 production audit skills into a prioritized execution plan. Use this as the starting point for any production audit engagement. It provides the full 15-category checklist, priority ordering for different system types, a reusable audit template, scoring rubric, and timeline estimation.

## Key Points

1. Cost Control      -- Paid API calls can bankrupt you overnight
2. Idempotency       -- Retries must not double-charge
3. Reliability       -- Long-running jobs must survive failures
4. Recovery          -- Interrupted pipelines must resume, not restart
5. State Machine     -- Job states must be accurate and reachable
6. Observability     -- Must trace individual generations through the stack
7. Throughput        -- Batch operations must scale
8. Concurrency       -- Multiple users generating simultaneously
9. Data Lifecycle    -- Asset versioning, cleanup, export
10. Permissions      -- Project sharing, tenant isolation
11. Security         -- Auth, input validation
12. Human Error      -- Accidental re-generation, delete safety

## Quick Example

```
| Rating | Criteria |
|--------|----------|
| PRODUCTION READY | All 15 categories PASS. No CRITICAL or HIGH findings. |
| CONDITIONALLY READY | No CRITICAL findings. <= 3 categories PARTIAL. Workarounds in place. |
| NOT READY | Any CRITICAL finding. OR > 3 categories FAIL. |
```

```
Minimum (critical categories only, 5 categories): 5-8 business days
Standard (all 15 categories, sequential): 15-20 business days
Accelerated (all 15, 2 auditors parallel): 8-12 business days
Focused (4 categories based on system type): 4-6 business days
```

Production Audit Master Checklist

Purpose

This meta-skill ties together all 11 production audit skills into a prioritized execution plan. Use this as the starting point for any production audit engagement. It provides the full 15-category checklist, priority ordering for different system types, a reusable audit template, scoring rubric, and timeline estimation.


The 15-Category Production Audit Framework

Category Overview

| # | Category | Audit Skill | Focus |
|---|----------|-------------|-------|
| 1 | Security | (external/standard) | Authentication, authorization, injection, encryption |
| 2 | UI/UX Safety | human-error-operator-audit | Destructive actions, error messaging, navigation safety |
| 3 | Code Quality | (external/standard) | Linting, type safety, test coverage, dead code |
| 4 | Data Integrity | data-lifecycle-audit | Referential integrity, orphans, cascade correctness |
| 5 | Reliability | reliability-resilience-audit | Timeouts, fault tolerance, chaos testing |
| 6 | Throughput | throughput-scale-audit | N+1 queries, pagination, load behavior |
| 7 | Recovery | recovery-resume-audit | Interrupted workflows, checkpoint, resume |
| 8 | Concurrency | concurrency-race-condition-audit | Race conditions, locking, duplicate prevention |
| 9 | State Machine | state-machine-audit | State definitions, transitions, impossible states |
| 10 | Idempotency | idempotency-audit | Retry safety, webhook dedupe, payment safety |
| 11 | Observability | observability-debuggability-audit | Logging, tracing, alerting, health checks |
| 12 | Cost Control | cost-explosion-audit | Runaway spend, retry storms, cache effectiveness |
| 13 | Permissions | permission-drift-audit | Access control, tenant isolation, role propagation |
| 14 | Data Lifecycle | data-lifecycle-audit | Duplication, deletion, export/import, versioning |
| 15 | Human Error | human-error-operator-audit | Accidental actions, undo, soft delete, recovery |

Priority Ordering by System Type

AI / Media Pipeline (e.g., image generation, video processing, content pipelines)

CRITICAL (Week 1):
  1. Cost Control      -- Paid API calls can bankrupt you overnight
  2. Idempotency       -- Retries must not double-charge
  3. Reliability       -- Long-running jobs must survive failures
  4. Recovery          -- Interrupted pipelines must resume, not restart

HIGH (Week 2):
  5. State Machine     -- Job states must be accurate and reachable
  6. Observability     -- Must trace individual generations through the stack
  7. Throughput        -- Batch operations must scale

MEDIUM (Week 3):
  8. Concurrency       -- Multiple users generating simultaneously
  9. Data Lifecycle    -- Asset versioning, cleanup, export
  10. Permissions      -- Project sharing, tenant isolation

STANDARD (Week 4):
  11. Security         -- Auth, input validation
  12. Human Error      -- Accidental re-generation, delete safety
  13. UI/UX Safety     -- Error messages, progress indicators
  14. Code Quality     -- Technical debt assessment
  15. Data Integrity   -- Referential integrity checks

SaaS Platform (e.g., project management, CRM, collaboration tools)

CRITICAL (Week 1):
  1. Permissions       -- Multi-tenant isolation is existential
  2. Security          -- Auth, session management, injection
  3. Concurrency       -- Multi-user editing, real-time collaboration
  4. State Machine     -- Workflow states drive business logic

HIGH (Week 2):
  5. Data Integrity    -- Referential integrity across entities
  6. Recovery          -- User work must survive interruptions
  7. Observability     -- Multi-tenant debugging requires correlation

MEDIUM (Week 3):
  8. Throughput        -- Dashboard and listing performance at scale
  9. Human Error       -- Undo, soft delete, navigation safety
  10. Idempotency      -- Form re-submission, webhook handling

STANDARD (Week 4):
  11. Reliability      -- Background job resilience
  12. Data Lifecycle   -- Export/import, archive/restore
  13. Cost Control     -- Resource quotas per tenant
  14. UI/UX Safety     -- Error messaging, destructive action guards
  15. Code Quality     -- Technical debt assessment

E-Commerce Platform (e.g., online store, marketplace, subscription service)

CRITICAL (Week 1):
  1. Idempotency       -- Double-charge prevention is existential
  2. Security          -- Payment data, PII protection
  3. State Machine     -- Order lifecycle must be bulletproof
  4. Concurrency       -- Inventory race conditions, flash sales

HIGH (Week 2):
  5. Reliability       -- Checkout flow must not fail silently
  6. Recovery          -- Cart and checkout must survive interruptions
  7. Permissions       -- Customer data isolation

MEDIUM (Week 3):
  8. Throughput        -- Scale for traffic spikes (sales events)
  9. Observability     -- Payment debugging, order tracing
  10. Data Integrity   -- Order-inventory-payment consistency

STANDARD (Week 4):
  11. Cost Control     -- Shipping API costs, email volume
  12. Data Lifecycle   -- Order history, GDPR deletion
  13. Human Error      -- Order cancellation, refund safety
  14. UI/UX Safety     -- Checkout error handling
  15. Code Quality     -- Technical debt assessment

API Platform (e.g., developer API, integration platform, data service)

CRITICAL (Week 1):
  1. Idempotency       -- Clients will retry; every endpoint must be safe
  2. Security          -- API key management, rate limiting, auth
  3. Throughput        -- Performance under concurrent client load
  4. Observability     -- Per-client request tracing, error debugging

HIGH (Week 2):
  5. Concurrency       -- Concurrent writes from multiple clients
  6. Reliability       -- Webhook delivery, async job resilience
  7. Cost Control      -- Per-client quotas, abuse prevention

MEDIUM (Week 3):
  8. State Machine     -- Async operation status lifecycle
  9. Permissions       -- API key scoping, resource ownership
  10. Recovery         -- Long-running async operations

STANDARD (Week 4):
  11. Data Integrity   -- Referential integrity, bulk operations
  12. Data Lifecycle   -- Data retention, export, deletion (GDPR)
  13. Human Error      -- API misuse protection, clear error responses
  14. UI/UX Safety     -- Dashboard and admin panel safety
  15. Code Quality     -- SDK quality, documentation accuracy
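The system-type orderings above can be encoded so a focused audit plan (the four CRITICAL categories for a given system type) is selected programmatically. This is a minimal sketch; the dictionary keys and the function name are assumptions, not part of the skill.

```python
# CRITICAL-tier (Week 1) categories per system type, transcribed from the
# priority orderings above. Only the top tier is encoded in this sketch.
PRIORITIES = {
    "ai_media_pipeline": ["Cost Control", "Idempotency", "Reliability", "Recovery"],
    "saas": ["Permissions", "Security", "Concurrency", "State Machine"],
    "ecommerce": ["Idempotency", "Security", "State Machine", "Concurrency"],
    "api_platform": ["Idempotency", "Security", "Throughput", "Observability"],
}

def week_one_plan(system_type: str) -> list[str]:
    """Return the four CRITICAL categories for a focused audit."""
    return PRIORITIES[system_type]
```

The same four-category list doubles as the scope for the "Focused" timeline option described under Timeline Estimation.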

Reusable Audit Template

Use this template for each category audit:

# [Category] Audit Report

## System: [System Name]
## Date: [YYYY-MM-DD]
## Auditor: [Name]
## Scope: [What was included/excluded]

## Executive Summary
[2-3 sentences: overall assessment and critical findings]

## Findings

### Finding 1: [Title]
- **Severity:** CRITICAL / HIGH / MEDIUM / LOW
- **Category:** [Which of the 15 categories]
- **Description:** [What was found]
- **Evidence:** [How it was observed/reproduced]
- **Impact:** [What could happen in production]
- **Recommendation:** [How to fix]
- **Effort:** [Estimated hours/days to fix]
- **Status:** OPEN / IN PROGRESS / FIXED / ACCEPTED RISK

### Finding 2: [Title]
[Same structure]

## Test Results

| Test ID | Description | Result | Notes |
|---------|-------------|--------|-------|
| TEST-XX-001 | [What was tested] | PASS / FAIL / PARTIAL | [Details] |
| TEST-XX-002 | [What was tested] | PASS / FAIL / PARTIAL | [Details] |

## Category Score
- **Score:** PASS / PARTIAL / FAIL
- **Critical findings:** [count]
- **High findings:** [count]
- **Medium findings:** [count]

## Recommendations
1. [Prioritized list of remediation actions]
2. ...

## Retest Plan
- [When and how fixes will be verified]

Scoring Rubric

Per-Category Scoring

| Score | Criteria | Action Required |
|-------|----------|-----------------|
| PASS | No CRITICAL or HIGH findings. All core tests pass. Minor issues only. | No blocking action. Address MEDIUM findings in normal sprint cycle. |
| PARTIAL | No CRITICAL findings. 1-2 HIGH findings with known workarounds. Most tests pass. | Fix HIGH findings within 2 sprints. Workarounds documented. |
| FAIL | Any CRITICAL finding. OR 3+ HIGH findings. OR core functionality broken. | Stop and fix before next release. May require incident response. |

Finding Severity Definitions

| Severity | Definition | Response Time |
|----------|------------|---------------|
| CRITICAL | Data loss, security breach, financial impact, or total feature failure likely in production. | Fix within 24-48 hours. May warrant hotfix. |
| HIGH | Significant reliability, data integrity, or user experience issue. Will affect multiple users. | Fix within current sprint (1-2 weeks). |
| MEDIUM | Moderate issue. Workarounds exist. Affects edge cases or non-critical paths. | Fix within next 2 sprints (2-4 weeks). |
| LOW | Minor issue. Cosmetic, optimization, or best-practice deviation. | Add to backlog. Fix opportunistically. |

Overall System Score

| Rating | Criteria |
|--------|----------|
| PRODUCTION READY | All 15 categories PASS. No CRITICAL or HIGH findings. |
| CONDITIONALLY READY | No CRITICAL findings. <= 3 categories PARTIAL. Workarounds in place. |
| NOT READY | Any CRITICAL finding. OR > 3 categories FAIL. |
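The rating rules above can be checked mechanically once each category has a score and finding counts. This sketch implements that logic; the `CategoryResult` shape is an assumption, and since the rubric does not state how 1-3 FAIL categories without a CRITICAL finding should rate, the sketch conservatively treats any FAIL as blocking CONDITIONALLY READY.

```python
from dataclasses import dataclass

@dataclass
class CategoryResult:
    name: str
    score: str          # "PASS" | "PARTIAL" | "FAIL"
    critical: int = 0   # count of CRITICAL findings
    high: int = 0       # count of HIGH findings

def overall_rating(results: list[CategoryResult]) -> str:
    any_critical = any(r.critical > 0 for r in results)
    any_high = any(r.high > 0 for r in results)
    fails = sum(1 for r in results if r.score == "FAIL")
    partials = sum(1 for r in results if r.score == "PARTIAL")

    if any_critical or fails > 3:
        return "NOT READY"
    if all(r.score == "PASS" for r in results) and not any_high:
        return "PRODUCTION READY"
    if fails == 0 and partials <= 3:
        return "CONDITIONALLY READY"
    # Rubric gap: FAILs without a CRITICAL finding -- treated as blocking.
    return "NOT READY"
```

Running this over the scoring matrix after each audit gives a comparable headline rating quarter over quarter.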

Scoring Matrix Template

| # | Category | Score | Critical | High | Medium | Low | Notes |
|---|----------|-------|----------|------|--------|-----|-------|
| 1 | Security | | | | | | |
| 2 | UI/UX Safety | | | | | | |
| 3 | Code Quality | | | | | | |
| 4 | Data Integrity | | | | | | |
| 5 | Reliability | | | | | | |
| 6 | Throughput | | | | | | |
| 7 | Recovery | | | | | | |
| 8 | Concurrency | | | | | | |
| 9 | State Machine | | | | | | |
| 10 | Idempotency | | | | | | |
| 11 | Observability | | | | | | |
| 12 | Cost Control | | | | | | |
| 13 | Permissions | | | | | | |
| 14 | Data Lifecycle | | | | | | |
| 15 | Human Error | | | | | | |
| **OVERALL** | | | | | | | |

Timeline Estimation

Per-Category Audit Duration

| Category | Preparation | Testing | Analysis & Report | Total |
|----------|-------------|---------|-------------------|-------|
| Security | 2h | 4-8h | 2h | 1-1.5 days |
| UI/UX Safety | 1h | 2-4h | 1h | 0.5-1 day |
| Code Quality | 1h | 2-4h | 2h | 0.5-1 day |
| Data Integrity | 2h | 4-6h | 2h | 1-1.5 days |
| Reliability | 4h | 8-16h | 4h | 2-3 days |
| Throughput | 4h (data seeding) | 4-8h | 2h | 1.5-2 days |
| Recovery | 2h | 4-8h | 2h | 1-1.5 days |
| Concurrency | 2h | 4-8h | 2h | 1-1.5 days |
| State Machine | 2h | 4-6h | 2h | 1-1.5 days |
| Idempotency | 2h | 4-8h | 2h | 1-1.5 days |
| Observability | 1h | 2-4h | 2h | 0.5-1 day |
| Cost Control | 2h | 4-6h | 2h | 1-1.5 days |
| Permissions | 2h | 4-8h | 2h | 1-1.5 days |
| Data Lifecycle | 2h | 4-8h | 2h | 1-1.5 days |
| Human Error | 1h | 2-4h | 1h | 0.5-1 day |

Full Audit Timeline

Minimum (critical categories only, 5 categories): 5-8 business days
Standard (all 15 categories, sequential): 15-20 business days
Accelerated (all 15, 2 auditors parallel): 8-12 business days
Focused (4 categories based on system type): 4-6 business days
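A scope-specific estimate can be derived from the per-category durations above. The sketch below sums the transcribed day ranges; note that a straight sum of all 15 categories (14.5-22.5 days) runs slightly over the stated 15-20 day standard timeline, which presumably assumes some overlap between categories. The naive even-split parallelism adjustment is an assumption.

```python
# (min_days, max_days) per category, transcribed from the duration table.
CATEGORY_DAYS = {
    "Security": (1, 1.5), "UI/UX Safety": (0.5, 1), "Code Quality": (0.5, 1),
    "Data Integrity": (1, 1.5), "Reliability": (2, 3), "Throughput": (1.5, 2),
    "Recovery": (1, 1.5), "Concurrency": (1, 1.5), "State Machine": (1, 1.5),
    "Idempotency": (1, 1.5), "Observability": (0.5, 1), "Cost Control": (1, 1.5),
    "Permissions": (1, 1.5), "Data Lifecycle": (1, 1.5), "Human Error": (0.5, 1),
}

def estimate(categories: list[str], auditors: int = 1) -> tuple[float, float]:
    """Return (min, max) business days for the given audit scope."""
    lo = sum(CATEGORY_DAYS[c][0] for c in categories)
    hi = sum(CATEGORY_DAYS[c][1] for c in categories)
    # Naive parallelism: assumes work splits evenly across auditors.
    return (lo / auditors, hi / auditors)
```

For example, the focused AI-pipeline scope (Cost Control, Idempotency, Reliability, Recovery) sums to 5-7.5 days, in line with the 4-6 day figure once overlap is accounted for.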

Recommended Cadence

Full audit: Quarterly (every 3 months) or before major releases
Focused audit: Monthly (rotate through 4 categories per month)
Spot check: After any significant architecture change
Continuous: Automated tests for critical categories in CI/CD

Audit Execution Playbook

Phase 1: Planning (Day 1)

1. Identify system type (AI pipeline, SaaS, e-commerce, API platform)
2. Select priority ordering from the templates above
3. Determine scope (full 15 or focused subset)
4. Gather prerequisites:
   - Architecture diagram
   - Data model / schema
   - Deployment topology
   - External dependency list
   - Incident history (last 6 months)
   - Known technical debt list
5. Set up test environment (staging or dedicated audit environment)
6. Prepare test data (seed scripts from throughput-scale-audit)

Phase 2: Discovery (Days 2-3)

1. Complete reliability-resilience-audit Task 1 (Map Workflows)
2. Complete reliability-resilience-audit Task 2 (Identify High-Risk Steps)
3. Review codebase for patterns:
   - State management approach
   - Error handling patterns
   - External API integration patterns
   - Queue/job processing patterns
4. Output: Risk-prioritized test plan

Phase 3: Testing (Days 4-12)

Execute category audits in priority order.
For each category:
1. Run all concrete test cases from the skill file
2. Document findings with evidence
3. Classify severity
4. Note quick wins (< 2 hours to fix)

Phase 4: Reporting (Days 13-15)

1. Complete scoring matrix for all categories
2. Write executive summary
3. Prioritize findings into:
   - Fix immediately (CRITICAL)
   - Fix this sprint (HIGH)
   - Fix next sprint (MEDIUM)
   - Backlog (LOW)
4. Create remediation plan with effort estimates
5. Present to stakeholders

Phase 5: Remediation Tracking

1. Create tickets for all findings
2. Track fix progress
3. Schedule retest for CRITICAL and HIGH findings
4. Update audit scorecard as fixes are verified

Quick-Start: One-Day Focused Audit

If you only have one day, run this focused assessment:

Morning (4 hours):
  1. Map top 3 critical workflows (reliability-resilience Task 1)
  2. Run idempotency spot checks on payment/billing endpoints
  3. Run state machine audit on the primary job lifecycle
  4. Check for N+1 queries on main listing pages

Afternoon (4 hours):
  5. Test cross-tenant data access (permission-drift, 3 endpoints)
  6. Verify retry safety on one external API workflow
  7. Check error message quality on 5 failure scenarios
  8. Verify delete safety (confirmation + soft delete) on main entities

Output:
  - Top 5 critical findings with severity
  - Quick-win recommendations (< 2h fixes)
  - Full audit recommendation with priority areas

Skill Cross-Reference

Each audit skill references others. Use this map to understand dependencies:

reliability-resilience-audit (comprehensive)
  ├── references: recovery-resume-audit (Task 4, 6)
  ├── references: idempotency-audit (Task 5)
  ├── references: state-machine-audit (Task 9)
  ├── references: observability-debuggability-audit (Task 10)
  └── references: cost-explosion-audit (Task 7 safeguards)

concurrency-race-condition-audit
  ├── references: idempotency-audit (dedupe patterns)
  └── references: state-machine-audit (transition locking)

data-lifecycle-audit
  ├── references: permission-drift-audit (duplication ACLs)
  └── references: state-machine-audit (archive states)

human-error-operator-audit
  ├── references: data-lifecycle-audit (soft delete, versioning)
  └── references: recovery-resume-audit (retry safety from user perspective)

cost-explosion-audit
  ├── references: idempotency-audit (duplicate API calls)
  └── references: reliability-resilience-audit (retry storms)

throughput-scale-audit
  ├── references: observability-debuggability-audit (performance logging)
  └── references: concurrency-race-condition-audit (concurrent load)

Automation Opportunities

Categories that can be partially automated in CI/CD:

HIGH AUTOMATION POTENTIAL:
  [ ] Throughput: Automated load tests in CI (k6, artillery)
  [ ] Idempotency: Automated duplicate-request tests
  [ ] State Machine: Automated transition validation tests
  [ ] Data Integrity: Automated orphan detection queries (scheduled)
  [ ] Code Quality: Static analysis, linting, type checking

MEDIUM AUTOMATION POTENTIAL:
  [ ] Security: SAST/DAST scanning, dependency audit
  [ ] Observability: Automated log format validation
  [ ] Permissions: Automated cross-tenant access tests

LOW AUTOMATION POTENTIAL (manual testing required):
  [ ] Human Error: UX safety review
  [ ] Recovery: Failure injection testing (chaos engineering)
  [ ] Cost Control: Cost estimation review
  [ ] Concurrency: Race condition testing (hard to automate reliably)
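As one concrete example of the high-automation items above, an idempotency duplicate-request test can run on every CI build. The sketch below uses an in-memory stand-in for a payment handler so it is self-contained; in a real suite the test would call your actual endpoint, and the `charge()` signature, key name, and response shape here are assumptions.

```python
processed: dict[str, dict] = {}  # idempotency key -> stored response
charges: list[int] = []          # side effects actually executed

def charge(amount: int, idempotency_key: str) -> dict:
    """Stand-in payment handler with idempotency-key deduplication."""
    if idempotency_key in processed:   # dedupe: replay the stored response
        return processed[idempotency_key]
    charges.append(amount)             # the side effect happens exactly once
    response = {"status": "charged", "amount": amount}
    processed[idempotency_key] = response
    return response

def test_duplicate_request_charges_once():
    r1 = charge(500, idempotency_key="key-123")
    r2 = charge(500, idempotency_key="key-123")  # simulated client retry
    assert r1 == r2            # the retry sees the same response...
    assert len(charges) == 1   # ...and only one real charge occurred
```

The same duplicate-request pattern applies to webhook handlers and any endpoint clients may retry.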

What Earlier Audits Miss

Standard checklists verify individual concerns in isolation. This master checklist matters because:

  • Individual audit skills each cover one domain. Without this meta-skill, auditors miss cross-cutting concerns (e.g., a permission bug that causes a cost explosion via unauthorized API access).
  • Ad-hoc audits focus on the area that last caused an incident. This framework ensures systematic coverage so silent risks are not ignored.
  • One-time audits produce a report that goes stale. This framework provides a repeatable cadence with scoring to track improvement.
  • Generic checklists apply the same priority to every system. This framework provides system-type-specific prioritization.
  • Team-dependent knowledge means audits vary by who runs them. This framework standardizes methodology so results are comparable across auditors and time periods.

In short, this is a Production Audit Master Checklist: a unified framework for systematically evaluating production readiness across all 15 audit categories, with system-type-specific prioritization and scoring.


Risk Pattern Table

| Pattern | What It Hits | Risk | Symptom |
|---------|--------------|------|---------|
| No structured audit process | All categories | HIGH | Bugs found reactively (by users) instead of proactively |
| Audit focused only on last incident | Coverage gaps | HIGH | Same categories tested repeatedly; others never tested |
| No severity classification | Prioritization | MEDIUM | All findings treated equally; critical bugs not prioritized |
| Audit results not tracked over time | Improvement | MEDIUM | Same findings rediscovered each quarter |
| No system-type prioritization | Efficiency | MEDIUM | SaaS system audited with e-commerce priorities |
| Single auditor, no review | Quality | MEDIUM | Blind spots in auditor's expertise areas |
| No retest after remediation | Verification | HIGH | Fixes assumed correct; regressions not caught |
| Audit scope too narrow | Coverage | HIGH | Only API tested; background jobs, workers, and admin panels skipped |

Final Notes

This checklist is a living document. After each audit:

  1. Update the scoring matrix with current results.
  2. Track improvement over time (quarter-over-quarter).
  3. Add system-specific test cases discovered during testing.
  4. Document false positives to avoid re-investigation.
  5. Share findings across teams to prevent the same bugs in other systems.

The goal is not perfection on day one. The goal is systematic improvement with clear visibility into what is safe and what is not.

Install this skill directly: skilldb add production-audit-skills
