Production Audit Master Checklist
Purpose
This meta-skill ties together all 11 production audit skills into a prioritized execution plan. Use it as the starting point for any production audit engagement. It provides the full 15-category checklist, priority ordering for different system types, a reusable audit template, a scoring rubric, and timeline estimates.
The 15-Category Production Audit Framework
Category Overview
| # | Category | Audit Skill | Focus |
|---|---|---|---|
| 1 | Security | (external/standard) | Authentication, authorization, injection, encryption |
| 2 | UI/UX Safety | human-error-operator-audit | Destructive actions, error messaging, navigation safety |
| 3 | Code Quality | (external/standard) | Linting, type safety, test coverage, dead code |
| 4 | Data Integrity | data-lifecycle-audit | Referential integrity, orphans, cascade correctness |
| 5 | Reliability | reliability-resilience-audit | Timeouts, fault tolerance, chaos testing |
| 6 | Throughput | throughput-scale-audit | N+1 queries, pagination, load behavior |
| 7 | Recovery | recovery-resume-audit | Interrupted workflows, checkpoint, resume |
| 8 | Concurrency | concurrency-race-condition-audit | Race conditions, locking, duplicate prevention |
| 9 | State Machine | state-machine-audit | State definitions, transitions, impossible states |
| 10 | Idempotency | idempotency-audit | Retry safety, webhook dedupe, payment safety |
| 11 | Observability | observability-debuggability-audit | Logging, tracing, alerting, health checks |
| 12 | Cost Control | cost-explosion-audit | Runaway spend, retry storms, cache effectiveness |
| 13 | Permissions | permission-drift-audit | Access control, tenant isolation, role propagation |
| 14 | Data Lifecycle | data-lifecycle-audit | Duplication, deletion, export/import, versioning |
| 15 | Human Error | human-error-operator-audit | Accidental actions, undo, soft delete, recovery |
Priority Ordering by System Type
AI / Media Pipeline (e.g., image generation, video processing, content pipelines)
CRITICAL (Week 1):
1. Cost Control -- Paid API calls can bankrupt you overnight
2. Idempotency -- Retries must not double-charge
3. Reliability -- Long-running jobs must survive failures
4. Recovery -- Interrupted pipelines must resume, not restart
HIGH (Week 2):
5. State Machine -- Job states must be accurate and reachable
6. Observability -- Must trace individual generations through the stack
7. Throughput -- Batch operations must scale
MEDIUM (Week 3):
8. Concurrency -- Multiple users generating simultaneously
9. Data Lifecycle -- Asset versioning, cleanup, export
10. Permissions -- Project sharing, tenant isolation
STANDARD (Week 4):
11. Security -- Auth, input validation
12. Human Error -- Accidental re-generation, delete safety
13. UI/UX Safety -- Error messages, progress indicators
14. Code Quality -- Technical debt assessment
15. Data Integrity -- Referential integrity checks
SaaS Platform (e.g., project management, CRM, collaboration tools)
CRITICAL (Week 1):
1. Permissions -- Multi-tenant isolation is existential
2. Security -- Auth, session management, injection
3. Concurrency -- Multi-user editing, real-time collaboration
4. State Machine -- Workflow states drive business logic
HIGH (Week 2):
5. Data Integrity -- Referential integrity across entities
6. Recovery -- User work must survive interruptions
7. Observability -- Multi-tenant debugging requires correlation
MEDIUM (Week 3):
8. Throughput -- Dashboard and listing performance at scale
9. Human Error -- Undo, soft delete, navigation safety
10. Idempotency -- Form re-submission, webhook handling
STANDARD (Week 4):
11. Reliability -- Background job resilience
12. Data Lifecycle -- Export/import, archive/restore
13. Cost Control -- Resource quotas per tenant
14. UI/UX Safety -- Error messaging, destructive action guards
15. Code Quality -- Technical debt assessment
E-Commerce Platform (e.g., online store, marketplace, subscription service)
CRITICAL (Week 1):
1. Idempotency -- Double-charge prevention is existential
2. Security -- Payment data, PII protection
3. State Machine -- Order lifecycle must be bulletproof
4. Concurrency -- Inventory race conditions, flash sales
HIGH (Week 2):
5. Reliability -- Checkout flow must not fail silently
6. Recovery -- Cart and checkout must survive interruptions
7. Permissions -- Customer data isolation
MEDIUM (Week 3):
8. Throughput -- Scale for traffic spikes (sales events)
9. Observability -- Payment debugging, order tracing
10. Data Integrity -- Order-inventory-payment consistency
STANDARD (Week 4):
11. Cost Control -- Shipping API costs, email volume
12. Data Lifecycle -- Order history, GDPR deletion
13. Human Error -- Order cancellation, refund safety
14. UI/UX Safety -- Checkout error handling
15. Code Quality -- Technical debt assessment
API Platform (e.g., developer API, integration platform, data service)
CRITICAL (Week 1):
1. Idempotency -- Clients will retry; every endpoint must be safe
2. Security -- API key management, rate limiting, auth
3. Throughput -- Performance under concurrent client load
4. Observability -- Per-client request tracing, error debugging
HIGH (Week 2):
5. Concurrency -- Concurrent writes from multiple clients
6. Reliability -- Webhook delivery, async job resilience
7. Cost Control -- Per-client quotas, abuse prevention
MEDIUM (Week 3):
8. State Machine -- Async operation status lifecycle
9. Permissions -- API key scoping, resource ownership
10. Recovery -- Long-running async operations
STANDARD (Week 4):
11. Data Integrity -- Referential integrity, bulk operations
12. Data Lifecycle -- Data retention, export, deletion (GDPR)
13. Human Error -- API misuse protection, clear error responses
14. UI/UX Safety -- Dashboard and admin panel safety
15. Code Quality -- SDK quality, documentation accuracy
Reusable Audit Template
Use this template for each category audit:
# [Category] Audit Report
## System: [System Name]
## Date: [YYYY-MM-DD]
## Auditor: [Name]
## Scope: [What was included/excluded]
## Executive Summary
[2-3 sentences: overall assessment and critical findings]
## Findings
### Finding 1: [Title]
- **Severity:** CRITICAL / HIGH / MEDIUM / LOW
- **Category:** [Which of the 15 categories]
- **Description:** [What was found]
- **Evidence:** [How it was observed/reproduced]
- **Impact:** [What could happen in production]
- **Recommendation:** [How to fix]
- **Effort:** [Estimated hours/days to fix]
- **Status:** OPEN / IN PROGRESS / FIXED / ACCEPTED RISK
### Finding 2: [Title]
[Same structure]
## Test Results
| Test ID | Description | Result | Notes |
|---------|-------------|--------|-------|
| TEST-XX-001 | [What was tested] | PASS / FAIL / PARTIAL | [Details] |
| TEST-XX-002 | [What was tested] | PASS / FAIL / PARTIAL | [Details] |
## Category Score
- **Score:** PASS / PARTIAL / FAIL
- **Critical findings:** [count]
- **High findings:** [count]
- **Medium findings:** [count]
## Recommendations
1. [Prioritized list of remediation actions]
2. ...
## Retest Plan
- [When and how fixes will be verified]
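The finding structure in the template above can also be captured in code, so reports stay uniform across auditors and can be generated programmatically. A minimal sketch; the field names mirror the template, and everything beyond that is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One audit finding, mirroring the report template fields."""
    title: str
    severity: str         # CRITICAL / HIGH / MEDIUM / LOW
    category: str         # one of the 15 audit categories
    description: str
    evidence: str
    impact: str
    recommendation: str
    effort: str           # estimated hours/days to fix
    status: str = "OPEN"  # OPEN / IN PROGRESS / FIXED / ACCEPTED RISK

    def to_markdown(self) -> str:
        """Render the finding in the template's markdown shape."""
        return "\n".join([
            f"### Finding: {self.title}",
            f"- **Severity:** {self.severity}",
            f"- **Category:** {self.category}",
            f"- **Description:** {self.description}",
            f"- **Evidence:** {self.evidence}",
            f"- **Impact:** {self.impact}",
            f"- **Recommendation:** {self.recommendation}",
            f"- **Effort:** {self.effort}",
            f"- **Status:** {self.status}",
        ])
```

Keeping findings as structured data (rather than free-form prose) also makes the scoring matrix and remediation tracking later in this checklist trivial to automate.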
Scoring Rubric
Per-Category Scoring
| Score | Criteria | Action Required |
|---|---|---|
| PASS | No CRITICAL or HIGH findings. All core tests pass. Minor issues only. | No blocking action. Address MEDIUM findings in normal sprint cycle. |
| PARTIAL | No CRITICAL findings. 1-2 HIGH findings with known workarounds. Most tests pass. | Fix HIGH findings within 2 sprints. Workarounds documented. |
| FAIL | Any CRITICAL finding. OR 3+ HIGH findings. OR core functionality broken. | Stop and fix before next release. May require incident response. |
Finding Severity Definitions
| Severity | Definition | Response Time |
|---|---|---|
| CRITICAL | Data loss, security breach, financial impact, or total feature failure likely in production. | Fix within 24-48 hours. May warrant hotfix. |
| HIGH | Significant reliability, data integrity, or user experience issue. Will affect multiple users. | Fix within current sprint (1-2 weeks). |
| MEDIUM | Moderate issue. Workarounds exist. Affects edge cases or non-critical paths. | Fix within next 2 sprints (2-4 weeks). |
| LOW | Minor issue. Cosmetic, optimization, or best-practice deviation. | Add to backlog. Fix opportunistically. |
Overall System Score
| Rating | Criteria |
|--------|----------|
| PRODUCTION READY | All 15 categories PASS. No CRITICAL or HIGH findings. |
| CONDITIONALLY READY | No CRITICAL findings. <= 3 categories PARTIAL. Workarounds in place. |
| NOT READY | Any CRITICAL finding. OR > 3 categories FAIL. |
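The overall rubric can be applied mechanically once per-category scores exist. A sketch, under two assumptions: scores are the strings "PASS" / "PARTIAL" / "FAIL", and a category FAIL without any system-level CRITICAL finding is resolved conservatively as NOT READY (the rubric above does not spell out that edge case):

```python
def overall_rating(category_scores: dict[str, str],
                   critical_findings: int) -> str:
    """Map per-category scores + CRITICAL finding count to the
    overall system rating from the rubric table."""
    fails = sum(1 for s in category_scores.values() if s == "FAIL")
    partials = sum(1 for s in category_scores.values() if s == "PARTIAL")

    # NOT READY: any CRITICAL finding, or more than 3 category FAILs.
    if critical_findings > 0 or fails > 3:
        return "NOT READY"
    # PRODUCTION READY: every category PASSes (a PASS already implies
    # no CRITICAL or HIGH findings per the per-category rubric).
    if fails == 0 and partials == 0:
        return "PRODUCTION READY"
    # CONDITIONALLY READY: no CRITICALs, at most 3 PARTIALs, no FAILs.
    if fails == 0 and partials <= 3:
        return "CONDITIONALLY READY"
    # Conservative fallback for ambiguous combinations (e.g. 1-3 FAILs).
    return "NOT READY"
```

Note the rubric also requires documented workarounds for CONDITIONALLY READY; that judgment call stays manual.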
Scoring Matrix Template
| # | Category | Score | Critical | High | Medium | Low | Notes |
|---|----------|-------|----------|------|--------|-----|-------|
| 1 | Security | | | | | | |
| 2 | UI/UX Safety | | | | | | |
| 3 | Code Quality | | | | | | |
| 4 | Data Integrity | | | | | | |
| 5 | Reliability | | | | | | |
| 6 | Throughput | | | | | | |
| 7 | Recovery | | | | | | |
| 8 | Concurrency | | | | | | |
| 9 | State Machine | | | | | | |
| 10 | Idempotency | | | | | | |
| 11 | Observability | | | | | | |
| 12 | Cost Control | | | | | | |
| 13 | Permissions | | | | | | |
| 14 | Data Lifecycle | | | | | | |
| 15 | Human Error | | | | | | |
| | **OVERALL** | | | | | | |
Timeline Estimation
Per-Category Audit Duration
| Category | Preparation | Testing | Analysis & Report | Total |
|---|---|---|---|---|
| Security | 2h | 4-8h | 2h | 1-1.5 days |
| UI/UX Safety | 1h | 2-4h | 1h | 0.5-1 day |
| Code Quality | 1h | 2-4h | 2h | 0.5-1 day |
| Data Integrity | 2h | 4-6h | 2h | 1-1.5 days |
| Reliability | 4h | 8-16h | 4h | 2-3 days |
| Throughput | 4h (data seeding) | 4-8h | 2h | 1.5-2 days |
| Recovery | 2h | 4-8h | 2h | 1-1.5 days |
| Concurrency | 2h | 4-8h | 2h | 1-1.5 days |
| State Machine | 2h | 4-6h | 2h | 1-1.5 days |
| Idempotency | 2h | 4-8h | 2h | 1-1.5 days |
| Observability | 1h | 2-4h | 2h | 0.5-1 day |
| Cost Control | 2h | 4-6h | 2h | 1-1.5 days |
| Permissions | 2h | 4-8h | 2h | 1-1.5 days |
| Data Lifecycle | 2h | 4-8h | 2h | 1-1.5 days |
| Human Error | 1h | 2-4h | 1h | 0.5-1 day |
Full Audit Timeline
Minimum (critical categories only, 5 categories): 5-8 business days
Standard (all 15 categories, sequential): 15-20 business days
Accelerated (all 15, 2 auditors parallel): 8-12 business days
Focused (4 categories based on system type): 4-6 business days
Recommended Cadence
Full audit: Quarterly (every 3 months) or before major releases
Focused audit: Monthly (rotate through 4 categories per month)
Spot check: After any significant architecture change
Continuous: Automated tests for critical categories in CI/CD
Audit Execution Playbook
Phase 1: Planning (Day 1)
1. Identify system type (AI pipeline, SaaS, e-commerce, API platform)
2. Select priority ordering from the templates above
3. Determine scope (full 15 or focused subset)
4. Gather prerequisites:
- Architecture diagram
- Data model / schema
- Deployment topology
- External dependency list
- Incident history (last 6 months)
- Known technical debt list
5. Set up test environment (staging or dedicated audit environment)
6. Prepare test data (seed scripts from throughput-scale-audit)
Phase 2: Discovery (Days 2-3)
1. Complete reliability-resilience-audit Task 1 (Map Workflows)
2. Complete reliability-resilience-audit Task 2 (Identify High-Risk Steps)
3. Review codebase for patterns:
- State management approach
- Error handling patterns
- External API integration patterns
- Queue/job processing patterns
4. Output: Risk-prioritized test plan
Phase 3: Testing (Days 4-12)
Execute category audits in priority order.
For each category:
1. Run all concrete test cases from the skill file
2. Document findings with evidence
3. Classify severity
4. Note quick wins (< 2 hours to fix)
Phase 4: Reporting (Days 13-15)
1. Complete scoring matrix for all categories
2. Write executive summary
3. Prioritize findings into:
- Fix immediately (CRITICAL)
- Fix this sprint (HIGH)
- Fix next sprint (MEDIUM)
- Backlog (LOW)
4. Create remediation plan with effort estimates
5. Present to stakeholders
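Step 3's bucketing is mechanical and worth scripting so nothing is mis-filed. A minimal sketch, with findings modeled as (title, severity) pairs; the bucket labels come from the list above:

```python
from collections import defaultdict

# Severity -> remediation bucket, per the Phase 4 prioritization.
BUCKETS = {
    "CRITICAL": "Fix immediately",
    "HIGH": "Fix this sprint",
    "MEDIUM": "Fix next sprint",
    "LOW": "Backlog",
}

def triage(findings: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (title, severity) findings into remediation buckets."""
    plan: dict[str, list[str]] = defaultdict(list)
    for title, severity in findings:
        plan[BUCKETS[severity]].append(title)
    return dict(plan)
```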
Phase 5: Remediation Tracking
1. Create tickets for all findings
2. Track fix progress
3. Schedule retest for CRITICAL and HIGH findings
4. Update audit scorecard as fixes are verified
Quick-Start: One-Day Focused Audit
If you only have one day, run this focused assessment:
Morning (4 hours):
1. Map top 3 critical workflows (reliability-resilience Task 1)
2. Run idempotency spot checks on payment/billing endpoints
3. Run state machine audit on the primary job lifecycle
4. Check for N+1 queries on main listing pages
Afternoon (4 hours):
5. Test cross-tenant data access (permission-drift, 3 endpoints)
6. Verify retry safety on one external API workflow
7. Check error message quality on 5 failure scenarios
8. Verify delete safety (confirmation + soft delete) on main entities
Output:
- Top 5 critical findings with severity
- Quick-win recommendations (< 2h fixes)
- Full audit recommendation with priority areas
Skill Cross-Reference
Each audit skill references others. Use this map to understand dependencies:
reliability-resilience-audit (comprehensive)
├── references: recovery-resume-audit (Task 4, 6)
├── references: idempotency-audit (Task 5)
├── references: state-machine-audit (Task 9)
├── references: observability-debuggability-audit (Task 10)
└── references: cost-explosion-audit (Task 7 safeguards)
concurrency-race-condition-audit
├── references: idempotency-audit (dedupe patterns)
└── references: state-machine-audit (transition locking)
data-lifecycle-audit
├── references: permission-drift-audit (duplication ACLs)
└── references: state-machine-audit (archive states)
human-error-operator-audit
├── references: data-lifecycle-audit (soft delete, versioning)
└── references: recovery-resume-audit (retry safety from user perspective)
cost-explosion-audit
├── references: idempotency-audit (duplicate API calls)
└── references: reliability-resilience-audit (retry storms)
throughput-scale-audit
├── references: observability-debuggability-audit (performance logging)
└── references: concurrency-race-condition-audit (concurrent load)
Automation Opportunities
Categories that can be partially automated in CI/CD:
HIGH AUTOMATION POTENTIAL:
[ ] Throughput: Automated load tests in CI (k6, artillery)
[ ] Idempotency: Automated duplicate-request tests
[ ] State Machine: Automated transition validation tests
[ ] Data Integrity: Automated orphan detection queries (scheduled)
[ ] Code Quality: Static analysis, linting, type checking
MEDIUM AUTOMATION POTENTIAL:
[ ] Security: SAST/DAST scanning, dependency audit
[ ] Observability: Automated log format validation
[ ] Permissions: Automated cross-tenant access tests
LOW AUTOMATION POTENTIAL (manual testing required):
[ ] Human Error: UX safety review
[ ] Recovery: Failure injection testing (chaos engineering)
[ ] Cost Control: Cost estimation review
[ ] Concurrency: Race condition testing (hard to automate reliably)
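The scheduled orphan-detection item can be as simple as a LEFT JOIN that flags child rows whose parent no longer exists. A sketch against SQLite with assumed table and column names (`projects`, `assets`, `project_id`); adapt to your schema:

```python
import sqlite3

# Build a tiny in-memory schema with one deliberately orphaned row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY);
    CREATE TABLE assets (id INTEGER PRIMARY KEY, project_id INTEGER);
    INSERT INTO projects VALUES (1);
    INSERT INTO assets VALUES (10, 1), (11, 2);  -- project 2 is gone
""")

# The detection query: child rows whose parent row is missing.
orphans = conn.execute("""
    SELECT a.id FROM assets a
    LEFT JOIN projects p ON p.id = a.project_id
    WHERE p.id IS NULL
""").fetchall()
print(orphans)  # -> [(11,)]
```

Run one such query per parent-child relationship on a schedule and alert when the result set is non-empty; a growing orphan count usually points at a missing cascade or a buggy delete path.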
What Earlier Audits Miss
Standard checklists verify individual concerns in isolation. This master checklist matters because:
- Individual audit skills each cover one domain. Without this meta-skill, auditors miss cross-cutting concerns (e.g., a permission bug that causes a cost explosion via unauthorized API access).
- Ad-hoc audits focus on the area that last caused an incident. This framework ensures systematic coverage so silent risks are not ignored.
- One-time audits produce a report that goes stale. This framework provides a repeatable cadence with scoring to track improvement.
- Generic checklists apply the same priority to every system. This framework provides system-type-specific prioritization.
- Team-dependent knowledge means audits vary by who runs them. This framework standardizes methodology so results are comparable across auditors and time periods.
In short: this master checklist provides a unified framework for systematically evaluating production readiness across all 15 audit categories, with system-type-specific prioritization and scoring.
Risk Pattern Table
| Pattern | What It Hits | Risk | Symptom |
|---|---|---|---|
| No structured audit process | All categories | HIGH | Bugs found reactively (by users) instead of proactively |
| Audit focused only on last incident | Coverage gaps | HIGH | Same categories tested repeatedly; others never tested |
| No severity classification | Prioritization | MEDIUM | All findings treated equally; critical bugs not prioritized |
| Audit results not tracked over time | Improvement | MEDIUM | Same findings rediscovered each quarter |
| No system-type prioritization | Efficiency | MEDIUM | SaaS system audited with e-commerce priorities |
| Single auditor, no review | Quality | MEDIUM | Blind spots in auditor's expertise areas |
| No retest after remediation | Verification | HIGH | Fixes assumed correct; regressions not caught |
| Audit scope too narrow | Coverage | HIGH | Only API tested; background jobs, workers, and admin panels skipped |
Final Notes
This checklist is a living document. After each audit:
- Update the scoring matrix with current results.
- Track improvement over time (quarter-over-quarter).
- Add system-specific test cases discovered during testing.
- Document false positives to avoid re-investigation.
- Share findings across teams to prevent the same bugs in other systems.
The goal is not perfection on day one. The goal is systematic improvement with clear visibility into what is safe and what is not.
Install this skill directly: skilldb add production-audit-skills