
Cost Explosion Audit


Purpose

Verify that bugs, retries, and design flaws cannot create runaway spend. In systems that call paid external APIs (AI providers, cloud storage, SMS, email), a single bug can turn a $10 operation into a $10,000 incident. This audit identifies every path where costs can explode and verifies that safeguards exist.


Scope

| Cost Vector | What We Test |
|-------------|--------------|
| External API calls | Duplicate calls, retry storms, unnecessary invocations |
| Cloud compute | Runaway processes, zombie instances, autoscale limits |
| Storage | Unbounded growth, duplicate files, missing cleanup |
| Bandwidth | Large payload transfers, unnecessary downloads |
| Third-party services | SMS/email volume, search API calls, CDN bandwidth |
| Database | Oversized queries, missing indexes causing full scans |

Risk Pattern Table

| Pattern | What It Hits | Risk | Symptom |
|---------|--------------|------|---------|
| Infinite retry loop | API costs, compute | CRITICAL | Failed call retries forever; provider bill spikes |
| Duplicate generation on retry | API costs | HIGH | User retries, system generates twice, charged twice |
| Preview triggers paid API | API costs | HIGH | Every preview click calls the expensive API |
| No caching of API results | API costs | HIGH | Same prompt re-generated repeatedly instead of cached |
| Storage never cleaned up | Storage costs | MEDIUM | Orphaned files accumulate; storage bill grows monotonically |
| Autoscale without ceiling | Compute costs | CRITICAL | Traffic spike scales to 100 instances; $1000/hour |
| Batch retry = full re-run | API costs | HIGH | 1 of 20 items fails; retry re-runs all 20 |
| Verbose logging in production | Storage, bandwidth | MEDIUM | DEBUG-level logs consume GB/day of log storage |
| Missing query pagination | Compute, DB | MEDIUM | Full-table scans on every request; DB costs spike |
| Webhook retry causing re-processing | API costs | HIGH | Provider retries webhook; system re-generates assets |

Concrete Test Cases

TEST-CE-001: Pipeline Double-Run Cost Comparison

Objective: Verify that running a pipeline twice does not double the billable API calls.

Steps:

  1. Run a generation pipeline on a project (e.g., 10 assets).
  2. Record: number of external API calls, total tokens/compute units, estimated cost.
  3. Run the exact same pipeline again (same project, same parameters).
  4. Record: number of additional external API calls.

Pass Criteria:

  • Second run makes ZERO additional API calls (results cached/reused).
  • OR second run is blocked ("Assets already generated. Regenerate?").
  • OR second run costs less than 10% of first (only changed items regenerated).

Fail Criteria:

  • Second run makes the same number of API calls as the first.
  • No caching mechanism exists.
  • User is not warned about cost of regeneration.

Cost Tracking Template:

| Run | API Calls | Tokens Used | Estimated Cost | New Assets | From Cache |
|-----|-----------|-------------|---------------|------------|------------|
| 1st | 10        | 15,000      | $0.45         | 10         | 0          |
| 2nd | 0         | 0           | $0.00         | 0          | 10         | <- PASS
| 2nd | 10        | 15,000      | $0.45         | 10         | 0          | <- FAIL
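
One way to automate the double-run comparison is to route every billable call through a counting stand-in. The sketch below is illustrative only: `CountingClient`, `run_pipeline`, and the in-memory `store` are hypothetical names, and a real pipeline would persist generated assets somewhere durable rather than in a dict.

```python
import hashlib
import json


class CountingClient:
    """Hypothetical stand-in for a paid provider; counts every billable call."""

    def __init__(self):
        self.calls = 0

    def generate(self, prompt):
        self.calls += 1
        return f"asset-for:{prompt}"


def run_pipeline(prompts, client, store):
    """Generate one asset per prompt, skipping prompts already in `store`.

    Returns the number of assets served from the store instead of the API.
    """
    from_cache = 0
    for prompt in prompts:
        key = hashlib.sha256(json.dumps(prompt).encode("utf-8")).hexdigest()
        if key in store:
            from_cache += 1
            continue
        store[key] = client.generate(prompt)
    return from_cache


client = CountingClient()
store = {}
prompts = [f"scene {i}" for i in range(10)]

run_pipeline(prompts, client, store)                     # 1st run
calls_after_first = client.calls
cached_on_second = run_pipeline(prompts, client, store)  # 2nd run, same inputs
additional_calls = client.calls - calls_after_first

print(f"1st run: {calls_after_first} calls")
print(f"2nd run: {additional_calls} calls, {cached_on_second} from cache")
```

The second run passes the test when `additional_calls` is zero. A pipeline that skips the `key in store` check would report ten additional calls, which is the FAIL row in the template.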

TEST-CE-002: Retry Ceiling on External API Failures

Objective: Verify that failed external API calls have a bounded retry limit.

Steps:

  1. Simulate a persistent external API failure (mock 500 or timeout).
  2. Trigger a pipeline that calls this API.
  3. Observe retry behavior.
  4. Count total API calls made.

Pass Criteria:

  • Maximum retry count is configured (e.g., 3 retries).
  • Exponential backoff between retries.
  • After max retries, job moves to "failed" state (not infinite retry).
  • Total API calls = initial + max_retries (e.g., 4 total for 3 retries).
  • Retry metrics are logged (attempt number, delay, error).

Fail Criteria:

  • No retry limit; calls continue indefinitely.
  • Immediate or fixed-interval retries (no backoff); thundering herd on provider recovery.
  • Retry count > 10 for a single item.
  • Retries continue even after circuit breaker should have tripped.

Retry Configuration Audit:

| Operation | Max Retries | Backoff | Base Delay | Max Delay | Circuit Breaker |
|-----------|------------|---------|-----------|-----------|-----------------|
| AI generation | 3 | Exponential | 1s | 30s | After 5 failures in 1min |
| Storage upload | 3 | Exponential | 500ms | 10s | N/A |
| Webhook delivery | 5 | Exponential | 1s | 5min | N/A |
| Payment API | 2 | Exponential | 2s | 10s | After 3 failures in 1min |
| Email send | 3 | Linear | 5s | 15s | After 10 failures in 5min |
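
The pass criteria above can be expressed as a small retry wrapper. This is a minimal sketch under stated assumptions: `call_with_retries`, `backoff_delays`, and `RetriesExhausted` are hypothetical names, the defaults mirror the "AI generation" row of the table, and the `sleep` parameter is injectable so a test can run without waiting.

```python
import time


class RetriesExhausted(Exception):
    """Raised when the call still fails after the configured retry ceiling."""


def backoff_delays(max_retries, base_delay, max_delay):
    """Exponential schedule: base, 2*base, 4*base, ... capped at max_delay."""
    return [min(base_delay * 2 ** attempt, max_delay) for attempt in range(max_retries)]


def call_with_retries(fn, max_retries=3, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn() at most max_retries + 1 times, backing off between attempts."""
    delays = backoff_delays(max_retries, base_delay, max_delay)
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:  # a real client would catch only retryable errors
            last_error = exc
            if attempt < max_retries:
                sleep(delays[attempt])
    raise RetriesExhausted(f"gave up after {max_retries + 1} calls") from last_error


# Simulate a persistent provider failure and count the calls made.
calls = []
slept = []


def always_500():
    calls.append(1)
    raise RuntimeError("HTTP 500")


try:
    call_with_retries(always_500, sleep=slept.append)
except RetriesExhausted:
    print(f"failed after {len(calls)} calls; delays used: {slept}")
```

The sketch omits jitter to keep the schedule deterministic. Adding a random component to each delay is a common way to avoid synchronized retries when the provider recovers.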

TEST-CE-003: Cache Effectiveness for API Results

Objective: Verify that cacheable API results are actually cached and reused.

Steps:

  1. Make an API call with specific parameters. Record the result.
  2. Make the same call with the same parameters.
  3. Verify: was the cached result used, or was a new API call made?

Cache Verification Points:

[ ] Generation results cached by (input_hash, model, parameters)
[ ] Cache checked BEFORE making external API call
[ ] Cache hit rate logged and monitored
[ ] Cache invalidation works (changed parameters = new call)
[ ] Cache has TTL appropriate for content type
[ ] Cache storage cost is less than API re-call cost

Cache Hit Rate Targets:

| Content Type | Target Cache Hit Rate | Cache TTL |
|-------------|----------------------|-----------|
| Image generation (same prompt) | > 95% | 30 days |
| Text generation (same prompt) | > 90% | 7 days |
| Search/lookup results | > 80% | 1 hour |
| User profile/permissions | > 95% | 5 minutes |
| Configuration/settings | > 99% | 1 minute |
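
The verification points above (key by input, model, and parameters; check the cache before the external call; expire by TTL; log hit rate) might look like the following. It is an illustrative in-memory sketch: `cache_key`, `TTLCache`, and the model name are hypothetical, and the clock is injectable so expiry can be tested without waiting.

```python
import hashlib
import json
import time


def cache_key(prompt, model, parameters):
    """Stable key over everything that changes the result of the call."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "parameters": parameters},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory cache that is consulted before the paid call is made."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_call(self, key, producer):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = producer()  # the only place the external API is invoked
        self._entries[key] = (now, value)
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
api_calls = []


def paid_call():
    api_calls.append(1)
    return "image-bytes"


key = cache_key("a red barn", "example-model", {"size": "1024x1024"})
cache.get_or_call(key, paid_call)  # miss: calls the API
cache.get_or_call(key, paid_call)  # hit: no API call
print(f"API calls: {len(api_calls)}, hit rate: {cache.hit_rate():.0%}")
```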

TEST-CE-004: Preview vs Production API Usage

Objective: Verify that preview/draft operations do not invoke paid APIs.

Steps:

  1. Navigate through all preview/draft features in the UI.
  2. Monitor external API calls during preview actions.
  3. Compare with production/generate actions.

Pass Criteria:

  • Previews use free/cheap alternatives (thumbnails, placeholders, cached results).
  • Preview actions are clearly distinct from generate actions in code.
  • No paid API call is triggered without explicit user confirmation.
  • Draft saves do not trigger generation pipelines.

Fail Criteria:

  • Loading a preview page triggers a paid API call.
  • Typing in a prompt field triggers live API calls (without debounce/explicit submit).
  • Auto-save triggers regeneration.

TEST-CE-005: Storage Growth Analysis

Objective: Verify that storage usage is bounded and orphaned files are cleaned up.

Steps:

  1. Record current storage usage.
  2. Create and delete 10 projects with assets.
  3. Record storage usage after deletion.
  4. Wait for any cleanup jobs to run.
  5. Record final storage usage.

Pass Criteria:

  • Storage after create+delete is within 10% of starting usage.
  • Orphaned files are cleaned up by automated process.
  • Cleanup job runs on a schedule (daily or weekly).
  • Temporary files (uploads, processing intermediates) are cleaned up within 24 hours.
  • Storage growth rate is tracked and alerted on anomalies.

Storage Audit Checklist:

[ ] Every file creation has a corresponding deletion path
[ ] Temporary files have TTL or cleanup job
[ ] Failed upload files are cleaned up
[ ] Replaced assets have old versions cleaned up (or explicitly versioned)
[ ] Deleted entity assets are cleaned up (cascade delete or cleanup job)
[ ] Storage usage is monitored per tenant/project
[ ] Storage quotas exist per tenant (prevent single user from consuming TB)
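
Orphan detection reduces to a set difference between what the bucket holds and what live database rows reference. The sketch below assumes both listings are already available in memory. `find_orphans` and the sample keys are hypothetical, and a real audit would page through the bucket listing and the database instead.

```python
def find_orphans(stored_objects, referenced_keys):
    """Return (orphan_keys, orphan_bytes) for objects no live record points to.

    stored_objects: mapping of object key -> size in bytes (bucket listing).
    referenced_keys: set of keys referenced by non-deleted database rows.
    """
    orphan_keys = sorted(key for key in stored_objects if key not in referenced_keys)
    orphan_bytes = sum(stored_objects[key] for key in orphan_keys)
    return orphan_keys, orphan_bytes


# Hypothetical listing: two live assets, one leftover from a deleted project,
# and one abandoned temporary upload.
stored = {
    "projects/1/a.png": 2_000_000,
    "projects/1/b.png": 3_000_000,
    "projects/2/old.png": 4_000_000,
    "tmp/upload-123.part": 1_000_000,
}
referenced = {"projects/1/a.png", "projects/1/b.png"}

orphan_keys, orphan_bytes = find_orphans(stored, referenced)
waste_ratio = orphan_bytes / sum(stored.values())
print(f"{len(orphan_keys)} orphaned objects, {orphan_bytes} bytes ({waste_ratio:.0%} of storage)")
```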

TEST-CE-006: Autoscale Ceiling Verification

Objective: Verify that autoscaling has maximum limits to prevent cost explosion.

Steps:

  1. Check autoscale configuration for all compute resources.
  2. Verify maximum instance/replica count is configured.
  3. Calculate worst-case cost at maximum scale.
  4. Verify this cost is acceptable.

Autoscale Audit:

| Resource | Min Instances | Max Instances | Scale Metric | Cost at Max | Acceptable? |
|----------|-------------|-------------|-------------|------------|-------------|
| API server | 1 | 10 | CPU > 70% | $X/hour | [ ] Yes [ ] No |
| Workers | 1 | 5 | Queue depth > 50 | $X/hour | [ ] Yes [ ] No |
| Database | 1 | 1 (fixed) | N/A | $X/month | [ ] Yes [ ] No |

Pass Criteria:

  • Every autoscaling resource has a maximum limit.
  • Maximum scale cost is known and documented.
  • Alerts fire before reaching maximum scale.
  • Scale-down is configured (not just scale-up).
  • Billing alerts configured at 50%, 80%, 100% of budget.
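
Steps 2 to 4 can be checked mechanically once the scaling configuration is exported. The following is a sketch, not a provider integration: `ScalingPolicy`, `audit_scaling`, the hourly rates, and the budget are all made-up values for illustration, and the budget is compared per resource for simplicity.

```python
from dataclasses import dataclass
from typing import Optional

HOURS_PER_MONTH = 730  # approximation: 365 days * 24 hours / 12 months


@dataclass
class ScalingPolicy:
    name: str
    max_instances: Optional[int]  # None means no ceiling is configured
    hourly_rate: float            # assumed price per instance-hour


def audit_scaling(policies, monthly_budget):
    """Return (name, verdict, monthly cost at max scale or None) per resource."""
    findings = []
    for policy in policies:
        if policy.max_instances is None:
            findings.append((policy.name, "FAIL: no ceiling", None))
            continue
        cost_at_max = policy.max_instances * policy.hourly_rate * HOURS_PER_MONTH
        verdict = "PASS" if cost_at_max <= monthly_budget else "FAIL: over budget"
        findings.append((policy.name, verdict, cost_at_max))
    return findings


policies = [
    ScalingPolicy("api-server", max_instances=10, hourly_rate=0.10),
    ScalingPolicy("workers", max_instances=None, hourly_rate=0.20),
]
findings = audit_scaling(policies, monthly_budget=1000.0)
for name, verdict, cost in findings:
    print(name, verdict, cost)
```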

TEST-CE-007: Per-User / Per-Tenant Cost Isolation

Objective: Verify that one user's activity cannot cause unbounded costs.

Steps:

  1. Check: are there per-user rate limits on expensive operations?
  2. Check: are there per-user quotas on resource consumption?
  3. Simulate: one user triggers 100 generations in rapid succession.

Pass Criteria:

  • Rate limits exist on all paid API triggers (e.g., 10 generations/hour/user).
  • Quotas exist on storage (e.g., 10GB per project).
  • Exceeding limits returns clear message (not silent failure or cost pass-through).
  • Admin can view per-user resource consumption.
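
A per-user limit such as "10 generations/hour/user" can be enforced with a sliding window. This is a single-process sketch with hypothetical names (`SlidingWindowLimiter`, `allow`). A multi-instance deployment would need shared state, which is outside what this example shows.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per user within the trailing window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._events = defaultdict(deque)  # user_id -> timestamps of allowed events

    def allow(self, user_id):
        now = self.clock()
        events = self._events[user_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False  # caller should return a clear "limit exceeded" message
        events.append(now)
        return True


# Simulate step 3: one user fires 100 generations in rapid succession.
fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=10, window_seconds=3600, clock=lambda: fake_now[0])
allowed = sum(1 for _ in range(100) if limiter.allow("user-a"))
print(f"allowed {allowed} of 100 requests")
```

Only the allowed requests should reach the paid API. The rejected ones cost nothing beyond the limiter check itself.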

Worst-Case Cost Estimation Methodology

For every paid operation, calculate:

Worst Case = (max_retries + 1) * cost_per_call * max_concurrent_operations * max_items_per_operation

Example: AI Image Generation
  cost_per_call = $0.04 (DALL-E 3)
  max_retries = 3
  max_concurrent_operations = 5 (users)
  max_items_per_operation = 100 (assets per project)

  Worst case per incident = (3+1) * $0.04 * 5 * 100 = $80

  With infinite retry bug:
  Worst case = UNBOUNDED (retries forever until provider blocks or budget exhausted)

Cost Estimation Template:

| Operation | Per-Call Cost | Max Retries | Max Concurrent | Max Items | Normal Cost | Worst Case | With Bug |
|-----------|-------------|------------|---------------|-----------|-------------|------------|----------|
| AI generation | $0.04 | 3 | 5 | 100 | $20 | $80 | UNBOUNDED |
| Storage upload | $0.0001 | 3 | 10 | 200 | $0.20 | $0.80 | ~$1 |
| Email send | $0.001 | 3 | 50 | 1 | $0.05 | $0.20 | ~$50/day |
| SMS notify | $0.05 | 2 | 10 | 1 | $0.50 | $1.50 | ~$500/day |
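
The formula above translates directly into code. The sketch uses `Decimal` so that the dollar amounts match the table exactly. The function names are hypothetical, and the inputs are the AI generation and SMS rows from the template.

```python
from decimal import Decimal


def normal_cost(cost_per_call, max_concurrent, max_items):
    """Cost when every call succeeds on the first attempt."""
    return Decimal(cost_per_call) * max_concurrent * max_items


def worst_case_cost(cost_per_call, max_retries, max_concurrent, max_items):
    """Cost when every call exhausts its retries: (max_retries + 1) attempts each."""
    return (max_retries + 1) * normal_cost(cost_per_call, max_concurrent, max_items)


# AI generation row: $0.04 per call, 3 retries, 5 concurrent users, 100 items.
ai_normal = normal_cost("0.04", 5, 100)
ai_worst = worst_case_cost("0.04", 3, 5, 100)

# SMS row: $0.05 per call, 2 retries, 10 concurrent, 1 item.
sms_worst = worst_case_cost("0.05", 2, 10, 1)

print(f"AI generation: normal ${ai_normal}, worst case ${ai_worst}")
print(f"SMS notify: worst case ${sms_worst}")
```

The bound only holds when `max_retries` is actually enforced. With an unbounded retry loop there is no finite value to pass in, which is why the "With Bug" column cannot be computed this way.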

Circuit Breaker Implementation

When an external service fails repeatedly, stop calling it:

States:
  CLOSED (normal): Calls go through. Track failure count.
  OPEN (tripped): All calls immediately fail without making external call.
  HALF-OPEN (testing): Allow one test call at a time to check if service recovered.

Configuration:
  failure_threshold: 5 failures in 60 seconds -> OPEN
  open_duration: 30 seconds -> HALF-OPEN
  success_threshold: 2 successes in HALF-OPEN -> CLOSED

Benefits:
  - Prevents cost accumulation during outage
  - Reduces load on struggling external service
  - Fast failure for users (no waiting for timeout)
  - Automatic recovery when service returns
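
The three states and the configuration above can be sketched as a small state machine. This is an illustrative, single-threaded version with hypothetical names (`CircuitBreaker`, `CircuitOpenError`). The defaults copy the thresholds listed above, and the clock is injectable for testing.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised instead of making the external call while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, failure_window=60.0,
                 open_duration=30.0, success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.failure_window = failure_window
        self.open_duration = open_duration
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "CLOSED"
        self._failures = deque()  # timestamps of recent failures while CLOSED
        self._opened_at = 0.0
        self._half_open_successes = 0

    def call(self, fn):
        now = self.clock()
        if self.state == "OPEN":
            if now - self._opened_at < self.open_duration:
                raise CircuitOpenError("circuit open; external call skipped")
            self.state = "HALF_OPEN"  # open_duration elapsed: allow a test call
            self._half_open_successes = 0
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            raise
        self._on_success()
        return result

    def _on_failure(self, now):
        if self.state == "HALF_OPEN":
            self._trip(now)  # test call failed: reopen immediately
            return
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.failure_window:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self._half_open_successes += 1
            if self._half_open_successes >= self.success_threshold:
                self.state = "CLOSED"
                self._failures.clear()

    def _trip(self, now):
        self.state = "OPEN"
        self._opened_at = now
        self._failures.clear()
```

While the breaker is OPEN, `call` raises `CircuitOpenError` before `fn` runs, so no billable request is made during the outage.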

Billing Alert Template

alerts:
  - name: "daily_api_cost_warning"
    condition: "daily_api_spend > $50"
    severity: "warning"
    action: "notify_team_channel"

  - name: "daily_api_cost_critical"
    condition: "daily_api_spend > $200"
    severity: "critical"
    action: "page_oncall + disable_generation"

  - name: "per_user_spend_anomaly"
    condition: "user_daily_spend > 10x user_average"
    severity: "warning"
    action: "rate_limit_user + notify_team"

  - name: "retry_storm_detected"
    condition: "retry_count > 100 in 5 minutes"
    severity: "critical"
    action: "trip_circuit_breaker + page_oncall"

  - name: "storage_growth_anomaly"
    condition: "daily_storage_delta > 10GB"
    severity: "warning"
    action: "notify_team"
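
The conditions in the template are written as pseudo-expressions. As a rough illustration of how they might be evaluated, the sketch below implements the daily spend thresholds, the per-user anomaly rule, and the 50%/80%/100% budget checkpoints. All function names are hypothetical, and wiring the results to paging or notification is left out.

```python
def classify_daily_spend(daily_spend, warning_at=50.0, critical_at=200.0):
    """Map a day's API spend to the severities used in the template."""
    if daily_spend > critical_at:
        return "critical"
    if daily_spend > warning_at:
        return "warning"
    return "ok"


def is_spend_anomaly(user_daily_spend, user_average, factor=10.0):
    """True when one user's spend exceeds `factor` times the per-user average."""
    return user_average > 0 and user_daily_spend > factor * user_average


def crossed_budget_thresholds(spend, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions that the current spend has reached."""
    return [fraction for fraction in thresholds if spend >= fraction * budget]


print(classify_daily_spend(75.0))
print(is_spend_anomaly(user_daily_spend=30.0, user_average=2.0))
print(crossed_budget_thresholds(spend=850.0, budget=1000.0))
```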

Post-Audit Checklist

[ ] All external API calls have retry limits (max 3-5)
[ ] Exponential backoff configured on all retries
[ ] Circuit breaker implemented for external services
[ ] Results cached where possible (cache before API call)
[ ] Preview/draft operations do not trigger paid APIs
[ ] Storage cleanup automated for orphaned/temporary files
[ ] Autoscale has maximum limits on all resources
[ ] Per-user rate limits on expensive operations
[ ] Billing alerts configured at 50%, 80%, 100% of budget
[ ] Worst-case cost estimated and documented for each operation
[ ] Duplicate pipeline runs are blocked or use cached results
[ ] Batch retry targets only failed items (not full batch)
[ ] Log verbosity is appropriate for production (not DEBUG)
[ ] Storage quotas exist per tenant
[ ] Cost per operation is logged and aggregable
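
The checklist item on batch retries is easy to get wrong, so here is a minimal sketch of retrying only the failed items. `run_batch` and `flaky` are hypothetical names. The example simulates 20 items where one fails once, and compares the call count against a full re-run.

```python
def run_batch(items, process, max_rounds=3):
    """Process items, retrying only those that failed in the previous round.

    Returns (results, still_failed, total_calls).
    """
    results = {}
    pending = list(items)
    total_calls = 0
    for _ in range(max_rounds):
        failed = []
        for item in pending:
            total_calls += 1
            try:
                results[item] = process(item)
            except Exception:  # a real pipeline would catch only retryable errors
                failed.append(item)
        pending = failed
        if not pending:
            break
    return results, pending, total_calls


attempts = {}


def flaky(item):
    """Simulated paid call: item 7 fails on its first attempt only."""
    attempts[item] = attempts.get(item, 0) + 1
    if item == 7 and attempts[item] == 1:
        raise RuntimeError("transient failure")
    return f"asset-{item}"


results, still_failed, total_calls = run_batch(range(20), flaky)
print(f"{total_calls} calls for {len(results)} assets; unresolved: {still_failed}")
```

Here the retry costs one extra call. Re-running the whole batch after the single failure would have cost 40 calls in total.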

What Earlier Audits Miss

Standard testing verifies functionality and performance. This audit matters because:

  • Unit tests mock external APIs. They never discover that a retry bug calls the real API 1,000 times.
  • Integration tests run against sandboxes with unlimited quotas. They never catch missing rate limits.
  • Performance tests measure speed, not cost. A fast system that makes 10x unnecessary API calls looks healthy in load tests.
  • Code reviews check that retry logic exists but rarely verify that retries have a ceiling and exponential backoff.
  • Monitoring dashboards track error rates and latency but rarely track cost-per-request or daily API spend.

A Cost Explosion Audit closes these gaps: it tests specifically whether bugs, retries, and design flaws can create runaway spend through retry storms, duplicate operations, missing caches, and unbounded autoscaling.


Automation Opportunities

| Test | Automatable? | Method |
|------|--------------|--------|
| TEST-CE-001: Double-run cost | YES | Run pipeline twice, count external API calls via mock/log |
| TEST-CE-002: Retry ceiling | YES | Mock persistent failure, count total retries, assert bounded |
| TEST-CE-003: Cache effectiveness | YES | Same request twice, assert second uses cache (no external call) |
| TEST-CE-004: Preview API usage | PARTIAL | Monitor API calls during preview interactions |
| TEST-CE-005: Storage growth | YES | Create + delete cycle, compare storage usage |
| TEST-CE-006: Autoscale ceiling | YES | Check infrastructure config for max instance limits |
| TEST-CE-007: Per-user cost isolation | YES | Rapid-fire expensive operations, assert rate limit |

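# Automated retry ceiling verification.
# Precondition: the external API is mocked to always return 500, the call log
# has been truncated, and the pipeline has been triggered exactly once.
CALL_COUNT=$(grep -c "POST /v1/images/generate" /var/log/external_calls.log)
MAX_EXPECTED=4  # 1 initial call + 3 retries
# Zero calls means the pipeline never ran, so that is treated as a failure too.
if [ "$CALL_COUNT" -ge 1 ] && [ "$CALL_COUNT" -le "$MAX_EXPECTED" ]; then
  echo "PASS: $CALL_COUNT calls"
else
  echo "FAIL: $CALL_COUNT calls (expected 1-$MAX_EXPECTED)"
fi

# Automated storage orphan cost check.
# gsutil du -s reports the total bytes stored under the prefix; subtracting the
# bytes referenced by live database rows approximates the orphaned bytes.
STORED_BYTES=$(gsutil du -s gs://bucket/assets/ | awk '{print $1}')
REFERENCED_BYTES=$(psql -t -A -c "SELECT COALESCE(SUM(file_size_bytes), 0) FROM assets WHERE deleted_at IS NULL")
WASTE=$((STORED_BYTES - REFERENCED_BYTES))
echo "Potential storage waste: $((WASTE / 1024 / 1024)) MB"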

Reusable Audit Report Template

# Cost Explosion Audit Report

## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________
## Current Monthly Spend: $___

## Cost Vectors Identified
| Vector | Monthly Cost | Growth Rate | Safeguards | Risk |
|--------|-------------|-------------|------------|------|
| AI API calls | $___ | ___/month | rate limit/cache/circuit breaker | |
| Storage | $___ | ___GB/month | cleanup job/quotas | |
| Compute | $___ | N/A | autoscale ceiling | |

## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-CE-001 | Double-run cost | PASS/FAIL | Second run API calls: ___ |
| TEST-CE-002 | Retry ceiling | PASS/FAIL | Max retries observed: ___ |
| TEST-CE-003 | Cache effectiveness | PASS/FAIL | Cache hit rate: ___% |
| TEST-CE-004 | Preview API usage | PASS/FAIL | Paid calls during preview: ___ |
| TEST-CE-005 | Storage growth | PASS/FAIL | Orphaned storage: ___ MB |
| TEST-CE-006 | Autoscale ceiling | PASS/FAIL | Max instances configured: ___ |
| TEST-CE-007 | Per-user cost isolation | PASS/FAIL | Rate limit enforced: yes/no |

## Worst-Case Cost Scenarios
| Scenario | Estimated Cost | Safeguard | Verdict |
|----------|---------------|-----------|---------|
| Infinite retry on AI API | $___ | Circuit breaker: yes/no | |
| All users generate simultaneously | $___ | Rate limit: yes/no | |
| Storage never cleaned up (1 year) | $___ | Cleanup job: yes/no | |

## Score: PASS / PARTIAL / FAIL

Priority Targeting

Run this audit FIRST if:

  • The system calls paid AI APIs (OpenAI, Anthropic, etc.)
  • Cloud bill has increased unexpectedly
  • Users can trigger expensive operations without confirmation
  • No rate limits exist on generation endpoints
  • Retry logic was recently modified
  • Storage costs are growing faster than user count
  • No billing alerts are configured
