Cost Explosion Audit
Verify that bugs, retries, and design flaws cannot create runaway spend. In systems that call paid external APIs (AI providers, cloud storage, SMS, email), a single bug can turn a $10 operation into a $10,000 incident. This audit identifies every path where costs can explode and verifies that safeguards exist.
Scope
| Cost Vector | What We Test |
|---|---|
| External API calls | Duplicate calls, retry storms, unnecessary invocations |
| Cloud compute | Runaway processes, zombie instances, autoscale limits |
| Storage | Unbounded growth, duplicate files, missing cleanup |
| Bandwidth | Large payload transfers, unnecessary downloads |
| Third-party services | SMS/email volume, search API calls, CDN bandwidth |
| Database | Oversized queries, missing indexes causing full scans |
Risk Pattern Table
| Pattern | What It Hits | Risk | Symptom |
|---|---|---|---|
| Infinite retry loop | API costs, compute | CRITICAL | Failed call retries forever; provider bill spikes |
| Duplicate generation on retry | API costs | HIGH | User retries, system generates twice, charged twice |
| Preview triggers paid API | API costs | HIGH | Every preview click calls the expensive API |
| No caching of API results | API costs | HIGH | Same prompt re-generated repeatedly instead of cached |
| Storage never cleaned up | Storage costs | MEDIUM | Orphaned files accumulate; storage bill grows monotonically |
| Autoscale without ceiling | Compute costs | CRITICAL | Traffic spike scales to 100 instances; $1000/hour |
| Batch retry = full re-run | API costs | HIGH | 1 of 20 items fails; retry re-runs all 20 |
| Verbose logging in production | Storage, bandwidth | MEDIUM | DEBUG-level logs consume GB/day of log storage |
| Missing query pagination | Compute, DB | MEDIUM | Full-table scans on every request; DB costs spike |
| Webhook retry causing re-processing | API costs | HIGH | Provider retries webhook; system re-generates assets |
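The "Batch retry = full re-run" pattern in the table above can be avoided by recording which items already succeeded and skipping them on retry. The following is a minimal sketch under that assumption; `process_batch`, the `done` dictionary, and the item IDs are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Iterable, List


def process_batch(items: Iterable[str],
                  process: Callable[[str], str],
                  done: Dict[str, str]) -> List[str]:
    """Process only items missing from `done`; return the IDs that failed.

    `done` maps item ID -> result and persists across calls, so a retry
    of the same batch pays only for the items that failed earlier.
    """
    failed: List[str] = []
    for item in items:
        if item in done:
            continue  # already paid for; never call the API again
        try:
            done[item] = process(item)
        except Exception:
            # Sketch only: real code would catch transient errors specifically.
            failed.append(item)
    return failed
```

With 20 items and one transient failure, the first pass makes 20 calls and the retry makes 1, rather than 20 again. In a real system `done` would live in durable storage (a database table or job record) instead of an in-memory dictionary.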
Concrete Test Cases
TEST-CE-001: Pipeline Double-Run Cost Comparison
Objective: Verify that running a pipeline twice does not double the billable API calls.
Steps:
- Run a generation pipeline on a project (e.g., 10 assets).
- Record: number of external API calls, total tokens/compute units, estimated cost.
- Run the exact same pipeline again (same project, same parameters).
- Record: number of additional external API calls.
Pass Criteria:
- Second run makes ZERO additional API calls (results cached/reused).
- OR second run is blocked ("Assets already generated. Regenerate?").
- OR second run costs less than 10% of first (only changed items regenerated).
Fail Criteria:
- Second run makes the same number of API calls as the first.
- No caching mechanism exists.
- User is not warned about cost of regeneration.
Cost Tracking Template:
| Run | API Calls | Tokens Used | Estimated Cost | New Assets | From Cache |
|-----|-----------|-------------|---------------|------------|------------|
| 1st | 10 | 15,000 | $0.45 | 10 | 0 |
| 2nd | 0 | 0 | $0.00 | 0 | 10 | <- PASS
| 2nd | 10 | 15,000 | $0.45 | 10 | 0 | <- FAIL
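The double-run comparison can be automated by routing calls through a stub provider that counts invocations. This is a sketch, not a reference implementation: `CountingProvider`, `run_pipeline`, and `double_run_report` are hypothetical names, and the cache is a plain in-memory dictionary keyed by prompt.

```python
from typing import Dict, List


class CountingProvider:
    """Hypothetical stand-in for a paid API; it only counts invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def generate(self, prompt: str) -> str:
        self.calls += 1
        return f"asset-for:{prompt}"


def run_pipeline(prompts: List[str], provider: CountingProvider,
                 cache: Dict[str, str]) -> List[str]:
    """Generate one asset per prompt, reusing cached results when present."""
    results = []
    for prompt in prompts:
        if prompt not in cache:  # check the cache BEFORE the paid call
            cache[prompt] = provider.generate(prompt)
        results.append(cache[prompt])
    return results


def double_run_report(prompts: List[str]) -> Dict[str, int]:
    """Run the same pipeline twice and report billable calls per run."""
    provider = CountingProvider()
    cache: Dict[str, str] = {}
    run_pipeline(prompts, provider, cache)
    first = provider.calls
    run_pipeline(prompts, provider, cache)
    second = provider.calls - first
    return {"first_run_calls": first, "second_run_calls": second}
```

A passing system reports zero calls for the second run, matching the PASS row of the template; a report where both numbers are equal corresponds to the FAIL row.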
TEST-CE-002: Retry Ceiling on External API Failures
Objective: Verify that failed external API calls have a bounded retry limit.
Steps:
- Simulate a persistent external API failure (mock 500 or timeout).
- Trigger a pipeline that calls this API.
- Observe retry behavior.
- Count total API calls made.
Pass Criteria:
- Maximum retry count is configured (e.g., 3 retries).
- Exponential backoff between retries.
- After max retries, job moves to "failed" state (not infinite retry).
- Total API calls = initial + max_retries (e.g., 4 total for 3 retries).
- Retry metrics are logged (attempt number, delay, error).
Fail Criteria:
- No retry limit; calls continue indefinitely.
- Immediate or fixed-interval retry (no increasing backoff); thundering herd on provider recovery.
- Retry count > 10 for a single item.
- Retries continue even after circuit breaker should have tripped.
Retry Configuration Audit:
| Operation | Max Retries | Backoff | Base Delay | Max Delay | Circuit Breaker |
|-----------|------------|---------|-----------|-----------|-----------------|
| AI generation | 3 | Exponential | 1s | 30s | After 5 failures in 1min |
| Storage upload | 3 | Exponential | 500ms | 10s | N/A |
| Webhook delivery | 5 | Exponential | 1s | 5min | N/A |
| Payment API | 2 | Exponential | 2s | 10s | After 3 failures in 1min |
| Email send | 3 | Linear | 5s | 15s | After 10 failures in 5min |
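One way to satisfy the retry pass criteria is a wrapper that caps the number of attempts and doubles the delay each time. The sketch below assumes the "AI generation" row of the table (3 retries, 1s base delay, 30s cap); `call_with_retry`, `backoff_delays`, and `RetriesExhausted` are hypothetical names, and jitter is omitted for brevity.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Raised when the call still fails after the final allowed attempt."""


def backoff_delays(max_retries: int, base_delay: float,
                   max_delay: float) -> List[float]:
    """Delay before each retry: base_delay * 2^n, capped at max_delay."""
    return [min(base_delay * (2 ** n), max_delay) for n in range(max_retries)]


def call_with_retry(fn: Callable[[], T], max_retries: int = 3,
                    base_delay: float = 1.0, max_delay: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn at most max_retries + 1 times, then give up."""
    delays = backoff_delays(max_retries, base_delay, max_delay)
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:  # sketch: real code should retry only transient errors
            last_error = exc
            if attempt < max_retries:
                sleep(delays[attempt])  # no sleep after the final attempt
    raise RetriesExhausted(
        f"gave up after {max_retries + 1} attempts") from last_error
```

The `sleep` parameter is injectable so a test can record the delays instead of waiting. Against a persistently failing call, this makes exactly 4 calls for 3 retries, which is the count the audit expects to observe.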
TEST-CE-003: Cache Effectiveness for API Results
Objective: Verify that cacheable API results are actually cached and reused.
Steps:
- Make an API call with specific parameters. Record the result.
- Make the same call with the same parameters.
- Verify: was the cached result used, or was a new API call made?
Cache Verification Points:
[ ] Generation results cached by (input_hash, model, parameters)
[ ] Cache checked BEFORE making external API call
[ ] Cache hit rate logged and monitored
[ ] Cache invalidation works (changed parameters = new call)
[ ] Cache has TTL appropriate for content type
[ ] Cache storage cost is less than API re-call cost
Cache Hit Rate Targets:
| Content Type | Target Cache Hit Rate | Cache TTL |
|-------------|----------------------|-----------|
| Image generation (same prompt) | > 95% | 30 days |
| Text generation (same prompt) | > 90% | 7 days |
| Search/lookup results | > 80% | 1 hour |
| User profile/permissions | > 95% | 5 minutes |
| Configuration/settings | > 99% | 1 minute |
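The first, second, and fourth verification points (keying by input hash, model, and parameters; checking the cache before the external call; invalidating on changed parameters) can be illustrated as follows. This is a sketch with assumed names: `cache_key` and `CachedGenerator` are hypothetical, the store is an in-memory dictionary, and TTL handling is left out.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def cache_key(prompt: str, model: str, parameters: Dict[str, Any]) -> str:
    """Stable key derived from (input_hash, model, parameters)."""
    payload = json.dumps(
        {
            "input_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
            "model": model,
            "parameters": parameters,
        },
        sort_keys=True,  # parameter order must not change the key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedGenerator:
    """Wraps a paid generate function and tracks cache hits and misses."""

    def __init__(self, generate: Callable[[str, str, Dict[str, Any]], str]) -> None:
        self._generate = generate
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, prompt: str, model: str, parameters: Dict[str, Any]) -> str:
        key = cache_key(prompt, model, parameters)
        if key in self._store:  # cache checked BEFORE the external call
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._generate(prompt, model, parameters)
        self._store[key] = result
        return result

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Serializing with `sort_keys=True` keeps two requests with the same parameters in a different order from producing different keys, which would otherwise show up as a silently low hit rate.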
TEST-CE-004: Preview vs Production API Usage
Objective: Verify that preview/draft operations do not invoke paid APIs.
Steps:
- Navigate through all preview/draft features in the UI.
- Monitor external API calls during preview actions.
- Compare with production/generate actions.
Pass Criteria:
- Previews use free/cheap alternatives (thumbnails, placeholders, cached results).
- Preview actions are clearly distinct from generate actions in code.
- No paid API call is triggered without explicit user confirmation.
- Draft saves do not trigger generation pipelines.
Fail Criteria:
- Loading a preview page triggers a paid API call.
- Typing in a prompt field triggers live API calls (without debounce/explicit submit).
- Auto-save triggers regeneration.
TEST-CE-005: Storage Growth Analysis
Objective: Verify that storage usage is bounded and orphaned files are cleaned up.
Steps:
- Record current storage usage.
- Create and delete 10 projects with assets.
- Record storage usage after deletion.
- Wait for any cleanup jobs to run.
- Record final storage usage.
Pass Criteria:
- Storage after create+delete is within 10% of starting usage.
- Orphaned files are cleaned up by automated process.
- Cleanup job runs on a schedule (daily or weekly).
- Temporary files (uploads, processing intermediates) are cleaned up within 24 hours.
- Storage growth rate is tracked and alerted on anomalies.
Storage Audit Checklist:
[ ] Every file creation has a corresponding deletion path
[ ] Temporary files have TTL or cleanup job
[ ] Failed upload files are cleaned up
[ ] Replaced assets have old versions cleaned up (or explicitly versioned)
[ ] Deleted entity assets are cleaned up (cascade delete or cleanup job)
[ ] Storage usage is monitored per tenant/project
[ ] Storage quotas exist per tenant (prevent single user from consuming TB)
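Orphan detection for the checklist above amounts to a set difference between what the bucket holds and what the database still references. A minimal sketch, assuming both listings are already available in memory; `find_orphans` and the example keys are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def find_orphans(stored: Dict[str, int],
                 referenced: Iterable[str]) -> Tuple[Set[str], int]:
    """Return (orphaned object keys, total orphaned bytes).

    stored:     object key -> size in bytes, from a bucket listing
    referenced: object keys that live database rows still point to
    """
    referenced_keys = set(referenced)
    orphans = {key for key in stored if key not in referenced_keys}
    wasted_bytes = sum(stored[key] for key in orphans)
    return orphans, wasted_bytes
```

Comparing by key rather than by total size also identifies which files to delete, and it is not thrown off when the database records a file size that differs from what is actually stored.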
TEST-CE-006: Autoscale Ceiling Verification
Objective: Verify that autoscaling has maximum limits to prevent cost explosion.
Steps:
- Check autoscale configuration for all compute resources.
- Verify maximum instance/replica count is configured.
- Calculate worst-case cost at maximum scale.
- Verify this cost is acceptable.
Autoscale Audit:
| Resource | Min Instances | Max Instances | Scale Metric | Cost at Max | Acceptable? |
|----------|-------------|-------------|-------------|------------|-------------|
| API server | 1 | 10 | CPU > 70% | $X/hour | [ ] Yes [ ] No |
| Workers | 1 | 5 | Queue depth > 50 | $X/hour | [ ] Yes [ ] No |
| Database | 1 | 1 (fixed) | N/A | $X/month | [ ] Yes [ ] No |
Pass Criteria:
- Every autoscaling resource has a maximum limit.
- Maximum scale cost is known and documented.
- Alerts fire before reaching maximum scale.
- Scale-down is configured (not just scale-up).
- Billing alerts configured at 50%, 80%, 100% of budget.
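The "Cost at Max" column and the 50%/80%/100% billing thresholds are simple arithmetic that can be scripted. The sketch below uses hypothetical function names, and the 730 hours-per-month figure is an averaging assumption, not a billing rule of any provider.

```python
from typing import Optional, Sequence

HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def max_scale_cost(max_instances: int, price_per_instance_hour: float,
                   hours: float = HOURS_PER_MONTH) -> float:
    """Spend if a resource sits at its autoscale ceiling for `hours`."""
    return max_instances * price_per_instance_hour * hours


def budget_alert_level(spend: float, budget: float,
                       thresholds: Sequence[int] = (50, 80, 100)) -> Optional[int]:
    """Return the highest percentage threshold that spend has reached, if any."""
    reached = [pct for pct in thresholds if spend * 100 >= pct * budget]
    return max(reached) if reached else None
```

For example, an API server capped at 10 instances at a hypothetical $0.10 per instance-hour costs about $730 if it stays pinned at the ceiling for a month; that is the number to record and judge as acceptable or not.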
TEST-CE-007: Per-User / Per-Tenant Cost Isolation
Objective: Verify that one user's activity cannot cause unbounded costs.
Steps:
- Check: are there per-user rate limits on expensive operations?
- Check: are there per-user quotas on resource consumption?
- Simulate: one user triggers 100 generations in rapid succession.
Pass Criteria:
- Rate limits exist on all paid API triggers (e.g., 10 generations/hour/user).
- Quotas exist on storage (e.g., 10GB per project).
- Exceeding limits returns clear message (not silent failure or cost pass-through).
- Admin can view per-user resource consumption.
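A per-user limit such as "10 generations/hour/user" can be enforced with a sliding-window counter. This is an illustrative sketch: `SlidingWindowLimiter` is a hypothetical class, state is held in process memory, and the caller supplies the timestamp so the behavior is easy to test.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` events per user within any `window_seconds` span."""

    def __init__(self, limit: int = 10, window_seconds: float = 3600.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str, now: float) -> bool:
        """Return True and record the event if the user is under the limit."""
        events = self._events[user_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False  # caller should return a clear "limit reached" message
        events.append(now)
        return True
```

Simulating one user firing 100 generations in rapid succession should yield exactly 10 allowed calls, while other users remain unaffected. A multi-instance deployment would need shared state (for example a database or cache service) instead of a local dictionary.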
Worst-Case Cost Estimation Methodology
For every paid operation, calculate:
Worst Case = (max_retries + 1) * cost_per_call * max_concurrent_operations * max_items_per_operation
Example: AI Image Generation
cost_per_call = $0.04 (DALL-E 3)
max_retries = 3
max_concurrent_operations = 5 (users)
max_items_per_operation = 100 (assets per project)
Worst case per incident = (3+1) * $0.04 * 5 * 100 = $80
With infinite retry bug:
Worst case = UNBOUNDED (retries forever until provider blocks or budget exhausted)
Cost Estimation Template:
| Operation | Per-Call Cost | Max Retries | Max Concurrent | Max Items | Normal Cost | Worst Case | With Bug |
|-----------|-------------|------------|---------------|-----------|-------------|------------|----------|
| AI generation | $0.04 | 3 | 5 | 100 | $20 | $80 | UNBOUNDED |
| Storage upload | $0.0001 | 3 | 10 | 200 | $0.20 | $0.80 | ~$1 |
| Email send | $0.001 | 3 | 50 | 1 | $0.05 | $0.20 | ~$50/day |
| SMS notify | $0.05 | 2 | 10 | 1 | $0.50 | $1.50 | ~$500/day |
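The worst-case formula can be checked mechanically against the template rows. The sketch below uses `Decimal` to avoid floating-point rounding on currency values; `normal_cost` and `worst_case_cost` are hypothetical helper names, and "normal cost" is taken to mean the same product with zero retries.

```python
from decimal import Decimal


def normal_cost(cost_per_call: Decimal, max_concurrent: int,
                max_items: int) -> Decimal:
    """Cost when every call succeeds on the first attempt."""
    return cost_per_call * max_concurrent * max_items


def worst_case_cost(cost_per_call: Decimal, max_retries: int,
                    max_concurrent: int, max_items: int) -> Decimal:
    """(max_retries + 1) * cost_per_call * max_concurrent * max_items."""
    return (max_retries + 1) * normal_cost(cost_per_call, max_concurrent, max_items)


# AI generation row: $0.04 per call, 3 retries, 5 concurrent, 100 items
ai_normal = normal_cost(Decimal("0.04"), 5, 100)        # 20 dollars
ai_worst = worst_case_cost(Decimal("0.04"), 3, 5, 100)  # 80 dollars
```

The formula is only meaningful when `max_retries` is actually enforced; with an unbounded retry loop there is no finite value to plug in, which is why the "With Bug" column for AI generation reads UNBOUNDED.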
Circuit Breaker Implementation
When an external service fails repeatedly, stop calling it:
States:
CLOSED (normal): Calls go through. Track failure count.
OPEN (tripped): All calls immediately fail without making external call.
HALF-OPEN (testing): Allow one test call to check if service recovered.
Configuration:
failure_threshold: 5 failures in 60 seconds -> OPEN
open_duration: 30 seconds -> HALF-OPEN
success_threshold: 2 successes in HALF-OPEN -> CLOSED
Benefits:
- Prevents cost accumulation during outage
- Reduces load on struggling external service
- Fast failure for users (no waiting for timeout)
- Automatic recovery when service returns
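The three states and the configuration above map onto a small state machine. The following is a simplified single-threaded sketch, not a production implementation: `CircuitBreaker` and its method names are hypothetical, timestamps are passed in by the caller, and trial calls in HALF-OPEN are assumed to be made one at a time.

```python
from collections import deque
from typing import Deque


class CircuitBreaker:
    """CLOSED -> OPEN after repeated failures; OPEN -> HALF-OPEN after a cooldown."""

    def __init__(self, failure_threshold: int = 5, failure_window: float = 60.0,
                 open_duration: float = 30.0, success_threshold: int = 2) -> None:
        self.failure_threshold = failure_threshold
        self.failure_window = failure_window
        self.open_duration = open_duration
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self._failures: Deque[float] = deque()
        self._opened_at = 0.0
        self._half_open_successes = 0

    def allow_request(self, now: float) -> bool:
        """Return False while OPEN so no external (billable) call is made."""
        if self.state == "OPEN":
            if now - self._opened_at >= self.open_duration:
                self.state = "HALF-OPEN"  # cooldown over: permit a trial call
                self._half_open_successes = 0
                return True
            return False
        return True

    def record_success(self, now: float) -> None:
        if self.state == "HALF-OPEN":
            self._half_open_successes += 1
            if self._half_open_successes >= self.success_threshold:
                self.state = "CLOSED"
                self._failures.clear()

    def record_failure(self, now: float) -> None:
        if self.state == "HALF-OPEN":
            self._trip(now)  # trial call failed: reopen immediately
            return
        self._failures.append(now)
        # Keep only failures inside the rolling window.
        while self._failures and now - self._failures[0] > self.failure_window:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = "OPEN"
        self._opened_at = now
        self._failures.clear()
```

Callers check `allow_request` before every external call and report the outcome with `record_success` or `record_failure`. While the breaker is OPEN the call is skipped entirely, which is what stops cost from accumulating during an outage.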
Billing Alert Template
alerts:
  - name: "daily_api_cost_warning"
    condition: "daily_api_spend > $50"
    severity: "warning"
    action: "notify_team_channel"
  - name: "daily_api_cost_critical"
    condition: "daily_api_spend > $200"
    severity: "critical"
    action: "page_oncall + disable_generation"
  - name: "per_user_spend_anomaly"
    condition: "user_daily_spend > 10x user_average"
    severity: "warning"
    action: "rate_limit_user + notify_team"
  - name: "retry_storm_detected"
    condition: "retry_count > 100 in 5 minutes"
    severity: "critical"
    action: "trip_circuit_breaker + page_oncall"
  - name: "storage_growth_anomaly"
    condition: "daily_storage_delta > 10GB"
    severity: "warning"
    action: "notify_team"
Post-Audit Checklist
[ ] All external API calls have retry limits (max 3-5)
[ ] Exponential backoff configured on all retries
[ ] Circuit breaker implemented for external services
[ ] Results cached where possible (cache before API call)
[ ] Preview/draft operations do not trigger paid APIs
[ ] Storage cleanup automated for orphaned/temporary files
[ ] Autoscale has maximum limits on all resources
[ ] Per-user rate limits on expensive operations
[ ] Billing alerts configured at 50%, 80%, 100% of budget
[ ] Worst-case cost estimated and documented for each operation
[ ] Duplicate pipeline runs are blocked or use cached results
[ ] Batch retry targets only failed items (not full batch)
[ ] Log verbosity is appropriate for production (not DEBUG)
[ ] Storage quotas exist per tenant
[ ] Cost per operation is logged and aggregable
What Earlier Audits Miss
Standard testing verifies functionality and performance. This audit matters because:
- Unit tests mock external APIs. They never discover that a retry bug calls the real API 1,000 times.
- Integration tests run against sandboxes with unlimited quotas. They never catch missing rate limits.
- Performance tests measure speed, not cost. A fast system that makes 10x unnecessary API calls looks healthy in load tests.
- Code reviews check that retry logic exists but rarely verify that retries have a ceiling and exponential backoff.
- Monitoring dashboards track error rates and latency but rarely track cost-per-request or daily API spend.
A Cost Explosion Audit closes this gap: it specifically tests whether bugs, retries, and design flaws can create runaway spend under retry storms, duplicate operations, missing caches, and unbounded autoscaling.
Automation Opportunities
| Test | Automatable? | Method |
|---|---|---|
| TEST-CE-001: Double-run cost | YES | Run pipeline twice, count external API calls via mock/log |
| TEST-CE-002: Retry ceiling | YES | Mock persistent failure, count total retries, assert bounded |
| TEST-CE-003: Cache effectiveness | YES | Same request twice, assert second uses cache (no external call) |
| TEST-CE-004: Preview API usage | PARTIAL | Monitor API calls during preview interactions |
| TEST-CE-005: Storage growth | YES | Create + delete cycle, compare storage usage |
| TEST-CE-006: Autoscale ceiling | YES | Check infrastructure config for max instance limits |
| TEST-CE-007: Per-user cost isolation | YES | Rapid-fire expensive operations, assert rate limit |
# Automated retry ceiling verification
# Mock external API to always return 500
# Then trigger a pipeline and count total calls
CALL_COUNT=$(grep -c "POST /v1/images/generate" /var/log/external_calls.log)
MAX_EXPECTED=4 # 1 initial + 3 retries
[ "$CALL_COUNT" -le "$MAX_EXPECTED" ] && echo "PASS: $CALL_COUNT calls" || echo "FAIL: $CALL_COUNT calls (max $MAX_EXPECTED)"
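# Automated storage orphan cost check
# Total bytes stored in the bucket (referenced and orphaned files together)
STORED_BYTES=$(gsutil du -s gs://bucket/assets/ | awk '{print $1}')
# Bytes the database still references (assets not marked deleted)
REFERENCED_BYTES=$(psql -t -A -c "SELECT COALESCE(SUM(file_size_bytes), 0) FROM assets WHERE deleted_at IS NULL")
# The difference approximates storage held by orphaned files
WASTE=$((STORED_BYTES - REFERENCED_BYTES))
echo "Potential storage waste: $((WASTE / 1024 / 1024)) MB"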
Reusable Audit Report Template
# Cost Explosion Audit Report
## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________
## Current Monthly Spend: $___
## Cost Vectors Identified
| Vector | Monthly Cost | Growth Rate | Safeguards | Risk |
|--------|-------------|-------------|------------|------|
| AI API calls | $___ | ___/month | rate limit/cache/circuit breaker | |
| Storage | $___ | ___GB/month | cleanup job/quotas | |
| Compute | $___ | N/A | autoscale ceiling | |
## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-CE-001 | Double-run cost | PASS/FAIL | Second run API calls: ___ |
| TEST-CE-002 | Retry ceiling | PASS/FAIL | Max retries observed: ___ |
| TEST-CE-003 | Cache effectiveness | PASS/FAIL | Cache hit rate: ___% |
| TEST-CE-004 | Preview API usage | PASS/FAIL | Paid calls during preview: ___ |
| TEST-CE-005 | Storage growth | PASS/FAIL | Orphaned storage: ___ MB |
| TEST-CE-006 | Autoscale ceiling | PASS/FAIL | Max instances configured: ___ |
| TEST-CE-007 | Per-user cost isolation | PASS/FAIL | Rate limit enforced: yes/no |
## Worst-Case Cost Scenarios
| Scenario | Estimated Cost | Safeguard | Verdict |
|----------|---------------|-----------|---------|
| Infinite retry on AI API | $___ | Circuit breaker: yes/no | |
| All users generate simultaneously | $___ | Rate limit: yes/no | |
| Storage never cleaned up (1 year) | $___ | Cleanup job: yes/no | |
## Score: PASS / PARTIAL / FAIL
Priority Targeting
Run this audit FIRST if:
- The system calls paid AI APIs (OpenAI, Anthropic, etc.)
- Cloud bill has increased unexpectedly
- Users can trigger expensive operations without confirmation
- No rate limits exist on generation endpoints
- Retry logic was recently modified
- Storage costs are growing faster than user count
- No billing alerts are configured