Throughput & Scale Audit
Verify that the system behaves correctly and remains responsive as work volume increases. This audit catches the class of bugs that only appear at real-world scale: slow queries, memory bloat, queue congestion, UI freezes, and background job starvation. ## Key Points - 500+ assets (images, videos, documents, or domain-equivalent records) - 25+ locations / categories / parent entities - 50+ scenes / child entities per parent - 20+ projects with varying sizes (1 asset to 200 assets) - 5+ user accounts with different roles - 10,000+ log/event records - 100+ queued/completed jobs in history 1. Load the main listing page with 10 records. Count DB queries (via query log or ORM debug). 2. Load the same page with 500 records. Count DB queries. 3. Compare. - [ ] Query count at 500 records is within 2x of query count at 10 records. - [ ] Absolute query count per page load is under 20. ## Quick Example ``` 10 records: 12 queries (1 list + 1 per record for author + 1 per record for thumbnail) 500 records: 1002 queries --> FAIL: Classic N+1 on author and thumbnail relations ``` ``` [ ] Separate queues for bulk vs interactive jobs [ ] Worker concurrency limits per job type [ ] Priority levels implemented (at least: critical, normal, bulk) [ ] Dead letter queue for repeatedly failing jobs [ ] Backpressure mechanism (reject new jobs when queue > threshold) ```
skilldb get production-audit-skills/throughput-scale-auditFull skill: 485 linesThroughput & Scale Audit
Purpose
Verify that the system behaves correctly and remains responsive as work volume increases. This audit catches the class of bugs that only appear at real-world scale: slow queries, memory bloat, queue congestion, UI freezes, and background job starvation.
Most systems work fine with 5 records. This audit tests what happens with 500.
Scope
| Layer | What We Test |
|---|---|
| Database | Query count scaling, index coverage, N+1 patterns, connection pool exhaustion |
| API | Response time degradation, payload size growth, pagination correctness |
| Queue / Jobs | Throughput ceiling, job starvation, priority inversion, backpressure |
| UI | Render time on large datasets, memory consumption, virtual scrolling |
| Storage | Upload/download throughput, listing latency on large buckets |
| Concurrency | Behavior under 5+ simultaneous users or jobs |
Risk Pattern Table
| Pattern | What It Hits | Risk | Symptom |
|---|---|---|---|
| N+1 queries | DB, API | HIGH | Page load scales linearly with record count; 500 assets = 500 queries |
| Unbounded SELECT | DB, API, UI | HIGH | Loading "all projects" fetches every record; OOM at scale |
| Sequential external calls | API, Jobs | HIGH | 50 scenes processed one-at-a-time; total time = N * per-call latency |
| Missing DB indexes | DB | HIGH | Queries degrade from 5ms to 5s as table grows past 10k rows |
| No pagination | API, UI | MEDIUM | API returns unbounded arrays; UI attempts to render thousands of DOM nodes |
| Connection pool exhaustion | DB | HIGH | Under concurrent load, requests queue waiting for DB connections; timeouts cascade |
| Memory-heavy aggregation | API, Jobs | MEDIUM | Aggregating large datasets in application memory instead of DB; OOM on workers |
| Queue single-consumer | Jobs | MEDIUM | One worker processes all job types; long jobs block short ones |
| Missing query result limits | DB | HIGH | Admin endpoints or internal queries lack LIMIT; accidental full-table scans |
| UI re-render storms | UI | MEDIUM | State changes trigger full list re-render; 500-item list causes frame drops |
| Payload serialization bloat | API | MEDIUM | Nested relations serialized redundantly; 1MB+ JSON responses |
| Background job starvation | Jobs | HIGH | Bulk operations consume all workers; user-triggered jobs wait indefinitely |
Pre-Audit Data Setup
Before running tests, seed the environment with realistic scale data:
Minimum test dataset:
- 500+ assets (images, videos, documents, or domain-equivalent records)
- 25+ locations / categories / parent entities
- 50+ scenes / child entities per parent
- 20+ projects with varying sizes (1 asset to 200 assets)
- 5+ user accounts with different roles
- 10,000+ log/event records
- 100+ queued/completed jobs in history
Seed Script Template
# Pseudocode for seeding scale data
for i in $(seq 1 25); do
location_id=$(create_location "Location $i")
for j in $(seq 1 50); do
scene_id=$(create_scene $location_id "Scene $j")
for k in $(seq 1 10); do
create_asset $scene_id "Asset $k" --type=random
done
done
done
# Result: 25 locations * 50 scenes * 10 assets = 12,500 assets
Concrete Test Cases
TEST-TP-001: Page Load Query Count Scaling
Objective: Verify query count does not scale linearly with record count.
Steps:
- Load the main listing page with 10 records. Count DB queries (via query log or ORM debug).
- Load the same page with 500 records. Count DB queries.
- Compare.
Pass Criteria:
- Query count at 500 records is within 2x of query count at 10 records.
- Absolute query count per page load is under 20.
- No query appears more than once with only the ID parameter changing (N+1 signal).
- Page load completes under 2s with 500 assets.
- No single query takes more than 100ms.
Fail Example:
10 records: 12 queries (1 list + 1 per record for author + 1 per record for thumbnail)
500 records: 1002 queries
--> FAIL: Classic N+1 on author and thumbnail relations
Fix Pattern:
-- Before (N+1):
SELECT * FROM assets WHERE project_id = ?;
-- Then for each asset:
SELECT * FROM users WHERE id = ?;
-- After (eager load):
SELECT * FROM assets WHERE project_id = ?;
SELECT * FROM users WHERE id IN (?, ?, ?, ...);
TEST-TP-002: API Response Time Under Load
Objective: Verify API endpoints remain responsive at scale.
Steps:
- Identify the 10 most-used API endpoints.
- Measure p50, p95, p99 response times with 10 records.
- Measure again with 500+ records.
- Measure again with 5 concurrent users.
Benchmarks:
| Endpoint Type | p50 Target | p95 Target | p99 Target |
|---|---|---|---|
| List (paginated) | < 200ms | < 500ms | < 1000ms |
| Single record | < 100ms | < 200ms | < 500ms |
| Create/Update | < 300ms | < 500ms | < 1000ms |
| Search | < 300ms | < 700ms | < 1500ms |
| Dashboard/Aggregate | < 500ms | < 1000ms | < 2000ms |
Pass Criteria:
- No endpoint p50 exceeds its target in the table above.
- Response times at 500 records are within 2x of baseline at 10 records.
- No p99 exceeds 3 seconds.
- No endpoint times out (> 30s) under 5 concurrent users.
- Query count stays under 50 for any project size.
Fail Criteria:
- Any endpoint exceeds 2x its baseline when data volume increases 50x.
- Any p99 exceeds 3 seconds.
- Any endpoint times out (> 30s) under concurrent load.
TEST-TP-003: UI Responsiveness on Large Datasets
Objective: Verify the UI remains interactive when rendering large lists.
Steps:
- Navigate to a page that lists assets/records.
- Load it with 500+ items (or the maximum a user could realistically have).
- Measure: initial render time, scroll smoothness (FPS), memory consumption.
- Interact: filter, sort, select-all, bulk action.
Benchmarks:
| Metric | Target | Fail Threshold |
|---|---|---|
| Initial render (TTI) | < 2s | > 5s |
| Scroll FPS | > 30fps | < 15fps |
| Memory (tab) | < 200MB | > 500MB |
| Filter response | < 500ms | > 2s |
| Select-all | < 200ms | > 1s |
What to Check:
- Is virtual scrolling / windowing used for lists > 100 items?
- Are images lazy-loaded?
- Does select-all trigger individual re-renders per item?
- Does filtering happen client-side or server-side? (Client-side with 500+ records = potential freeze)
TEST-TP-004: Queue Congestion Under Bulk Operations
Objective: Verify that bulk operations do not starve other job types.
Steps:
- Trigger a bulk operation that creates 50+ jobs (e.g., "generate all assets for project").
- Immediately trigger a single high-priority job (e.g., user-initiated single generation).
- Measure how long the single job waits before starting.
Pass Criteria:
- Single job starts within 30 seconds regardless of queue depth.
- Bulk jobs do not consume 100% of available workers.
- Queue depth is observable (logged, metriced, or visible in admin).
- Priority queue processes interactive jobs before bulk jobs.
- Dead letter queue catches jobs that fail after max retries.
Fail Criteria:
- Single job waits > 2 minutes because bulk jobs hold all workers.
- No priority mechanism exists.
- Queue depth is invisible (no logging, no metrics).
Architecture Check:
[ ] Separate queues for bulk vs interactive jobs
[ ] Worker concurrency limits per job type
[ ] Priority levels implemented (at least: critical, normal, bulk)
[ ] Dead letter queue for repeatedly failing jobs
[ ] Backpressure mechanism (reject new jobs when queue > threshold)
TEST-TP-005: Database Connection Pool Under Concurrent Load
Objective: Verify the connection pool handles concurrent requests without exhaustion.
Steps:
- Determine the configured pool size (e.g.,
max_connections = 20). - Send concurrent requests equal to pool size + 10.
- Measure: do excess requests queue gracefully or error?
- Send sustained concurrent load for 60 seconds.
Pass Criteria:
- No "connection pool exhausted" errors under 5 concurrent users.
- Queued requests complete within 5 seconds.
- No leaked connections after load subsides (pool returns to baseline).
- Connection pool utilization stays under 80% during normal load.
- Pool size is configured explicitly (not using library default).
Fail Criteria:
- Errors at concurrent load below 2x pool size.
- Connections leak (pool size grows monotonically under load).
- Long-running queries hold connections, blocking short queries.
TEST-TP-006: Sequential External Call Bottleneck
Objective: Verify that external API calls (providers, storage, auth) are parallelized where possible.
Steps:
- Identify all external calls in a typical workflow (e.g., asset generation pipeline).
- Trace a single workflow execution. Record: call sequence, timing, dependencies.
- Identify calls that could run in parallel but execute sequentially.
Pass Criteria:
- Independent external calls execute concurrently.
- Total workflow time is closer to max(call_times) than sum(call_times).
- Timeout per external call is configured (not relying on global timeout).
- Each external call has individual error handling (one failure does not block others).
- Fire-and-forget calls (webhooks, notifications) do not block the main workflow.
Example Analysis:
Sequential (bad):
upload_to_storage: 2s
call_ai_provider: 8s
update_metadata: 0.5s
notify_webhook: 1s
Total: 11.5s
Parallel where possible (good):
call_ai_provider: 8s (blocking - needs result)
upload_to_storage: 2s (parallel with metadata update)
update_metadata: 0.5s (parallel with storage)
notify_webhook: 1s (fire-and-forget)
Total: ~9s
TEST-TP-007: Memory Consumption Under Scale
Objective: Verify that memory usage remains bounded as data volume grows.
Steps:
- Record baseline memory of API server / worker process.
- Process a batch of 100 items. Record peak memory.
- Process a batch of 500 items. Record peak memory.
- After processing, verify memory returns to near-baseline (no leak).
Thresholds:
| Process Type | Baseline | Per-100 Items | Max Acceptable |
|---|---|---|---|
| API server | < 200MB | + < 50MB | 512MB |
| Background worker | < 150MB | + < 100MB | 1GB |
| Frontend (browser tab) | < 100MB | + < 30MB | 500MB |
Leak Detection:
Run the same operation 10 times sequentially.
Record memory after each run.
If memory increases monotonically without returning to baseline: LEAK.
Run 1: 180MB -> 210MB (post-GC: 185MB) OK
Run 2: 185MB -> 215MB (post-GC: 186MB) OK
Run 3: 186MB -> 220MB (post-GC: 200MB) SUSPECT
Run 4: 200MB -> 240MB (post-GC: 215MB) LEAK CONFIRMED
TEST-TP-008: Pagination Correctness at Scale
Objective: Verify pagination returns all records without duplicates or gaps.
Steps:
- Create exactly 523 records (prime number to catch off-by-one errors).
- Paginate through all pages (page_size=50).
- Collect all returned IDs.
- Verify: exactly 523 unique IDs, no duplicates, no missing records.
- Repeat while another user is creating new records simultaneously.
Pass Criteria:
- All 523 records appear exactly once across all pages.
- Total count matches actual count.
- Concurrent writes do not cause duplicates or gaps (cursor-based pagination preferred).
- Last page returns fewer items and
has_more = false. - Page size parameter is validated (min 1, max 100).
Fail Patterns:
- Offset-based pagination with concurrent inserts: items shift, causing duplicates/gaps.
- Last page returns 0 items but
has_more = true. - Total count is cached and stale.
Scaling Expectation Matrix
| Metric | 10 Records | 100 Records | 500 Records | Expected Scaling |
|---|---|---|---|---|
| Query count | N | ~N | ~N | O(1) - constant |
| Response time | T | ~T | ~1.5T | O(1) or O(log n) |
| Memory | M | ~1.2M | ~1.5M | O(1) with pagination |
| Payload size | S | ~S | ~S | O(1) with pagination |
| Job throughput | J/min | J/min | J/min | O(1) - worker-bound |
If any metric scales as O(n) or worse, the system has a scalability defect.
Reusable Load Test Template
# k6 / artillery / custom script configuration
scenarios:
baseline:
executor: "constant-vus"
vus: 1
duration: "30s"
concurrent_users:
executor: "constant-vus"
vus: 5
duration: "60s"
ramp_up:
executor: "ramping-vus"
stages:
- duration: "30s"
target: 1
- duration: "60s"
target: 10
- duration: "30s"
target: 0
thresholds:
http_req_duration:
- "p(50) < 200"
- "p(95) < 500"
- "p(99) < 1000"
http_req_failed:
- "rate < 0.01" # Less than 1% error rate
Post-Audit Checklist
[ ] All N+1 queries identified and fixed or documented
[ ] All unbounded queries have LIMIT clauses
[ ] Pagination is cursor-based for concurrent-write scenarios
[ ] Virtual scrolling implemented for lists > 100 items
[ ] Background jobs have priority levels
[ ] Queue depth is monitored and alerted
[ ] Connection pool is sized appropriately for expected concurrency
[ ] Memory usage is bounded and tested under load
[ ] External calls are parallelized where dependencies allow
[ ] Response time benchmarks documented for top 10 endpoints
What Earlier Audits Miss
Standard performance testing uses small datasets and single users. This audit matters because:
- Load tests without realistic data hit fast paths. N+1 queries only appear with real record counts.
- Unit test coverage does not catch query count scaling. A test with 3 records passes; production has 3,000.
- Frontend profiling in dev uses empty states. Real users have 500-item lists, full dashboards, and heavy DOM trees.
- API benchmarks without concurrency miss connection pool exhaustion and lock contention.
- Generic performance budgets (e.g., "pages load under 3s") do not catch that a page loads in 200ms with 10 records but 12s with 500.
This would be called a Throughput & Scale Audit -- specifically testing whether response times, query counts, and memory usage remain bounded under realistic data volumes and concurrent load.
Automation Opportunities
| Test | Automatable? | Method |
|---|---|---|
| TEST-TP-001: Query count scaling | YES | CI test: seed N records, assert query count < threshold |
| TEST-TP-002: API response time | YES | k6/artillery load test in CI pipeline |
| TEST-TP-003: UI responsiveness | PARTIAL | Lighthouse CI for TTI; manual for scroll FPS |
| TEST-TP-004: Queue congestion | PARTIAL | Script to enqueue bulk + single; assert single starts within threshold |
| TEST-TP-005: Connection pool | YES | Concurrent HTTP requests via k6; monitor pool metrics |
| TEST-TP-006: Sequential calls | MANUAL | Requires trace analysis of workflow execution |
| TEST-TP-007: Memory consumption | YES | Process memory monitoring script during batch operations |
| TEST-TP-008: Pagination correctness | YES | Script to paginate and collect all IDs; assert uniqueness and completeness |
# Automated N+1 detection (Rails example)
# Add to CI test suite:
test "listing page query count does not scale with record count" do
create_list(:asset, 200)
query_count = count_queries { get "/api/projects/1/assets" }
assert query_count < 20, "Expected < 20 queries, got #{query_count}"
end
# k6 load test script for CI
# k6 run --out json=results.json load_test.js
# Then: jq '.metrics.http_req_duration.values.p95' results.json
Reusable Audit Report Template
# Throughput & Scale Audit Report
## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________
## Dataset Size: ___ records seeded
## Summary
[1-3 sentences: overall scaling behavior and critical findings]
## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-TP-001 | Query count scaling | PASS/FAIL | 10 records: __ queries, 500 records: __ queries |
| TEST-TP-002 | API response time | PASS/FAIL | p95 at 500 records: __ms |
| TEST-TP-003 | UI responsiveness | PASS/FAIL | TTI at 500 items: __s, scroll FPS: __ |
| TEST-TP-004 | Queue congestion | PASS/FAIL | Single job wait time during bulk: __s |
| TEST-TP-005 | Connection pool | PASS/FAIL | Errors at __ concurrent requests |
| TEST-TP-006 | Sequential calls | PASS/FAIL | Workflow time: __s (sequential) vs __s (parallel) |
| TEST-TP-007 | Memory consumption | PASS/FAIL | Peak at 500 items: __MB, post-GC: __MB |
| TEST-TP-008 | Pagination correctness | PASS/FAIL | 523 records: __ unique IDs returned |
## Critical Findings
1. [Finding with severity, evidence, and fix recommendation]
## Score: PASS / PARTIAL / FAIL
Priority Targeting
Run this audit FIRST if:
- The system has recently added bulk operations
- Users report "it was fast at first but now it's slow"
- The dataset has grown 10x since last performance review
- New listing/dashboard pages have been added without pagination
- Background jobs are timing out under normal load
Install this skill directly: skilldb add production-audit-skills