Concurrency & Race Condition Audit
Verify that the system behaves correctly when multiple operations happen simultaneously. Race conditions are among the hardest bugs to detect in testing and the most damaging in production. They cause data corruption, duplicate records, lost updates, and security breaches.
## Key Points
Double-click / double-submit (TEST-RC-001):
1. Navigate to a page with a "Generate" or "Submit" button.
2. Click the button twice in rapid succession (< 200ms apart).
3. Alternatively, use the API to send two identical POST requests simultaneously (see the Quick Example).
4. Inspect results.

Pass criteria:
- [ ] Only one job/record is created.
- [ ] The second request receives an already-in-progress response or the same job ID.
- [ ] The UI button is disabled after the first click (optimistic guard).
- [ ] No duplicate entries in any table.

Two-tab editing (TEST-RC-002):
1. Open a resource (project, document, settings) in Tab A.
2. Open the same resource in Tab B.
3. In Tab A, change field X and save.
4. In Tab B (still showing old data), change field Y and save.
5. Reload the resource and confirm that neither change was silently lost.
## Quick Example
```bash
curl -X POST /api/generate -d '{"project_id": "123"}' &
curl -X POST /api/generate -d '{"project_id": "123"}' &
wait
```
```
[ ] UI: Button disabled on click, re-enabled on response/timeout
[ ] API: Idempotency key required on create endpoints
[ ] API: Mutex check before job creation (SELECT FOR UPDATE or equivalent)
[ ] DB: Unique constraint on (entity_id, operation_type, status='active')
```
## Purpose
This audit systematically tests every path where two operations can collide.
## Scope
| Concurrency Scenario | What We Test |
|---|---|
| Two users editing same resource | Last-write-wins, merge conflicts, data loss |
| Two jobs processing same item | Duplicate output, corrupted state |
| Double-click / double-submit | Duplicate records, double billing |
| Two tabs, same user | Session conflicts, stale data overwrites |
| Delayed job completing after newer one | Out-of-order writes, stale data surfacing |
| Bulk operation + individual edit | Conflicting writes, partial visibility |
| Read-modify-write cycles | Lost updates, inconsistent derived state |
## Risk Pattern Table
| Pattern | What It Hits | Risk | Symptom |
|---|---|---|---|
| Read-modify-write without lock | Data | CRITICAL | Two users read version 1, both write "version 2", one update lost |
| Double-submit without dedupe | API, Data | HIGH | Two records created, double charge, double email |
| Last-write-wins without conflict detection | Data | HIGH | User A's changes silently overwritten by User B |
| Out-of-order job completion | Data, UI | MEDIUM | Old result overwrites newer result; stale data displayed |
| Non-atomic compound operation | Data | HIGH | Two-step operation (create + link) interrupted between steps |
| Cache not invalidated on write | Data, UI | MEDIUM | User sees stale data after another user's update |
| Shared mutable state in workers | Jobs | HIGH | Two workers modify same in-memory structure; corruption |
| Transaction isolation too low | DB | HIGH | Dirty reads, phantom reads, non-repeatable reads |
| File write without locking | Storage | MEDIUM | Two processes write same file; corrupted output |
| Counter increment without atomic operation | Data | MEDIUM | Two increments: expected +2, actual +1 (lost update) |
## Concrete Test Cases
### TEST-RC-001: Double-Click Generate / Submit
Objective: Verify that rapidly clicking a trigger button does not create duplicate work.
Steps:
1. Navigate to a page with a "Generate" or "Submit" button.
2. Click the button twice in rapid succession (< 200ms apart).
3. Alternatively, use the API to send two identical POST requests simultaneously:
```bash
curl -X POST /api/generate -d '{"project_id": "123"}' &
curl -X POST /api/generate -d '{"project_id": "123"}' &
wait
```
4. Inspect results.
Pass Criteria:
- Only one job/record is created.
- Second request receives: already-in-progress response, or same job ID.
- UI button is disabled after first click (optimistic guard).
- No duplicate entries in any table.
Implementation Verification:
```
[ ] UI: Button disabled on click, re-enabled on response/timeout
[ ] API: Idempotency key required on create endpoints
[ ] API: Mutex check before job creation (SELECT FOR UPDATE or equivalent)
[ ] DB: Unique constraint on (entity_id, operation_type, status='active')
```
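The unique-constraint item in the checklist above can be sketched as follows. This is a minimal illustration, assuming SQLite stands in for the real database and that the `jobs` table, the `one_active_job` index, and the `create_job` helper are hypothetical names; a partial unique index is one way to express "at most one active job per entity and operation".

```python
import sqlite3
from typing import Tuple


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " entity_id TEXT NOT NULL,"
        " operation_type TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    # Partial unique index: at most one 'active' row per (entity, operation).
    conn.execute(
        "CREATE UNIQUE INDEX one_active_job"
        " ON jobs (entity_id, operation_type) WHERE status = 'active'"
    )
    return conn


def create_job(conn: sqlite3.Connection, entity_id: str,
               operation_type: str) -> Tuple[int, bool]:
    """Return (job_id, created). A duplicate submit gets the existing job's ID."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO jobs (entity_id, operation_type, status)"
                " VALUES (?, ?, 'active')",
                (entity_id, operation_type),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The database rejected the duplicate; hand back the job already running.
        row = conn.execute(
            "SELECT id FROM jobs"
            " WHERE entity_id = ? AND operation_type = ? AND status = 'active'",
            (entity_id, operation_type),
        ).fetchone()
        return row[0], False


conn = make_db()
first = create_job(conn, "123", "generate")   # creates the job
second = create_job(conn, "123", "generate")  # double-submit: same job ID back
```

Letting the database enforce uniqueness means the guard still holds when the UI button guard or an application-level check is bypassed.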
### TEST-RC-002: Two-Tab Same Resource Editing
Objective: Verify that editing the same resource in two browser tabs does not silently lose changes.
Steps:
- Open a resource (project, document, settings) in Tab A.
- Open the same resource in Tab B.
- In Tab A, change field X and save.
- In Tab B (still showing old data), change field Y and save.
- Reload the resource.
Pass Criteria (one of):
- Optimistic locking: Tab B's save is rejected with "Resource was modified. Please refresh."
- Field-level merge: Both changes are preserved (X from Tab A, Y from Tab B).
- Real-time sync: Tab B auto-updates when Tab A saves.
Fail Criteria:
- Tab B's save silently overwrites Tab A's changes (last-write-wins without warning).
- Both saves "succeed" but only one is persisted.
- No version tracking; overwrites are undetectable.
Implementation Check:
```
Optimistic locking pattern:
1. Read:   GET /resource/123 -> { data: {...}, version: 5 }
2. Write:  PUT /resource/123 { data: {...}, version: 5 }
3. Server: IF current_version != 5 THEN reject with 409 Conflict
4. Server: IF current_version == 5 THEN update, set version = 6

[ ] Version field exists on all mutable resources
[ ] Update endpoint checks version match
[ ] 409 Conflict response handled in UI with clear message
[ ] UI prompts user to refresh and re-apply changes
```
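For systems that choose the field-level merge option from the pass criteria instead of rejecting the second save, a three-way merge is one possible approach. The sketch below assumes flat records with the same keys on every side; `merge_fields` and its parameter names are hypothetical.

```python
from typing import Any, Dict, Set, Tuple


def merge_fields(base: Dict[str, Any], stored: Dict[str, Any],
                 mine: Dict[str, Any]) -> Tuple[Dict[str, Any], Set[str]]:
    """Three-way merge of flat records.

    base   : the version this client originally loaded
    stored : what the server holds now (after the other tab saved)
    mine   : this client's edited copy
    Returns (merged_record, conflicting_field_names).
    """
    merged = dict(stored)
    conflicts: Set[str] = set()
    for key in set(base) | set(stored) | set(mine):
        b, s, m = base.get(key), stored.get(key), mine.get(key)
        if m == b:
            continue  # this client did not touch the field; keep stored value
        if s == b or s == m:
            merged[key] = m  # only this client changed it (or both agree)
        else:
            conflicts.add(key)  # both changed it differently: surface to user
    return merged, conflicts


base = {"x": 1, "y": 1}
tab_a = {"x": 2, "y": 1}  # Tab A changed X and saved first
tab_b = {"x": 1, "y": 5}  # Tab B, still on old data, changed Y

merged, conflicts = merge_fields(base, tab_a, tab_b)

# Variation: Tab B edits the same field as Tab A.
same_field, same_field_conflicts = merge_fields(base, tab_a, {"x": 3, "y": 1})
```

Fields listed in the conflict set still need a user-facing resolution step; the merge only removes the false conflicts where the two tabs touched different fields.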
### TEST-RC-003: Two Users Same Project Simultaneously
Objective: Verify that concurrent access by different users does not cause data corruption.
Steps:
- User A and User B both have access to Project X.
- User A adds Asset 1 to the project.
- Simultaneously, User B adds Asset 2 to the project.
- Both save.
- Reload: verify both assets are present.
Pass Criteria:
- Both assets are present after both saves complete.
- No data corruption in project metadata.
- Audit log shows both users' actions distinctly.
- If conflict, at least one user is notified.
Test Variations:
- Both users edit the SAME field -> conflict detection required.
- Both users add to a COLLECTION -> both should succeed (no conflict).
- One user deletes while another edits -> clear error for the editor.
### TEST-RC-004: Delayed Job Finishing After Newer One
Objective: Verify that a slow old job completing does not overwrite a newer job's results.
Steps:
- Start Job A for Asset X (generation/processing).
- Job A takes unusually long (simulate with delay).
- User triggers Job B for the same Asset X (e.g., "regenerate").
- Job B completes first with Result B.
- Job A finally completes with Result A (now stale).
- Check: which result is stored for Asset X?
Pass Criteria:
- Result B (newer) is the active result for Asset X.
- Result A is either discarded or stored as a previous version (not active).
- The UI shows Result B, not Result A.
- Job A detects it was superseded and does not overwrite.
Implementation Check:
```
[ ] Job writes check: "Am I still the latest job for this resource?"
[ ] Version/sequence number compared before write
[ ] Superseded jobs are cancelled or their writes are no-ops
[ ] Write condition: WHERE version = expected_version or WHERE job_id = latest_job_id
```
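The "am I still the latest job?" check above might look like the following. This is an in-memory sketch, assuming a hypothetical `ResultStore` class in place of the real job table; in a database the same guard would be the conditional `WHERE job_id = latest_job_id` write.

```python
import threading
from typing import Dict, Optional


class ResultStore:
    """Tracks the latest job per asset and ignores writes from superseded jobs."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next_job_id = 0
        self._latest_job: Dict[str, int] = {}
        self._active_result: Dict[str, str] = {}

    def start_job(self, asset_id: str) -> int:
        with self._lock:
            self._next_job_id += 1
            self._latest_job[asset_id] = self._next_job_id  # supersedes older jobs
            return self._next_job_id

    def complete_job(self, asset_id: str, job_id: int, result: str) -> bool:
        """Store the result only if job_id is still the latest for the asset."""
        with self._lock:
            if self._latest_job.get(asset_id) != job_id:
                return False  # superseded: the write becomes a no-op
            self._active_result[asset_id] = result
            return True

    def active_result(self, asset_id: str) -> Optional[str]:
        with self._lock:
            return self._active_result.get(asset_id)


store = ResultStore()
job_a = store.start_job("asset-x")  # slow job
job_b = store.start_job("asset-x")  # user clicks "regenerate"
b_written = store.complete_job("asset-x", job_b, "Result B")  # finishes first
a_written = store.complete_job("asset-x", job_a, "Result A")  # stale, arrives late
```

The check and the write happen under one lock here; the database equivalent needs them in a single statement or transaction for the same reason.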
### TEST-RC-005: Concurrent Bulk + Individual Operation
Objective: Verify that a bulk operation and individual edit on overlapping resources do not conflict.
Steps:
- Start a bulk operation on 20 items (e.g., "regenerate all assets in project").
- While bulk is processing, individually edit one of those 20 items.
- Wait for both to complete.
- Inspect the individually edited item.
Pass Criteria:
- Individual edit takes precedence (user intent is more specific).
- OR bulk operation skips items being individually edited.
- OR conflict is surfaced to user clearly.
- No corrupted state: item is in ONE consistent state.
### TEST-RC-006: Counter / Aggregate Consistency
Objective: Verify that counters and aggregates remain accurate under concurrent modification.
Steps:
- Check a counter value (e.g., project.asset_count = 10).
- Simultaneously add 3 assets from different sources (API, bulk, job).
- After all complete, check counter value.
Pass Criteria:
- Counter = 13 (10 + 3). Not 11, not 12.
- Atomic increment used (not read-increment-write).
- OR counter is derived (COUNT query) rather than stored.
Implementation Check:
```sql
-- BAD: read-modify-write (race condition)
SELECT asset_count FROM projects WHERE id = 1;      -- returns 10
UPDATE projects SET asset_count = 11 WHERE id = 1;  -- two processes both write 11

-- GOOD: atomic increment
UPDATE projects SET asset_count = asset_count + 1 WHERE id = 1;

-- BEST: derived count (no counter to drift)
SELECT COUNT(*) FROM assets WHERE project_id = 1;
```
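To make the lost update concrete, the sketch below replays the bad interleaving by hand on one SQLite connection rather than relying on real thread timing, then repeats the two increments atomically. The two-project setup and variable names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects (id INTEGER PRIMARY KEY, asset_count INTEGER NOT NULL)"
)
conn.execute("INSERT INTO projects VALUES (1, 10), (2, 10)")


def read_count(project_id: int) -> int:
    row = conn.execute(
        "SELECT asset_count FROM projects WHERE id = ?", (project_id,)
    ).fetchone()
    return row[0]


# Project 1: read-modify-write. Both "processes" read before either writes.
seen_by_p = read_count(1)  # process P reads 10
seen_by_q = read_count(1)  # process Q reads 10
conn.execute("UPDATE projects SET asset_count = ? WHERE id = 1", (seen_by_p + 1,))
conn.execute("UPDATE projects SET asset_count = ? WHERE id = 1", (seen_by_q + 1,))
lost_update_count = read_count(1)  # one increment is lost

# Project 2: atomic increment. The database does the read and write together.
for _ in range(2):
    conn.execute("UPDATE projects SET asset_count = asset_count + 1 WHERE id = 2")
atomic_count = read_count(2)

print(f"read-modify-write: {lost_update_count}, atomic: {atomic_count}")
```

Scripting the interleaving makes the failure reproducible on every run, which is useful for a regression test; a real concurrent test only hits the window occasionally.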
### TEST-RC-007: Distributed Lock Verification
Objective: Verify that distributed locks work correctly and do not deadlock.
Steps:
- Identify all places where locks are used (DB row locks, Redis locks, mutexes).
- For each lock, verify:
  - Lock has a TTL (time-to-live) to prevent permanent deadlock.
  - Lock is released on both success and failure.
  - Lock holder ID is tracked (to prevent accidental release by another process).
- Test: acquire lock, simulate crash (do not release), verify lock auto-expires.
Pass Criteria:
- All locks have TTL configured.
- Lock TTL is longer than the operation's timeout, so the lock cannot expire while its holder is still running.
- Orphaned locks are automatically cleaned up.
- Lock acquisition failures return clear errors (not hangs).
- No deadlock potential (locks always acquired in consistent order).
Lock Audit Template:
| Lock Name | Type | TTL | Scope | Auto-Release | Deadlock Risk |
|-----------|------|-----|-------|-------------|---------------|
| job_lock | Redis | 300s | per-job | on crash: yes (TTL) | LOW |
| edit_lock | DB row | 60s | per-resource | on crash: yes (transaction rollback) | LOW |
| queue_lock | Redis | 30s | per-queue | on crash: yes (TTL) | MEDIUM (if nested) |
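The three lock properties audited above (TTL, holder tracking, auto-expiry after a crash) can be exercised against a small model. The `TTLLock` class below is a hypothetical single-process stand-in for a Redis or database lock, with an injectable clock so expiry can be tested without sleeping; it is not thread-safe and is meant only to show the expected behaviour.

```python
import time
import uuid
from typing import Callable, Dict, Optional, Tuple


class TTLLock:
    """Named locks that expire after a TTL and can only be released by their holder."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._held: Dict[str, Tuple[str, float]] = {}  # name -> (token, expires_at)

    def acquire(self, name: str, ttl_seconds: float) -> Optional[str]:
        """Return a holder token, or None immediately if the lock is taken."""
        now = self._clock()
        current = self._held.get(name)
        if current is not None and current[1] > now:
            return None  # still held and not expired: fail fast, do not hang
        token = uuid.uuid4().hex
        self._held[name] = (token, now + ttl_seconds)
        return token

    def release(self, name: str, token: str) -> bool:
        """Release only if the caller's token matches the current holder."""
        current = self._held.get(name)
        if current is None or current[0] != token:
            return False
        del self._held[name]
        return True


now = [0.0]  # fake clock, advanced manually
lock = TTLLock(clock=lambda: now[0])

crashed_token = lock.acquire("job_lock", ttl_seconds=300)   # holder then "crashes"
blocked = lock.acquire("job_lock", ttl_seconds=300)          # second worker is refused
now[0] = 301.0                                               # TTL elapses
recovered_token = lock.acquire("job_lock", ttl_seconds=300)  # orphaned lock reclaimed
stale_release = lock.release("job_lock", crashed_token)      # old holder cannot release
```

The last line is the case holder tracking exists for: a process that wakes up after its lock expired must not release the lock now owned by someone else.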
## Dedupe Key Patterns
### For API Requests
```
Idempotency key: client-generated UUID sent in header
  X-Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000

Server behavior:
  1. Check if key exists in idempotency store
  2. If exists: return cached response (do not re-execute)
  3. If not: execute, store response keyed by idempotency key
  4. Key expires after 24 hours

Storage: Redis with TTL, or DB table with cleanup job
```
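The server behavior listed above could be sketched like this. `IdempotencyStore` is a hypothetical in-memory stand-in for the Redis or DB-backed store, and the injectable clock exists only so the 24-hour expiry can be tested.

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    """Executes an operation once per key and replays the cached response."""

    def __init__(self, ttl_seconds: float = 24 * 60 * 60,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._responses: Dict[str, Tuple[Any, float]] = {}  # key -> (response, expires_at)

    def run(self, key: str, operation: Callable[[], Any]) -> Any:
        # Holding the lock makes check-then-execute atomic, so two simultaneous
        # requests carrying the same key cannot both execute the operation.
        with self._lock:
            now = self._clock()
            cached = self._responses.get(key)
            if cached is not None and cached[1] > now:
                return cached[0]  # replay; do not re-execute
            response = operation()
            self._responses[key] = (response, now + self._ttl)
            return response


now = [0.0]
store = IdempotencyStore(clock=lambda: now[0])
created = []


def create_record() -> dict:
    created.append("record")
    return {"id": len(created)}


key = "550e8400-e29b-41d4-a716-446655440000"
first = store.run(key, create_record)
second = store.run(key, create_record)  # retry or double-submit: cached response
```

A single global lock keeps the sketch short but serializes all requests; a production store would more likely rely on an atomic set-if-absent in the backing store.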
### For Background Jobs
```
Dedupe key: hash of (entity_id + operation_type + parameters)

Before enqueue: check if active job exists with same dedupe key
  If exists: return existing job ID (do not enqueue duplicate)
  If not: enqueue new job with dedupe key

Cleanup: dedupe key cleared when job reaches terminal state
```
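One way to build the dedupe key described above is to hash a canonical JSON encoding, so that parameter ordering does not produce different keys. The `dedupe_key` function and `JobQueue` class are hypothetical, and the queue is an in-memory, single-threaded model.

```python
import hashlib
import json
from typing import Any, Dict


def dedupe_key(entity_id: str, operation_type: str, parameters: Dict[str, Any]) -> str:
    # sort_keys gives a canonical encoding: same parameters, same key,
    # regardless of the order in which the caller built the dict.
    payload = json.dumps(
        [entity_id, operation_type, parameters],
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JobQueue:
    def __init__(self) -> None:
        self._next_id = 0
        self._active: Dict[str, int] = {}  # dedupe key -> active job ID

    def enqueue(self, entity_id: str, operation_type: str,
                parameters: Dict[str, Any]) -> int:
        key = dedupe_key(entity_id, operation_type, parameters)
        if key in self._active:
            return self._active[key]  # duplicate: return the existing job ID
        self._next_id += 1
        self._active[key] = self._next_id
        return self._next_id

    def finish(self, job_id: int) -> None:
        """Job reached a terminal state: clear its dedupe key."""
        for key, active_id in list(self._active.items()):
            if active_id == job_id:
                del self._active[key]


queue = JobQueue()
job_1 = queue.enqueue("123", "regenerate", {"size": "large", "format": "png"})
job_dup = queue.enqueue("123", "regenerate", {"format": "png", "size": "large"})
job_other = queue.enqueue("123", "regenerate", {"size": "small", "format": "png"})
queue.finish(job_1)
job_again = queue.enqueue("123", "regenerate", {"size": "large", "format": "png"})
```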
### For Webhooks / Callbacks
```
Dedupe key: webhook event ID (provided by sender)

On receive: check if event ID already processed
  If processed: return 200 (acknowledge) without re-executing
  If not: process, record event ID with timestamp

Retention: event IDs retained for 7 days minimum
```
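A database primary key can perform the "already processed?" check atomically. The sketch below assumes SQLite and a hypothetical `processed_events` table and `handle_webhook` function; recording the event and processing it share one transaction, so a failed handler leaves no record and the sender's retry is processed normally.

```python
import sqlite3
import time
from typing import Callable

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_events ("
    " event_id TEXT PRIMARY KEY,"
    " received_at REAL NOT NULL)"
)


def handle_webhook(event_id: str, process: Callable[[], None]) -> int:
    """Return an HTTP-style status; process() runs at most once per event ID."""
    with conn:  # commits on success, rolls back if process() raises
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id, received_at)"
            " VALUES (?, ?)",
            (event_id, time.time()),
        )
        if cur.rowcount == 1:  # row inserted: first time this event is seen
            process()
    return 200  # acknowledge duplicates too, so the sender stops retrying


calls = []
status_first = handle_webhook("evt_1", lambda: calls.append("evt_1"))
status_repeat = handle_webhook("evt_1", lambda: calls.append("evt_1"))
```

The stored `received_at` timestamp is what a cleanup job would use to enforce the retention window.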
## Optimistic Locking Implementation Guide
```
1. Add `version` (integer) or `updated_at` (timestamp) to every mutable entity.

2. Every UPDATE includes the version check:
   UPDATE resources
   SET data = ?, version = version + 1
   WHERE id = ? AND version = ?;

   If rows_affected = 0: version conflict -> return 409.

3. API response always includes current version:
   { "id": "123", "data": {...}, "version": 5 }

4. Client sends version back on update:
   PUT /resources/123 { "data": {...}, "version": 5 }

5. UI handles 409 Conflict:
   - Show: "This resource was modified by another user. Refresh to see changes."
   - Optionally: show diff of changes.
```
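Step 2 of the guide can be checked end to end with a few lines. This sketch assumes SQLite and a hypothetical `update_resource` helper that maps the affected-row count to an HTTP-style status.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources ("
    " id INTEGER PRIMARY KEY,"
    " data TEXT NOT NULL,"
    " version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO resources VALUES (123, 'original', 5)")


def update_resource(resource_id: int, data: str, expected_version: int) -> int:
    """Return 200 if the update applied, 409 if the version no longer matches."""
    with conn:
        cur = conn.execute(
            "UPDATE resources SET data = ?, version = version + 1"
            " WHERE id = ? AND version = ?",
            (data, resource_id, expected_version),
        )
    # Zero affected rows means another writer already bumped the version.
    # (A missing ID also yields zero rows; a real endpoint would return 404 there.)
    return 200 if cur.rowcount == 1 else 409


status_a = update_resource(123, "edit from Tab A", expected_version=5)
status_b = update_resource(123, "edit from Tab B", expected_version=5)  # stale version
final_row = conn.execute(
    "SELECT data, version FROM resources WHERE id = 123"
).fetchone()
```

Because the version comparison lives inside the `UPDATE` statement itself, there is no gap between check and write for a second writer to slip into.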
## Post-Audit Checklist
```
[ ] All create endpoints have idempotency key support
[ ] All update endpoints use optimistic locking (version check)
[ ] Double-click protection on all trigger buttons (UI)
[ ] Background jobs deduplicated by entity + operation
[ ] Out-of-order job completion handled (version/sequence check on write)
[ ] Counters use atomic operations or derived queries
[ ] Distributed locks have TTL and auto-cleanup
[ ] Webhook/callback processing is idempotent
[ ] No read-modify-write patterns without concurrency control
[ ] Race condition test suite exists and runs in CI
```
## What Earlier Audits Miss
Standard testing runs operations sequentially. This audit matters because:
- Unit tests execute one operation at a time. Race conditions are invisible in sequential execution.
- Integration tests rarely send two requests simultaneously. The window for collision is milliseconds wide.
- Code reviews catch missing locks in obvious places but miss subtle read-modify-write patterns buried in business logic.
- QA testing typically uses one browser, so multi-user concurrent editing goes untested.
- Load testing measures throughput and latency, not data correctness under concurrent writes.

A Concurrency & Race Condition Audit fills that gap: it tests specifically whether the system produces correct, consistent results when multiple operations execute simultaneously on shared resources.
## Automation Opportunities
| Test | Automatable? | Method |
|---|---|---|
| TEST-RC-001: Double-click | YES | Concurrent curl requests; assert single resource created |
| TEST-RC-002: Two-tab editing | PARTIAL | Selenium: open two tabs, edit same resource, assert conflict detection |
| TEST-RC-003: Two users | YES | Concurrent API requests with different auth; assert both changes preserved |
| TEST-RC-004: Delayed job | YES | Mock: start old job, start new job, complete new first, complete old, assert newer wins |
| TEST-RC-005: Bulk + individual | PARTIAL | Concurrent API calls; assert consistent final state |
| TEST-RC-006: Counter consistency | YES | Concurrent increment requests; assert final count matches expected |
| TEST-RC-007: Lock verification | YES | Acquire lock, simulate crash, assert lock auto-expires |
```bash
# Automated race condition test: concurrent counter increment
INITIAL=$(curl -s /api/projects/123 | jq '.asset_count')
for i in $(seq 1 10); do
  curl -s -X POST /api/projects/123/assets -d "{\"name\": \"asset-$i\"}" &
done
wait
sleep 2  # Allow eventual consistency
FINAL=$(curl -s /api/projects/123 | jq '.asset_count')
EXPECTED=$((INITIAL + 10))
if [ "$FINAL" -eq "$EXPECTED" ]; then
  echo "PASS: count=$FINAL"
else
  echo "FAIL: expected=$EXPECTED got=$FINAL"
fi
```
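For tests that run in-process rather than over HTTP, a small harness can release all workers at the same instant, which narrows the gap between "concurrent" requests far more than backgrounded shell commands do. `run_concurrently` and the `Counter` class are hypothetical helpers; the counter is deliberately the correct, lock-protected version so the example is deterministic.

```python
import threading
from typing import Any, Callable, List, Sequence


def run_concurrently(tasks: Sequence[Callable[[], Any]]) -> List[Any]:
    """Start one thread per task, release them together, collect results in order.

    A task that raises has its exception stored in its result slot.
    """
    barrier = threading.Barrier(len(tasks))
    results: List[Any] = [None] * len(tasks)

    def worker(index: int, task: Callable[[], Any]) -> None:
        barrier.wait()  # every thread blocks here until all are ready
        try:
            results[index] = task()
        except Exception as exc:
            results[index] = exc

    threads = [
        threading.Thread(target=worker, args=(i, task))
        for i, task in enumerate(tasks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


class Counter:
    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> int:
        with self._lock:  # atomic read-increment-write
            self.value += 1
            return self.value


counter = Counter()
results = run_concurrently([counter.increment] * 10)
print("PASS" if counter.value == 10 else f"FAIL: got {counter.value}")
```

Swapping `counter.increment` for a function that calls the system under test turns the same harness into a check for TEST-RC-001, TEST-RC-003, or TEST-RC-006.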
## Reusable Audit Report Template
```
# Concurrency & Race Condition Audit Report

## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________

## Concurrent Access Points Identified
| Resource | Concurrent Access Pattern | Protection | Verdict |
|----------|-------------------------|------------|---------|
| ___ | Two users editing | Optimistic lock / None | PASS/FAIL |

## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-RC-001 | Double-click | PASS/FAIL | Duplicates created: ___ |
| TEST-RC-002 | Two-tab editing | PASS/FAIL | Conflict detected: yes/no |
| TEST-RC-003 | Two users | PASS/FAIL | Data lost: yes/no |
| TEST-RC-004 | Delayed job | PASS/FAIL | Stale data surfaced: yes/no |
| TEST-RC-005 | Bulk + individual | PASS/FAIL | Consistent state: yes/no |
| TEST-RC-006 | Counter consistency | PASS/FAIL | Expected: ___, actual: ___ |
| TEST-RC-007 | Lock verification | PASS/FAIL | Orphaned locks cleaned: yes/no |

## Score: PASS / PARTIAL / FAIL
```
## Priority Targeting
Run this audit FIRST if:
- Users report "my changes disappeared"
- Duplicate records appear in the database
- Billing shows double charges
- Background jobs produce duplicate outputs
- The system has multiple workers processing the same queue
- Any endpoint can be called concurrently by design (webhooks, APIs)