
Concurrency & Race Condition Audit

Purpose

Verify that the system behaves correctly when multiple operations happen simultaneously. Race conditions are among the hardest bugs to detect in testing and the most damaging in production. They cause data corruption, duplicate records, lost updates, and security breaches.

This audit systematically tests every path where two operations can collide.


Scope

| Concurrency Scenario | What We Test |
|----------------------|--------------|
| Two users editing same resource | Last-write-wins, merge conflicts, data loss |
| Two jobs processing same item | Duplicate output, corrupted state |
| Double-click / double-submit | Duplicate records, double billing |
| Two tabs, same user | Session conflicts, stale data overwrites |
| Delayed job completing after newer one | Out-of-order writes, stale data surfacing |
| Bulk operation + individual edit | Conflicting writes, partial visibility |
| Read-modify-write cycles | Lost updates, inconsistent derived state |

Risk Pattern Table

| Pattern | What It Hits | Risk | Symptom |
|---------|--------------|------|---------|
| Read-modify-write without lock | Data | CRITICAL | Two users read version 1, both write "version 2", one update lost |
| Double-submit without dedupe | API, Data | HIGH | Two records created, double charge, double email |
| Last-write-wins without conflict detection | Data | HIGH | User A's changes silently overwritten by User B |
| Out-of-order job completion | Data, UI | MEDIUM | Old result overwrites newer result; stale data displayed |
| Non-atomic compound operation | Data | HIGH | Two-step operation (create + link) interrupted between steps |
| Cache not invalidated on write | Data, UI | MEDIUM | User sees stale data after another user's update |
| Shared mutable state in workers | Jobs | HIGH | Two workers modify same in-memory structure; corruption |
| Transaction isolation too low | DB | HIGH | Dirty reads, phantom reads, non-repeatable reads |
| File write without locking | Storage | MEDIUM | Two processes write same file; corrupted output |
| Counter increment without atomic operation | Data | MEDIUM | Two increments: expected +2, actual +1 (lost update) |

Concrete Test Cases

TEST-RC-001: Double-Click Generate / Submit

Objective: Verify that rapidly clicking a trigger button does not create duplicate work.

Steps:

  1. Navigate to a page with a "Generate" or "Submit" button.
  2. Click the button twice in rapid succession (< 200ms apart).
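  3. Alternatively, use the API to send two identical POST requests simultaneously (BASE_URL is a placeholder for the service under test):

     ```bash
     curl -s -X POST "$BASE_URL/api/generate" -H "Content-Type: application/json" -d '{"project_id": "123"}' &
     curl -s -X POST "$BASE_URL/api/generate" -H "Content-Type: application/json" -d '{"project_id": "123"}' &
     wait
     ```

  4. Inspect results.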

Pass Criteria:

  • Only one job/record is created.
  • Second request receives either an "already in progress" response or the same job ID.
  • UI button is disabled after first click (optimistic guard).
  • No duplicate entries in any table.

Implementation Verification:

[ ] UI: Button disabled on click, re-enabled on response/timeout
[ ] API: Idempotency key required on create endpoints
[ ] API: Mutex check before job creation (SELECT FOR UPDATE or equivalent)
[ ] DB: Partial unique index on (entity_id, operation_type) WHERE status = 'active'

TEST-RC-002: Two-Tab Same Resource Editing

Objective: Verify that editing the same resource in two browser tabs does not silently lose changes.

Steps:

  1. Open a resource (project, document, settings) in Tab A.
  2. Open the same resource in Tab B.
  3. In Tab A, change field X and save.
  4. In Tab B (still showing old data), change field Y and save.
  5. Reload the resource.

Pass Criteria (one of):

  • Optimistic locking: Tab B's save is rejected with "Resource was modified. Please refresh."
  • Field-level merge: Both changes are preserved (X from Tab A, Y from Tab B).
  • Real-time sync: Tab B auto-updates when Tab A saves.

Fail Criteria:

  • Tab B's save silently overwrites Tab A's changes (last-write-wins without warning).
  • Both saves "succeed" but only one is persisted.
  • No version tracking; overwrites are undetectable.

Implementation Check:

Optimistic locking pattern:
1. Read: GET /resources/123 -> { data: {...}, version: 5 }
2. Write: PUT /resources/123 { data: {...}, version: 5 }
3. Server: IF current_version != 5 THEN reject with 409 Conflict
4. Server: IF current_version == 5 THEN update, set version = 6
(Steps 3 and 4 must run as one atomic conditional UPDATE, not as a separate read followed by a write.)

[ ] Version field exists on all mutable resources
[ ] Update endpoint checks version match
[ ] 409 Conflict response handled in UI with clear message
[ ] UI prompts user to refresh and re-apply changes

TEST-RC-003: Two Users Same Project Simultaneously

Objective: Verify that concurrent access by different users does not cause data corruption.

Steps:

  1. User A and User B both have access to Project X.
  2. User A adds Asset 1 to the project.
  3. Simultaneously, User B adds Asset 2 to the project.
  4. Both save.
  5. Reload: verify both assets are present.

Pass Criteria:

  • Both assets are present after both saves complete.
  • No data corruption in project metadata.
  • Audit log shows both users' actions distinctly.
  • If a conflict occurs, at least one user is notified.

Test Variations:

  • Both users edit the SAME field -> conflict detection required.
  • Both users add to a COLLECTION -> both should succeed (no conflict).
  • One user deletes while another edits -> clear error for the editor.
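Whether the COLLECTION variation passes depends on how the collection is stored. This minimal sketch replays the two-user interleaving deterministically, without real concurrency. It contrasts two designs: a project that keeps its asset list inside one record and updates it by read-modify-write, and a schema that stores one row per asset. The store, table, and names are hypothetical.

```python
import json
import sqlite3

# Model 1 (fails): the asset list lives inside the project record.
store = {"project:X": json.dumps({"assets": []})}

a_view = json.loads(store["project:X"])  # User A loads the project
b_view = json.loads(store["project:X"])  # User B loads it at the same time
a_view["assets"].append("Asset 1")
b_view["assets"].append("Asset 2")
store["project:X"] = json.dumps(a_view)  # A saves
store["project:X"] = json.dumps(b_view)  # B saves a copy that never saw Asset 1
blob_assets = json.loads(store["project:X"])["assets"]

# Model 2 (passes): each add is an independent INSERT, so nothing is overwritten.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assets (project_id TEXT NOT NULL, name TEXT NOT NULL)")
for name in ("Asset 1", "Asset 2"):  # A's add, then B's add
    with conn:
        conn.execute("INSERT INTO assets (project_id, name) VALUES (?, ?)", ("X", name))
row_assets = [name for (name,) in conn.execute(
    "SELECT name FROM assets WHERE project_id = ? ORDER BY name", ("X",))]

print(blob_assets)  # ['Asset 2']  (Asset 1 was silently lost)
print(row_assets)   # ['Asset 1', 'Asset 2']
```

If the collection has to stay inside a single record, the same protection can come from optimistic locking on that record, as in TEST-RC-002.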

TEST-RC-004: Delayed Job Finishing After Newer One

Objective: Verify that a slow old job completing does not overwrite a newer job's results.

Steps:

  1. Start Job A for Asset X (generation/processing).
  2. Job A takes unusually long (simulate with delay).
  3. User triggers Job B for the same Asset X (e.g., "regenerate").
  4. Job B completes first with Result B.
  5. Job A finally completes with Result A (now stale).
  6. Check: which result is stored for Asset X?

Pass Criteria:

  • Result B (newer) is the active result for Asset X.
  • Result A is either discarded or stored as a previous version (not active).
  • The UI shows Result B, not Result A.
  • Job A detects it was superseded and does not overwrite.

Implementation Check:

[ ] Job writes check: "Am I still the latest job for this resource?"
[ ] Version/sequence number compared before write
[ ] Superseded jobs are cancelled or their writes are no-ops
[ ] Write condition: WHERE version = expected_version or WHERE job_id = latest_job_id

TEST-RC-005: Concurrent Bulk + Individual Operation

Objective: Verify that a bulk operation and individual edit on overlapping resources do not conflict.

Steps:

  1. Start a bulk operation on 20 items (e.g., "regenerate all assets in project").
  2. While bulk is processing, individually edit one of those 20 items.
  3. Wait for both to complete.
  4. Inspect the individually edited item.

Pass Criteria:

  • Individual edit takes precedence (user intent is more specific).
  • OR bulk operation skips items being individually edited.
  • OR conflict is surfaced to user clearly.
  • No corrupted state: item is in ONE consistent state.

TEST-RC-006: Counter / Aggregate Consistency

Objective: Verify that counters and aggregates remain accurate under concurrent modification.

Steps:

  1. Check a counter value (e.g., project.asset_count = 10).
  2. Simultaneously add 3 assets from different sources (API, bulk, job).
  3. After all complete, check counter value.

Pass Criteria:

  • Counter = 13 (10 + 3). Not 11, not 12.
  • Atomic increment used (not read-increment-write).
  • OR counter is derived (COUNT query) rather than stored.

Implementation Check:

```sql
-- BAD: read-modify-write (race condition)
SELECT asset_count FROM projects WHERE id = 1;  -- returns 10
UPDATE projects SET asset_count = 11 WHERE id = 1;  -- two processes both write 11

-- GOOD: atomic increment
UPDATE projects SET asset_count = asset_count + 1 WHERE id = 1;

-- BEST: derived count (no counter to drift)
SELECT COUNT(*) FROM assets WHERE project_id = 1;
```

TEST-RC-007: Distributed Lock Verification

Objective: Verify that distributed locks work correctly and do not deadlock.

Steps:

  1. Identify all places where locks are used (DB row locks, Redis locks, mutex).
  2. For each lock, verify:
    • Lock has a TTL (time-to-live) to prevent permanent deadlock.
    • Lock is released on both success and failure.
    • Lock holder ID is tracked (to prevent accidental release by another process).
  3. Test: acquire lock, simulate crash (do not release), verify lock auto-expires.

Pass Criteria:

  • All locks have TTL configured.
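  • Lock TTL is at least as long as the operation's timeout (or the holder renews the lock while working), so a lock cannot expire while its operation is still running.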
  • Orphaned locks are automatically cleaned up.
  • Lock acquisition failures return clear errors (not hangs).
  • No deadlock potential (locks always acquired in consistent order).

Lock Audit Template:

| Lock Name | Type | TTL | Scope | Auto-Release | Deadlock Risk |
|-----------|------|-----|-------|-------------|---------------|
| job_lock  | Redis | 300s | per-job | on crash: yes (TTL) | LOW |
| edit_lock | DB row | 60s | per-resource | on crash: yes (transaction rollback) | LOW |
| queue_lock | Redis | 30s | per-queue | on crash: yes (TTL) | MEDIUM (if nested) |
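The three per-lock checks (TTL, release on both success and failure, holder ID) fit in a few lines. The sketch below is an in-memory stand-in, not a distributed lock. `TtlLockManager` and `run_locked` are hypothetical names, and the injectable clock exists only so the crash-and-expire step can be tested without waiting.

```python
import threading
import time
import uuid
from typing import Callable, Optional


class TtlLockManager:
    """In-memory stand-in for a distributed lock service (hypothetical sketch)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._mutex = threading.Lock()
        self._locks: dict[str, tuple[str, float]] = {}  # name -> (holder token, expires_at)

    def acquire(self, name: str, ttl: float) -> Optional[str]:
        """Return a holder token, or None at once if someone else holds it (no hang)."""
        with self._mutex:
            now = self._clock()
            held = self._locks.get(name)
            if held is not None and held[1] > now:
                return None
            # Free, or the previous holder crashed and its TTL ran out.
            token = uuid.uuid4().hex
            self._locks[name] = (token, now + ttl)
            return token

    def release(self, name: str, token: str) -> bool:
        """Release only if the caller still holds the lock (compare-and-delete)."""
        with self._mutex:
            held = self._locks.get(name)
            if held is None or held[0] != token:
                return False  # expired and taken over: must not free another holder's lock
            del self._locks[name]
            return True


def run_locked(locks: TtlLockManager, name: str, ttl: float, work: Callable[[], object]) -> object:
    token = locks.acquire(name, ttl)
    if token is None:
        raise RuntimeError(f"lock {name!r} is held by another process")
    try:
        return work()
    finally:
        locks.release(name, token)  # runs on success and on failure
```

With Redis, the same pattern is usually built from `SET name token NX PX <ttl_ms>` to acquire and a small Lua script that deletes the key only if its value still equals the caller's token.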

Dedupe Key Patterns

For API Requests

Idempotency key: client-generated UUID sent in header
X-Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000

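Server behavior:
1. Atomically reserve the key in the idempotency store (e.g., insert under a unique constraint, or Redis SET NX); a separate check followed by an insert lets two simultaneous requests both pass the check
2. If the key already holds a stored response: return the cached response (do not re-execute); if it is still reserved by an in-flight request: reject or wait, never execute again
3. If the reservation succeeded: execute, then store the response keyed by the idempotency key
4. Key expires after 24 hours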

Storage: Redis with TTL, or DB table with cleanup job
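Here is a minimal in-process sketch of the server behavior above, assuming a single process. A Python dict guarded by a mutex stands in for the Redis or DB store, and `IdempotencyStore`, `execute`, and `RequestInProgress` are hypothetical names. The property it illustrates is that the existence check and the reservation happen under one lock, so two simultaneous requests with the same key cannot both execute.

```python
import threading
import time
from typing import Any, Callable


class RequestInProgress(Exception):
    """A request with the same key is still executing (an API might map this to 409)."""


_PENDING = object()  # sentinel: key reserved, response not stored yet


class IdempotencyStore:
    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._mutex = threading.Lock()
        self._entries: dict[str, tuple[float, Any]] = {}  # key -> (expires_at, response)

    def execute(self, key: str, handler: Callable[[], Any]) -> tuple[Any, bool]:
        """Return (response, replayed)."""
        with self._mutex:
            now = self._clock()
            entry = self._entries.get(key)
            if entry is not None and entry[0] > now:
                if entry[1] is _PENDING:
                    raise RequestInProgress(key)
                return entry[1], True  # replay the cached response; do not re-execute
            # Check and reserve under the same mutex: a concurrent duplicate sees _PENDING.
            self._entries[key] = (now + self._ttl, _PENDING)
        try:
            response = handler()
        except Exception:
            with self._mutex:
                self._entries.pop(key, None)  # failed: let the client retry with the same key
            raise
        with self._mutex:
            self._entries[key] = (self._clock() + self._ttl, response)
        return response, False


store = IdempotencyStore()
calls = []


def create_order():
    calls.append("executed")
    return {"order_id": len(calls)}


key = "550e8400-e29b-41d4-a716-446655440000"  # value of the X-Idempotency-Key header
first = store.execute(key, create_order)  # executes the handler
again = store.execute(key, create_order)  # replays the stored response
```

A shared Redis or DB store needs the same atomic reserve step. That is why the storage line above pairs the key with a TTL or a cleanup job rather than with a plain read-then-write.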

For Background Jobs

Dedupe key: hash of (entity_id + operation_type + parameters)
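Before enqueue: check if an active job exists with the same dedupe key (the check and the enqueue must be one atomic step, e.g., under a lock or a unique constraint)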
If exists: return existing job ID (do not enqueue duplicate)
If not: enqueue new job with dedupe key

Cleanup: dedupe key cleared when job reaches terminal state
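Below is a sketch of the dedupe key and enqueue rule above, assuming a single-process queue. `dedupe_key`, `DedupingQueue`, and `mark_terminal` are illustrative names, not a real queue library's API. Parameters are serialized with sorted keys before hashing, so `{"w": 1, "h": 2}` and `{"h": 2, "w": 1}` produce the same key.

```python
import hashlib
import itertools
import json
import threading
from typing import Any


def dedupe_key(entity_id: str, operation_type: str, params: dict[str, Any]) -> str:
    """Hash of (entity_id + operation_type + parameters), independent of dict order."""
    canonical = json.dumps(
        {"entity_id": entity_id, "operation_type": operation_type, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupingQueue:
    """Hypothetical in-process queue; a real one would keep active keys in Redis or a DB."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._ids = itertools.count(1)
        self._active: dict[str, int] = {}  # dedupe key -> active job ID
        self._key_of: dict[int, str] = {}  # job ID -> dedupe key
        self.pending: list[int] = []       # job IDs waiting for a worker

    def enqueue(self, entity_id: str, operation_type: str,
                params: dict[str, Any]) -> tuple[int, bool]:
        key = dedupe_key(entity_id, operation_type, params)
        with self._mutex:  # check and enqueue as one atomic step
            existing = self._active.get(key)
            if existing is not None:
                return existing, False  # duplicate: return the existing job ID
            job_id = next(self._ids)
            self._active[key] = job_id
            self._key_of[job_id] = key
            self.pending.append(job_id)
            return job_id, True

    def mark_terminal(self, job_id: int) -> None:
        """Call on success, failure, or cancellation; clears the dedupe key."""
        with self._mutex:
            key = self._key_of.pop(job_id, None)
            if key is not None and self._active.get(key) == job_id:
                del self._active[key]


q = DedupingQueue()
a = q.enqueue("asset-9", "render", {"width": 1920, "height": 1080})
b = q.enqueue("asset-9", "render", {"height": 1080, "width": 1920})  # same params, other order
```

Clearing the key on every terminal state, including failure, matters. If it is skipped, a failed job would block retries of the same operation until someone intervenes.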

For Webhooks / Callbacks

Dedupe key: webhook event ID (provided by sender)
On receive: check if event ID already processed
If processed: return 200 (acknowledge) without re-executing
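If not: process and record the event ID with a timestamp in one atomic step (e.g., the same DB transaction), so a concurrent redelivery cannot be processed twice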
Retention: event IDs retained for 7 days minimum

Optimistic Locking Implementation Guide

1. Add `version` (integer) or `updated_at` (timestamp) to every mutable entity.

2. Every UPDATE includes the version check:

   ```sql
   UPDATE resources
   SET data = ?, version = version + 1
   WHERE id = ? AND version = ?;
   ```

   If rows_affected = 0: version conflict -> return 409 (unless the row no longer exists; a follow-up read can distinguish this and return 404).

3. API response always includes current version:

   ```
   { "id": "123", "data": {...}, "version": 5 }
   ```

4. Client sends version back on update:

   ```
   PUT /resources/123 { "data": {...}, "version": 5 }
   ```

5. UI handles 409 Conflict:
   - Show: "This resource was modified by another user. Refresh to see changes."
   - Optionally: show diff of changes.

Post-Audit Checklist

[ ] All create endpoints have idempotency key support
[ ] All update endpoints use optimistic locking (version check)
[ ] Double-click protection on all trigger buttons (UI)
[ ] Background jobs deduplicated by entity + operation
[ ] Out-of-order job completion handled (version/sequence check on write)
[ ] Counters use atomic operations or derived queries
[ ] Distributed locks have TTL and auto-cleanup
[ ] Webhook/callback processing is idempotent
[ ] No read-modify-write patterns without concurrency control
[ ] Race condition test suite exists and runs in CI

What Earlier Audits Miss

Standard testing runs operations sequentially. This audit matters because:

  • Unit tests execute one operation at a time. Race conditions are invisible in sequential execution.
  • Integration tests rarely send two requests simultaneously. The window for collision is milliseconds wide.
  • Code reviews catch missing locks in obvious places but miss subtle read-modify-write patterns buried in business logic.
  • QA testing typically uses one browser and one account, so multi-user concurrent editing is rarely exercised.
  • Load testing measures throughput and latency, not data correctness under concurrent writes.

In short, a Concurrency & Race Condition Audit tests whether the system produces correct, consistent results when multiple operations execute simultaneously on shared resources.


Automation Opportunities

| Test | Automatable? | Method |
|------|--------------|--------|
| TEST-RC-001: Double-click | YES | Concurrent curl requests; assert single resource created |
| TEST-RC-002: Two-tab editing | PARTIAL | Selenium: open two tabs, edit same resource, assert conflict detection |
| TEST-RC-003: Two users | YES | Concurrent API requests with different auth; assert both changes preserved |
| TEST-RC-004: Delayed job | YES | Mock: start old job, start new job, complete new first, complete old, assert newer wins |
| TEST-RC-005: Bulk + individual | PARTIAL | Concurrent API calls; assert consistent final state |
| TEST-RC-006: Counter consistency | YES | Concurrent increment requests; assert final count matches expected |
| TEST-RC-007: Lock verification | YES | Acquire lock, simulate crash, assert lock auto-expires |
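
```bash
#!/usr/bin/env bash
# Automated race condition test: concurrent counter increment.
# BASE_URL and the endpoints are placeholders for the system under test; use an isolated project.
set -euo pipefail
BASE_URL="${BASE_URL:-http://localhost:8080}"
N=10

INITIAL=$(curl -sf "$BASE_URL/api/projects/123" | jq '.asset_count')
for i in $(seq 1 "$N"); do
  curl -sf -X POST "$BASE_URL/api/projects/123/assets" \
    -H "Content-Type: application/json" \
    -d "{\"name\": \"asset-$i\"}" > /dev/null &
done
wait
sleep 2  # allow for eventual consistency (e.g., asynchronously updated counters)
FINAL=$(curl -sf "$BASE_URL/api/projects/123" | jq '.asset_count')
EXPECTED=$((INITIAL + N))
if [ "$FINAL" -eq "$EXPECTED" ]; then echo "PASS: count=$FINAL"; else echo "FAIL: expected=$EXPECTED got=$FINAL"; exit 1; fi
```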

Reusable Audit Report Template

```markdown
# Concurrency & Race Condition Audit Report

## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________

## Concurrent Access Points Identified
| Resource | Concurrent Access Pattern | Protection | Verdict |
|----------|---------------------------|------------|---------|
| ___ | Two users editing | Optimistic lock / None | PASS/FAIL |

## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-RC-001 | Double-click | PASS/FAIL | Duplicates created: ___ |
| TEST-RC-002 | Two-tab editing | PASS/FAIL | Conflict detected: yes/no |
| TEST-RC-003 | Two users | PASS/FAIL | Data lost: yes/no |
| TEST-RC-004 | Delayed job | PASS/FAIL | Stale data surfaced: yes/no |
| TEST-RC-005 | Bulk + individual | PASS/FAIL | Consistent state: yes/no |
| TEST-RC-006 | Counter consistency | PASS/FAIL | Expected: ___, actual: ___ |
| TEST-RC-007 | Lock verification | PASS/FAIL | Orphaned locks cleaned: yes/no |

## Score: PASS / PARTIAL / FAIL
```

Priority Targeting

Run this audit FIRST if:

  • Users report "my changes disappeared"
  • Duplicate records appear in the database
  • Billing shows double charges
  • Background jobs produce duplicate outputs
  • The system has multiple workers processing the same queue
  • Any endpoint can be called concurrently by design (webhooks, APIs)
