State Machine Audit

Purpose

Verify that workflow states are explicitly defined, transitions are validated, impossible states are prevented, and the system never gets stuck in an unrecoverable state. State management bugs are insidious: they cause jobs to hang forever, UIs to show contradictory information, and operators to resort to manual database edits.

This audit ensures that every entity with a lifecycle has a well-defined, enforceable state machine.


Scope

| Area | What We Test |
|------|--------------|
| State definition | Are all possible states explicitly enumerated? |
| Transitions | Are valid transitions defined and enforced? |
| Terminal states | Can every workflow reach a terminal state? |
| Impossible states | Can the system enter contradictory states? |
| Timeout / stuck detection | Are non-terminal states monitored for staleness? |
| UI accuracy | Does the UI accurately reflect backend state? |
| Error states | Are failures distinguishable and recoverable? |

Risk Pattern Table

| Pattern | What It Hits | Risk | Symptom |
|---------|--------------|------|---------|
| Stringly-typed states | Code quality | HIGH | Typo in status string causes silent bug; "procesing" != "processing" |
| No transition validation | Data integrity | HIGH | Job goes from "completed" back to "processing" |
| Missing terminal state reachability | Reliability | CRITICAL | Job stuck in "processing" forever; no timeout, no cleanup |
| Contradictory compound state | UI, Logic | HIGH | Asset is "completed" but has no output file |
| Implicit states | Maintainability | MEDIUM | State derived from multiple fields; hard to reason about |
| Missing "partial success" | UX, Data | HIGH | 18/20 items succeed but batch marked "failed" |
| No distinction between "pending" and "stuck" | Operations | HIGH | Cannot tell if job hasn't started vs is hung |
| UI shows stale state | UX | MEDIUM | User sees "processing" but job finished minutes ago |
| State updated without transition log | Auditability | MEDIUM | Cannot reconstruct how entity reached current state |
| Orphaned child states | Data integrity | HIGH | Parent cancelled but children still processing |

Enumeration Methodology

Step 1: Identify All Stateful Entities

List every entity in the system that has a lifecycle (status, state, phase):

| Entity | State Field | Storage | Current States |
|--------|------------|---------|---------------|
| Job | status | DB: jobs.status | queued, processing, completed, failed, cancelled, retrying |
| Asset | status | DB: assets.status | pending, generating, ready, error, deleted, regenerating |
| Project | status | DB: projects.status | draft, active, archived |
| Payment | status | DB: payments.status | pending, authorized, captured, refunded, failed, disputed |
| Upload | status | DB: uploads.status | uploading, processing, ready, failed |
| Invitation | status | DB: invitations.status | sent, accepted, expired, revoked |

Step 2: Verify State Definition Quality

For each entity, check:

[ ] States are defined as an enum (not free-form strings)
[ ] Every state is documented with its meaning
[ ] States are mutually exclusive (entity is in exactly one state)
[ ] No "compound states" derived from multiple boolean fields
    BAD:  is_processing=true, is_complete=false, has_error=false
    GOOD: status = 'processing'  (single field, enum value)
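
As a rough illustration of the BAD/GOOD contrast above, the following TypeScript sketch models an entity with a single `status` field and attaches to each state the data that state requires. The `Asset` shapes, the field names (`startedAt`, `fileUrl`, `errorMessage`), and the helpers `statusFromFlags` and `describeAsset` are all hypothetical names chosen for the example; they are not taken from any audited system.

```typescript
// Hypothetical asset model: one status field, plus the data each state needs.
type Asset =
  | { status: "pending" }
  | { status: "generating"; startedAt: number }
  | { status: "ready"; fileUrl: string }
  | { status: "error"; errorMessage: string };

// Legacy shape using compound boolean flags (the BAD form).
// Three booleans allow 2^3 = 8 combinations, most of them contradictory.
interface LegacyFlags {
  isProcessing: boolean;
  isComplete: boolean;
  hasError: boolean;
}

// Map legacy flags to a single status, rejecting contradictory combinations.
function statusFromFlags(f: LegacyFlags): Asset["status"] {
  const setCount = [f.isProcessing, f.isComplete, f.hasError].filter(Boolean).length;
  if (setCount > 1) {
    throw new Error(`Contradictory flags: ${JSON.stringify(f)}`);
  }
  if (f.isProcessing) return "generating";
  if (f.isComplete) return "ready";
  if (f.hasError) return "error";
  return "pending";
}

// Exhaustive switch: adding a status to Asset without handling it here
// becomes a compile-time error via the `never` assignment.
function describeAsset(asset: Asset): string {
  switch (asset.status) {
    case "pending":
      return "waiting to start";
    case "generating":
      return `generating since ${asset.startedAt}`;
    case "ready":
      return `ready at ${asset.fileUrl}`;
    case "error":
      return `failed: ${asset.errorMessage}`;
    default: {
      const unreachable: never = asset;
      return unreachable;
    }
  }
}
```

With this shape, "ready with no file URL" cannot be constructed in typed code, so the audit can concentrate on the places where data crosses an untyped boundary (database rows, API payloads).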

Step 3: Map Valid Transitions

For each entity, create a transition matrix:

Job Status Transition Matrix:

FROM \ TO    | queued | processing | completed | failed | cancelled | retrying |
-------------|--------|------------|-----------|--------|-----------|----------|
(initial)    |   Y    |     N      |     N     |   N    |     N     |    N     |
queued       |   -    |     Y      |     N     |   Y    |     Y     |    N     |
processing   |   N    |     -      |     Y     |   Y    |     Y     |    Y     |
completed    |   N    |     N      |     -     |   N    |     N     |    N     |
failed       |   N    |     N      |     N     |   -    |     N     |    Y     |
cancelled    |   N    |     N      |     N     |   N    |     -     |    N     |
retrying     |   Y    |     N      |     N     |   Y    |     Y     |    -     |

Terminal states: completed, failed (after max retries), cancelled

Step 4: Verify Transition Enforcement

Check that invalid transitions are rejected at the code level:

[ ] Transition function validates current state before applying new state
[ ] Invalid transitions throw errors (not silently succeed)
[ ] State changes are atomic (DB transaction)
[ ] State change includes timestamp (state_changed_at or updated_at)
[ ] State change is logged (transition log table or audit log)
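
One possible way to check the first two items mechanically is to try every (from, to) pair against the transition function and compare the outcome with the allowed-transition table. The sketch below assumes the table from the `VALID_TRANSITIONS` map in this document's implementation template; `applyTransition` is a stand-in for whatever function the audited system actually exposes, and `findEnforcementGaps` is a hypothetical helper name.

```typescript
type JobStatus = "queued" | "processing" | "completed" | "failed" | "cancelled" | "retrying";

const ALL_STATES: JobStatus[] = [
  "queued", "processing", "completed", "failed", "cancelled", "retrying",
];

// Allowed transitions, copied from the VALID_TRANSITIONS map in the template.
const ALLOWED: Record<JobStatus, JobStatus[]> = {
  queued: ["processing", "failed", "cancelled"],
  processing: ["completed", "failed", "cancelled", "retrying"],
  completed: [],
  failed: ["retrying"],
  cancelled: [],
  retrying: ["queued", "failed", "cancelled"],
};

// Stand-in for the system under test: returns the new state or throws.
function applyTransition(current: JobStatus, next: JobStatus): JobStatus {
  if (!ALLOWED[current].includes(next)) {
    throw new Error(`Invalid transition: ${current} -> ${next}`);
  }
  return next;
}

// Try all 36 (from, to) pairs, including self-transitions, and report every
// pair where the observed behaviour disagrees with the table.
function findEnforcementGaps(
  apply: (from: JobStatus, to: JobStatus) => JobStatus
): string[] {
  const gaps: string[] = [];
  for (const from of ALL_STATES) {
    for (const to of ALL_STATES) {
      const expectedOk = ALLOWED[from].includes(to);
      let actualOk = true;
      try {
        apply(from, to);
      } catch {
        actualOk = false;
      }
      if (expectedOk !== actualOk) gaps.push(`${from} -> ${to}`);
    }
  }
  return gaps;
}
```

An empty result means the function under test rejects exactly the transitions the table forbids. A function that silently accepts everything would be reported for every forbidden pair, which is the "silently succeed" failure the checklist warns about.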

Concrete Test Cases

TEST-SM-001: Enumerate All States in Codebase

Objective: Verify that states used in code match the documented/intended states.

Steps:

  1. Search codebase for all status/state assignments and comparisons.
  2. Extract unique values.
  3. Compare against the documented state list.

# Find all state assignments
grep -rn "status.*=.*['\"]" --include="*.ts" --include="*.js" --include="*.py"
grep -rn "state.*=.*['\"]" --include="*.ts" --include="*.js" --include="*.py"

# Find all state comparisons
grep -rn "status.*===\|status.*==" --include="*.ts" --include="*.js"
grep -rn "status.*==\|state.*==" --include="*.py"

Pass Criteria:

  • Every status value found in code exists in the enum definition.
  • No "magic strings" used for status (all reference the enum).
  • No unused states in the enum (dead code).
  • State enum is the single source of truth.
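
The comparison behind these criteria is a two-way set difference. The sketch below shows it in TypeScript under simplifying assumptions: `extractStatusLiterals` uses a deliberately naive regular expression (it only sees lowercase literals next to an identifier ending in `status`), and both function names are hypothetical.

```typescript
// Result of comparing status literals found in code with the enum definition.
interface StateDiff {
  undefinedInEnum: string[]; // used in code but missing from the enum (FAIL)
  unusedInCode: string[];    // defined in the enum but never referenced (dead state)
}

function diffStates(defined: string[], used: string[]): StateDiff {
  const definedSet = new Set(defined);
  const usedSet = new Set(used);
  return {
    undefinedInEnum: Array.from(usedSet).filter((s) => !definedSet.has(s)).sort(),
    unusedInCode: Array.from(definedSet).filter((s) => !usedSet.has(s)).sort(),
  };
}

// Naive extraction: quoted lowercase literals that are assigned to, compared
// with, or used as the value of something named `status`.
function extractStatusLiterals(source: string): string[] {
  const pattern = /status\s*(?:===?|!==?|=|:)\s*['"]([a-z_]+)['"]/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    found.push(match[1]);
  }
  return found;
}
```

For example, a source file containing `job.status === "procesing"` would produce `procesing` under `undefinedInEnum` and, if nothing else references it, `processing` under `unusedInCode`. A real audit would more likely walk the AST than rely on a regular expression.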

TEST-SM-002: Terminal State Reachability

Objective: Verify that every non-terminal state can eventually reach a terminal state.

Steps:

  1. For each non-terminal state, trace all possible paths forward.
  2. Verify each path reaches a terminal state within finite steps.
  3. Verify timeout/fallback mechanisms for states that depend on external input.

Pass Criteria:

  • From "queued": can reach completed, failed, or cancelled.
  • From "processing": can reach completed, failed, cancelled, or retrying (which loops back).
  • From "retrying": can reach queued (re-enqueue), failed (max retries), or cancelled.
  • No state exists that can only transition to itself.
  • Timeout exists for every non-terminal state:
    • queued -> failed (if not picked up within X minutes)
    • processing -> failed (if no heartbeat within X minutes)
    • retrying -> failed (if max retries exceeded)
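
The reachability part of this test can be expressed as a graph traversal over the transition map. The sketch below walks backwards from the terminal states and reports any state that can never reach one. It assumes the job transitions listed in this document; `TransitionMap`, `statesReachingTerminal`, and `findTrappedStates` are hypothetical names.

```typescript
type TransitionMap = Record<string, string[]>;

// Job transitions as described in this document.
const JOB_TRANSITIONS: TransitionMap = {
  queued: ["processing", "failed", "cancelled"],
  processing: ["completed", "failed", "cancelled", "retrying"],
  completed: [],
  failed: ["retrying"],
  cancelled: [],
  retrying: ["queued", "failed", "cancelled"],
};

// Breadth-first walk over reversed edges, starting from the terminal states.
// Returns every state from which at least one terminal state is reachable.
function statesReachingTerminal(map: TransitionMap, terminals: string[]): Set<string> {
  const reverse: Record<string, string[]> = {};
  for (const from of Object.keys(map)) {
    for (const to of map[from]) {
      (reverse[to] = reverse[to] || []).push(from);
    }
  }
  const seen = new Set<string>(terminals);
  const queue = [...terminals];
  while (queue.length > 0) {
    const state = queue.shift() as string;
    for (const prev of reverse[state] || []) {
      if (!seen.has(prev)) {
        seen.add(prev);
        queue.push(prev);
      }
    }
  }
  return seen;
}

// States that exist in the map but can never reach a terminal state.
function findTrappedStates(map: TransitionMap, terminals: string[]): string[] {
  const ok = statesReachingTerminal(map, terminals);
  return Object.keys(map).filter((s) => !ok.has(s)).sort();
}
```

Calling `findTrappedStates(JOB_TRANSITIONS, ["completed", "failed", "cancelled"])` should return an empty list. Note that this only proves a path to a terminal state exists; it does not prove that the retry loop is bounded or that timeouts fire, so the timeout criteria above still need their own checks.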

Stuck Detection Query:

-- Find entities stuck in non-terminal states (PostgreSQL syntax)
-- Reports each non-terminal status group whose oldest row has not changed
-- for more than 600 seconds (10 minutes).
-- Note: entity_count covers all in-flight rows in the group, not only the stale ones.
SELECT entity_type, status, COUNT(*) AS entity_count,
       MIN(updated_at) AS oldest,
       MAX(EXTRACT(EPOCH FROM (NOW() - updated_at))) AS max_age_seconds
FROM (
  SELECT 'job' AS entity_type, status, updated_at FROM jobs
  WHERE status NOT IN ('completed', 'failed', 'cancelled')
  UNION ALL
  SELECT 'asset', status, updated_at FROM assets
  WHERE status NOT IN ('ready', 'error', 'deleted')
) stuck
GROUP BY entity_type, status
HAVING MAX(EXTRACT(EPOCH FROM (NOW() - updated_at))) > 600;
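
The query applies one threshold to every state. If different states need different limits (a job may reasonably stay in "processing" longer than in "queued"), the same check can be sketched in application code. The limits below are placeholder values, and `TrackedEntity`, `MAX_AGE_SECONDS`, and `findStuck` are hypothetical names.

```typescript
interface TrackedEntity {
  id: string;
  status: string;
  updatedAt: number; // epoch milliseconds of the last state change
}

// Placeholder per-state staleness limits in seconds.
// Terminal states have no entry and are never reported.
const MAX_AGE_SECONDS: { [state: string]: number | undefined } = {
  queued: 300,
  processing: 600,
  retrying: 120,
};

// Return entities that have stayed in a monitored state longer than its limit.
function findStuck(entities: TrackedEntity[], nowMs: number): TrackedEntity[] {
  return entities.filter((e) => {
    const limit = MAX_AGE_SECONDS[e.status];
    if (limit === undefined) return false; // terminal or unmonitored state
    return (nowMs - e.updatedAt) / 1000 > limit;
  });
}
```

A scheduled task could run this against in-flight rows and either alert or move the stuck entities to "failed", depending on the timeout policy chosen for each state.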

TEST-SM-003: Impossible State Prevention

Objective: Verify that contradictory states cannot exist in the database.

Test Cases:

| Impossible State | How to Attempt | Expected Result |
|------------------|----------------|-----------------|
| Job "completed" with no output | Delete output, try to mark complete | Transition rejected |
| Job "processing" with no worker assigned | Directly set status via API/DB | Validation error or worker assignment enforced |
| Asset "ready" with no file URL | Set status without file | Constraint violation |
| Two jobs "processing" for same entity | Start second job while first runs | Second blocked or first cancelled |
| Parent "completed" with children "processing" | Complete parent before children | Transition blocked until children are terminal |
| Entity "deleted" but still queryable | Soft-delete, then query | Excluded from normal queries |

Implementation Verification:

[ ] Database constraints enforce impossible-state rules
    - CHECK constraints on state + required fields
    - Unique partial indexes (e.g., one active job per entity)
[ ] Application-level validation before state change
[ ] Post-transition assertions (verify invariants after every state change)
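
The last item, post-transition assertions, could look roughly like the sketch below: a function that returns every invariant a record violates, plus a check for duplicate active jobs. The `JobRecord` shape and its field names (`outputUrl`, `workerId`, `entityId`) are assumptions made for the example, as are the helper names.

```typescript
interface JobRecord {
  id: string;
  entityId: string;
  status: "queued" | "processing" | "completed" | "failed" | "cancelled" | "retrying";
  outputUrl: string | null;
  workerId: string | null;
}

// Return a description of every invariant the record violates (empty = consistent).
function checkInvariants(job: JobRecord): string[] {
  const violations: string[] = [];
  if (job.status === "completed" && !job.outputUrl) {
    violations.push(`job ${job.id}: completed with no output`);
  }
  if (job.status === "processing" && !job.workerId) {
    violations.push(`job ${job.id}: processing with no worker assigned`);
  }
  return violations;
}

// Entities that have more than one job in "processing" at the same time.
function findDuplicateActive(jobs: JobRecord[]): string[] {
  const counts: Record<string, number> = {};
  for (const job of jobs) {
    if (job.status === "processing") {
      counts[job.entityId] = (counts[job.entityId] || 0) + 1;
    }
  }
  return Object.keys(counts).filter((id) => counts[id] > 1).sort();
}
```

Application-level checks like these complement database constraints rather than replace them: they give readable error messages, while the constraints still catch writes that bypass the application.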

TEST-SM-004: UI State Accuracy

Objective: Verify that the UI shows state that matches the backend.

Steps:

  1. For each state, put an entity into that state (via API or DB).
  2. Load the UI and verify the displayed status.
  3. Check that UI elements match:
    • Status badge/label
    • Available actions (buttons enabled/disabled)
    • Progress indicators
    • Error messages

State-to-UI Mapping:

| Backend State | UI Label | Actions Available | Progress | Color |
|---------------|----------|-------------------|----------|-------|
| queued | "Waiting..." | Cancel | Indeterminate spinner | Gray |
| processing | "Generating..." | Cancel | Progress bar (if known) | Blue |
| completed | "Ready" | View, Download, Regenerate | Hidden | Green |
| failed | "Failed" | Retry, View Error | Hidden | Red |
| cancelled | "Cancelled" | Restart | Hidden | Gray |
| partial | "Partially Complete" | Retry Failed, View Completed | X/Y progress | Orange |

Pass Criteria:

  • Every backend state has a corresponding UI representation.
  • The UI shows no status that does not exist in the backend (a UI-only "loading" indicator is not a backend state and must not be presented as one).
  • Actions are correctly gated by state (no "Download" button on failed items).
  • Real-time or polling updates move UI state forward without manual refresh.
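
The first and third criteria lend themselves to a compile-time check. In the sketch below, the State-to-UI mapping is typed as a `Record` over the backend states, so a state without a UI entry fails to compile, and action gating becomes a lookup. The type and constant names (`BackendState`, `Action`, `UI_MAP`, `isActionAllowed`) are hypothetical; the values come from the mapping table in this section.

```typescript
type BackendState = "queued" | "processing" | "completed" | "failed" | "cancelled" | "partial";

type Action =
  | "Cancel" | "View" | "Download" | "Regenerate" | "Retry"
  | "View Error" | "Restart" | "Retry Failed" | "View Completed";

interface UiPresentation {
  label: string;
  actions: Action[];
  progress: string;
  color: string;
}

// Record<BackendState, ...> requires an entry for every backend state:
// adding a state to the union without a UI entry is a compile-time error.
const UI_MAP: Record<BackendState, UiPresentation> = {
  queued: { label: "Waiting...", actions: ["Cancel"], progress: "indeterminate", color: "gray" },
  processing: { label: "Generating...", actions: ["Cancel"], progress: "bar", color: "blue" },
  completed: { label: "Ready", actions: ["View", "Download", "Regenerate"], progress: "hidden", color: "green" },
  failed: { label: "Failed", actions: ["Retry", "View Error"], progress: "hidden", color: "red" },
  cancelled: { label: "Cancelled", actions: ["Restart"], progress: "hidden", color: "gray" },
  partial: { label: "Partially Complete", actions: ["Retry Failed", "View Completed"], progress: "x-of-y", color: "orange" },
};

// Gate an action by state: the UI should render a button only when this is true.
function isActionAllowed(state: BackendState, action: Action): boolean {
  return UI_MAP[state].actions.includes(action);
}
```

Driving both rendering and tests from one mapping keeps the table above and the UI from drifting apart; the visual inspection in the steps is still needed to confirm the components actually use it.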

TEST-SM-005: State Transition Logging

Objective: Verify that state transitions are recorded for debugging and auditing.

Steps:

  1. Walk an entity through its complete lifecycle.
  2. Check that each transition is recorded.

Expected Log:

| Timestamp           | Entity  | Entity ID | From       | To         | Trigger    | Actor    |
|---------------------|---------|-----------|------------|------------|------------|----------|
| 2024-01-15 10:00:00 | job     | job_123   | (created)  | queued     | user_submit| user_456 |
| 2024-01-15 10:00:05 | job     | job_123   | queued     | processing | worker_pick| worker_1 |
| 2024-01-15 10:01:30 | job     | job_123   | processing | failed     | provider_error | worker_1 |
| 2024-01-15 10:01:35 | job     | job_123   | failed     | retrying   | auto_retry | system   |
| 2024-01-15 10:01:40 | job     | job_123   | retrying   | queued     | re_enqueue | system   |
| 2024-01-15 10:01:45 | job     | job_123   | queued     | processing | worker_pick| worker_2 |
| 2024-01-15 10:02:50 | job     | job_123   | processing | completed  | success    | worker_2 |

Pass Criteria:

  • Every transition is logged with: timestamp, from, to, trigger, actor.
  • Transition log is queryable per entity (reconstruct full lifecycle).
  • No state changes happen without a log entry.
  • Log entries are immutable (append-only).
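
A per-entity log can be checked for gaps by verifying that each entry starts where the previous one ended. The sketch below assumes a log entry shape based on the columns of the expected log, with ISO 8601 timestamps and `from = null` for the creation entry; `TransitionLogEntry` and `findLogGaps` are hypothetical names.

```typescript
interface TransitionLogEntry {
  timestamp: string;   // ISO 8601, e.g. "2024-01-15T10:00:00Z"
  entityId: string;
  from: string | null; // null for the creation entry
  to: string;
  trigger: string;
  actor: string;
}

// Check one entity's log, ordered oldest first. Each entry must start where
// the previous one ended, and timestamps must not go backwards.
function findLogGaps(entries: TransitionLogEntry[]): string[] {
  const problems: string[] = [];
  for (let i = 0; i < entries.length; i++) {
    const entry = entries[i];
    if (!entry.trigger || !entry.actor) {
      problems.push(`entry ${i}: missing trigger or actor`);
    }
    if (i === 0) {
      if (entry.from !== null) {
        problems.push("entry 0: first entry should record creation (from = null)");
      }
      continue;
    }
    const prev = entries[i - 1];
    if (entry.from !== prev.to) {
      problems.push(`entry ${i}: from "${entry.from}" does not match previous to "${prev.to}"`);
    }
    if (Date.parse(entry.timestamp) < Date.parse(prev.timestamp)) {
      problems.push(`entry ${i}: timestamp is earlier than the previous entry`);
    }
  }
  return problems;
}
```

A mismatch between one entry's `from` and the previous entry's `to` indicates a state change that bypassed the log, which is exactly what the third pass criterion forbids.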

TEST-SM-006: Child State Consistency

Objective: Verify that parent and child entity states remain consistent.

Steps:

  1. Cancel a parent entity (project/batch) while children are processing.
  2. Verify all children are moved to a terminal state.
  3. Start processing on a parent. Complete some children, fail others.
  4. Verify parent state reflects child summary.

Rules:

Parent state derivation:
- All children pending     -> parent: pending
- Any child processing     -> parent: processing
- All children completed   -> parent: completed
- All children terminal, any failed -> parent: partial (or failed if all failed)
- Parent cancelled         -> all non-terminal children: cancelled

[ ] Parent state is derived from children (not independently set)
[ ] Parent cancellation cascades to children
[ ] Child completion triggers parent state recalculation
[ ] No orphaned children (parent deleted but children remain)
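
The derivation rules can be sketched as a pure function over the child states. The rules in this section do not cover every combination, so the sketch fills the gaps with labeled assumptions: an empty batch is "pending", an all-cancelled batch is "cancelled", any other all-terminal mix is "partial", and any batch with a non-terminal child that is not all-pending is "processing". `deriveParentState` and `cascadeCancel` are hypothetical names.

```typescript
type ChildState = "pending" | "processing" | "completed" | "failed" | "cancelled";
type ParentState = "pending" | "processing" | "completed" | "partial" | "failed" | "cancelled";

const TERMINAL_CHILD_STATES: ChildState[] = ["completed", "failed", "cancelled"];

// Derive the parent state from its children.
function deriveParentState(children: ChildState[]): ParentState {
  if (children.length === 0) return "pending"; // assumption: empty batch has not started
  if (children.every((c) => c === "pending")) return "pending";
  if (children.every((c) => c === "completed")) return "completed";
  if (children.every((c) => c === "failed")) return "failed";
  if (children.every((c) => c === "cancelled")) return "cancelled"; // assumption
  if (children.every((c) => TERMINAL_CHILD_STATES.includes(c))) {
    return "partial"; // all terminal but mixed outcomes (assumption for mixes without a failure)
  }
  return "processing"; // at least one child is still pending or processing
}

// Parent cancellation: every non-terminal child becomes cancelled.
function cascadeCancel(children: ChildState[]): ChildState[] {
  return children.map((c) => (TERMINAL_CHILD_STATES.includes(c) ? c : "cancelled"));
}
```

Under these rules a batch with 18 completed and 2 failed children derives to "partial" rather than "failed", which addresses the missing "partial success" pattern from the risk table.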

State Bug vs Code Bug: How to Distinguish

| Situation | State Bug? | Code Bug? | How to Tell |
|-----------|------------|-----------|-------------|
| Job stuck in "processing" forever | YES | Maybe | Check: is there a timeout? Is the worker alive? |
| Job shows "completed" but output is wrong | NO | YES | State is correct; the processing logic has a defect |
| Job shows "failed" but output actually exists | YES | Maybe | State transition happened before output check |
| UI shows "processing" but backend says "completed" | YES (UI) | NO | UI polling/websocket is broken; backend is correct |
| Two jobs both "processing" for same entity | YES | YES | Missing uniqueness constraint AND missing lock |
| Entity has status "procesing" (typo) | YES | YES | String-based states without enum validation |

State Machine Implementation Template

// Type-safe state machine with transition validation
enum JobStatus {
  Queued = 'queued',
  Processing = 'processing',
  Completed = 'completed',
  Failed = 'failed',
  Cancelled = 'cancelled',
  Retrying = 'retrying',
}

const VALID_TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  [JobStatus.Queued]:     [JobStatus.Processing, JobStatus.Failed, JobStatus.Cancelled],
  [JobStatus.Processing]: [JobStatus.Completed, JobStatus.Failed, JobStatus.Cancelled, JobStatus.Retrying],
  [JobStatus.Completed]:  [],  // terminal
  [JobStatus.Failed]:     [JobStatus.Retrying],
  [JobStatus.Cancelled]:  [],  // terminal
  [JobStatus.Retrying]:   [JobStatus.Queued, JobStatus.Failed, JobStatus.Cancelled],
};

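// Declared async because it awaits the database update and the log write
async function transitionJob(job: Job, newStatus: JobStatus): Promise<void> {
  const allowed = VALID_TRANSITIONS[job.status];
  if (!allowed.includes(newStatus)) {
    throw new InvalidTransitionError(
      `Cannot transition job ${job.id} from ${job.status} to ${newStatus}`
    );
  }
  // Update with optimistic lock: the write only matches if the row is still in the expected state
  const updated = await db.jobs.update({
    where: { id: job.id, status: job.status },  // optimistic lock
    data: { status: newStatus, updated_at: new Date() },
  });
  if (!updated) throw new ConcurrentModificationError();
  // Log after the update succeeds, so the log never records a transition that did not happen.
  // Ideally the update and the log write share one DB transaction.
  await logTransition(job.id, job.status, newStatus);
}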

Post-Audit Checklist

[ ] All stateful entities identified and documented
[ ] States defined as enums (not strings)
[ ] Valid transitions defined in code (transition matrix)
[ ] Invalid transitions rejected with errors
[ ] Terminal states identified; every state can reach one
[ ] Timeout detection for non-terminal states
[ ] Parent-child state consistency enforced
[ ] UI accurately reflects backend state for all states
[ ] State transitions logged with timestamp, from, to, trigger, actor
[ ] Impossible states prevented by DB constraints and app logic
[ ] No compound boolean states; single status field per entity
[ ] Stuck entity alerting configured

What Earlier Audits Miss

Standard testing verifies happy-path state transitions. This audit matters because:

  • Unit tests validate individual transitions but miss reachability gaps (states that can be entered but never exited).
  • Integration tests rarely test impossible state combinations (e.g., "completed" with zero outputs).
  • Code reviews catch individual state assignments but miss that new states were added without updating the transition matrix.
  • QA testing follows scripted flows and never puts the system into states caused by worker crashes, network partitions, or partial failures.
  • Monitoring tracks error rates but not stuck-state accumulation. Jobs silently pile up in "processing" without anyone noticing.

A State Machine Audit closes these gaps: it tests whether all entity states are explicitly defined, transitions are validated, and every non-terminal state can reach a terminal state under normal operation, failure conditions, and concurrent access.


Extended Transition Matrix Examples

Asset Status Transitions

FROM \ TO      | pending | generating | ready | error | deleted | regenerating |
---------------|---------|------------|-------|-------|---------|--------------|
(initial)      |    Y    |     N      |   N   |   N   |    N    |      N       |
pending        |    -    |     Y      |   N   |   Y   |    Y    |      N       |
generating     |    N    |     -      |   Y   |   Y   |    N    |      N       |
ready          |    N    |     N      |   -   |   N   |    Y    |      Y       |
error          |    N    |     Y      |   N   |   -   |    Y    |      N       |
deleted        |    N    |     N      |   N   |   N   |    -    |      N       |
regenerating   |    N    |     N      |   Y   |   Y   |    N    |      -       |

Terminal states: deleted
Resettable terminal: error (can retry -> generating)

Payment Status Transitions

FROM \ TO      | pending | authorized | captured | refunded | failed | disputed |
---------------|---------|------------|----------|----------|--------|----------|
(initial)      |    Y    |     N      |    N     |    N     |   N    |    N     |
pending        |    -    |     Y      |    N     |    N     |   Y    |    N     |
authorized     |    N    |     -      |    Y     |    Y     |   Y    |    N     |
captured       |    N    |     N      |    -     |    Y     |   N    |    Y     |
refunded       |    N    |     N      |    N     |    -     |   N    |    N     |
failed         |    N    |     N      |    N     |    N     |   -    |    N     |
disputed       |    N    |     N      |    Y     |    Y     |   N    |    -     |

Terminal states: refunded, failed

Automation Opportunities

| Test | Automatable? | Method |
|------|--------------|--------|
| TEST-SM-001: Enumerate states | YES | Static analysis: grep for status assignments, compare against enum |
| TEST-SM-002: Terminal reachability | YES | Graph traversal test on transition matrix; assert no dead-end non-terminal states |
| TEST-SM-003: Impossible states | PARTIAL | DB constraint validation + scheduled SQL integrity checks |
| TEST-SM-004: UI state accuracy | MANUAL | Visual inspection of each state rendering |
| TEST-SM-005: Transition logging | YES | Integration test: walk lifecycle, assert all transitions logged |
| TEST-SM-006: Child consistency | YES | Integration test: cancel parent, assert children terminal |

# Automated state enumeration check
# Find all unique status values used in code
grep -rohn "status.*=.*['\"]\\([a-z_]*\\)['\"]" src/ | \
  sed "s/.*['\"]\\([a-z_]*\\)['\"].*/\\1/" | sort -u > used_states.txt

# Compare against defined enum
grep -o "[A-Za-z]* = '[a-z_]*'" src/types/status.ts | \
  sed "s/.* = '\\(.*\\)'/\\1/" | sort -u > defined_states.txt

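diff defined_states.txt used_states.txt
# Lines prefixed with ">" are used in code but not defined in the enum = undocumented state (FAIL)
# Lines prefixed with "<" are defined in the enum but never used = possibly dead state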

Reusable Audit Report Template

# State Machine Audit Report

## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________

## Stateful Entities Identified
| Entity | State Field | States Found | States Documented | Enum? |
|--------|------------|-------------|-------------------|-------|
| Job | status | ___ | ___ | [ ] |
| Asset | status | ___ | ___ | [ ] |

## Transition Matrix Verification
| Entity | Valid Transitions Defined? | Enforced in Code? | Logged? |
|--------|--------------------------|-------------------|---------|
| Job | PASS/FAIL | PASS/FAIL | PASS/FAIL |
| Asset | PASS/FAIL | PASS/FAIL | PASS/FAIL |

## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-SM-001 | State enumeration | PASS/FAIL | ___ states found vs ___ defined |
| TEST-SM-002 | Terminal reachability | PASS/FAIL | Unreachable terminals: ___ |
| TEST-SM-003 | Impossible states | PASS/FAIL | ___ impossible states achievable |
| TEST-SM-004 | UI accuracy | PASS/FAIL | ___ states misrepresented |
| TEST-SM-005 | Transition logging | PASS/FAIL | ___ transitions unlogged |
| TEST-SM-006 | Child consistency | PASS/FAIL | Orphaned children after cancel: ___ |

## Score: PASS / PARTIAL / FAIL

Priority Targeting

Run this audit FIRST if:

  • Operators regularly fix stuck jobs by updating the database directly
  • The UI shows contradictory information ("Completed" with no output)
  • Status values are stored as strings without validation
  • New states have been added without updating transition rules
  • Parent entities can be in a state that contradicts their children
