# State Machine Audit
Verify that workflow states are explicitly defined, transitions are validated, impossible states are prevented, and the system never gets stuck in an unrecoverable state. State management bugs are insidious: they cause jobs to hang forever, UIs to show contradictory information, and operators to resort to manual database edits.
## Purpose
This audit ensures that every entity with a lifecycle has a well-defined, enforceable state machine.
## Scope
| Area | What We Test |
|---|---|
| State definition | Are all possible states explicitly enumerated? |
| Transitions | Are valid transitions defined and enforced? |
| Terminal states | Can every workflow reach a terminal state? |
| Impossible states | Can the system enter contradictory states? |
| Timeout / stuck detection | Are non-terminal states monitored for staleness? |
| UI accuracy | Does the UI accurately reflect backend state? |
| Error states | Are failures distinguishable and recoverable? |
## Risk Pattern Table
| Pattern | What It Hits | Risk | Symptom |
|---|---|---|---|
| Stringly-typed states | Code quality | HIGH | Typo in status string causes silent bug; "procesing" != "processing" |
| No transition validation | Data integrity | HIGH | Job goes from "completed" back to "processing" |
| Missing terminal state reachability | Reliability | CRITICAL | Job stuck in "processing" forever; no timeout, no cleanup |
| Contradictory compound state | UI, Logic | HIGH | Asset is "completed" but has no output file |
| Implicit states | Maintainability | MEDIUM | State derived from multiple fields; hard to reason about |
| Missing "partial success" | UX, Data | HIGH | 18/20 items succeed but batch marked "failed" |
| No distinction between "pending" and "stuck" | Operations | HIGH | Cannot tell if job hasn't started vs is hung |
| UI shows stale state | UX | MEDIUM | User sees "processing" but job finished minutes ago |
| State updated without transition log | Auditability | MEDIUM | Cannot reconstruct how entity reached current state |
| Orphaned child states | Data integrity | HIGH | Parent cancelled but children still processing |
## Enumeration Methodology
### Step 1: Identify All Stateful Entities
List every entity in the system that has a lifecycle (status, state, phase):
| Entity | State Field | Storage | Current States |
|--------|------------|---------|---------------|
| Job | status | DB: jobs.status | queued, processing, completed, failed |
| Asset | status | DB: assets.status | pending, generating, ready, error |
| Project | status | DB: projects.status | draft, active, archived |
| Payment | status | DB: payments.status | pending, charged, refunded, failed |
| Upload | status | DB: uploads.status | uploading, processing, ready, failed |
| Invitation | status | DB: invitations.status | sent, accepted, expired, revoked |
### Step 2: Verify State Definition Quality
For each entity, check:
```
[ ] States are defined as an enum (not free-form strings)
[ ] Every state is documented with its meaning
[ ] States are mutually exclusive (entity is in exactly one state)
[ ] No "compound states" derived from multiple boolean fields
    BAD:  is_processing=true, is_complete=false, has_error=false
    GOOD: status = 'processing' (single field, enum value)
```
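To make the BAD/GOOD contrast concrete, here is a minimal sketch. It assumes a hypothetical legacy record carrying the three boolean flags shown above and maps it onto a single status value, returning `null` for flag combinations that have no meaning. The names `LegacyJobFlags`, `EntityStatus`, and `statusFromFlags` are illustrative only.

```typescript
// Hypothetical legacy shape: three booleans allow 2^3 = 8 combinations,
// but only some of them describe a real lifecycle state.
interface LegacyJobFlags {
  is_processing: boolean;
  is_complete: boolean;
  has_error: boolean;
}

// Single source of truth: exactly one value at a time.
type EntityStatus = "queued" | "processing" | "completed" | "failed";

// Maps flags to a status, or returns null for contradictory combinations.
function statusFromFlags(f: LegacyJobFlags): EntityStatus | null {
  const flagsSet = [f.is_processing, f.is_complete, f.has_error].filter(Boolean).length;
  if (flagsSet > 1) return null; // e.g. processing AND complete: impossible state
  if (f.is_processing) return "processing";
  if (f.is_complete) return "completed";
  if (f.has_error) return "failed";
  return "queued"; // assumption: no flags set means the job has not started
}

// Counts how many of the 8 flag combinations are contradictory.
function countContradictoryCombinations(): number {
  let bad = 0;
  for (const p of [false, true]) {
    for (const c of [false, true]) {
      for (const e of [false, true]) {
        const result = statusFromFlags({ is_processing: p, is_complete: c, has_error: e });
        if (result === null) bad++;
      }
    }
  }
  return bad;
}
```

Under these assumptions, four of the eight flag combinations are contradictory. A single enum field makes those combinations unrepresentable instead of relying on every writer to avoid them.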
### Step 3: Map Valid Transitions
For each entity, create a transition matrix:
```
Job Status Transition Matrix:

FROM \ TO    | queued | processing | completed | failed | cancelled | retrying |
-------------|--------|------------|-----------|--------|-----------|----------|
(initial)    |   Y    |     N      |     N     |   N    |     N     |    N     |
queued       |   -    |     Y      |     N     |   Y    |     Y     |    N     |
processing   |   N    |     -      |     Y     |   Y    |     Y     |    Y     |
completed    |   N    |     N      |     -     |   N    |     N     |    N     |
failed       |   N    |     N      |     N     |   -    |     N     |    Y     |
cancelled    |   N    |     N      |     N     |   N    |     -     |    N     |
retrying     |   Y    |     N      |     N     |   Y    |     Y     |    -     |

Terminal states: completed, failed (after max retries), cancelled
```
### Step 4: Verify Transition Enforcement
Check that invalid transitions are rejected at the code level:
```
[ ] Transition function validates current state before applying new state
[ ] Invalid transitions throw errors (not silently succeed)
[ ] State changes are atomic (DB transaction)
[ ] State change includes timestamp (state_changed_at or updated_at)
[ ] State change is logged (transition log table or audit log)
```
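The first two checks can be exercised exhaustively instead of by spot check. The sketch below tries every (from, to) pair against a guard function and reports pairs where the guard disagrees with the table. It uses a simplified transition table without a retry loop, assumed purely for illustration; `applyTransition` and `findEnforcementGaps` are hypothetical names.

```typescript
type Step4Status = "queued" | "processing" | "completed" | "failed" | "cancelled";

const STEP4_STATES: Step4Status[] = ["queued", "processing", "completed", "failed", "cancelled"];

// Simplified transition table assumed for illustration (no retry loop).
const STEP4_ALLOWED: Record<Step4Status, Step4Status[]> = {
  queued: ["processing", "failed", "cancelled"],
  processing: ["completed", "failed", "cancelled"],
  completed: [],
  failed: [],
  cancelled: [],
};

// Guard under test: throws on any transition not listed in the table.
function applyTransition(current: Step4Status, next: Step4Status): Step4Status {
  if (!STEP4_ALLOWED[current].includes(next)) {
    throw new Error(`Invalid transition: ${current} -> ${next}`);
  }
  return next;
}

// Tries every (from, to) pair and reports pairs where the guard disagrees with
// the table: an allowed pair that throws, or a forbidden pair that silently succeeds.
function findEnforcementGaps(
  guard: (from: Step4Status, to: Step4Status) => Step4Status
): string[] {
  const gaps: string[] = [];
  for (const from of STEP4_STATES) {
    for (const to of STEP4_STATES) {
      const expected = STEP4_ALLOWED[from].includes(to);
      let accepted = true;
      try {
        guard(from, to);
      } catch {
        accepted = false;
      }
      if (accepted !== expected) gaps.push(`${from} -> ${to}`);
    }
  }
  return gaps;
}
```

Running `findEnforcementGaps(applyTransition)` returns an empty list for the strict guard. A permissive guard that accepts everything would be reported once for each forbidden pair, self-transitions included.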
## Concrete Test Cases
### TEST-SM-001: Enumerate All States in Codebase
Objective: Verify that states used in code match the documented/intended states.
Steps:
1. Search codebase for all status/state assignments and comparisons.
2. Extract unique values.
3. Compare against the documented state list.
```bash
# Find all state assignments
grep -rn "status.*=.*['\"]" --include="*.ts" --include="*.js" --include="*.py"
grep -rn "state.*=.*['\"]" --include="*.ts" --include="*.js" --include="*.py"

# Find all state comparisons
grep -rn "status.*===\|status.*==" --include="*.ts" --include="*.js"
grep -rn "status.*==\|state.*==" --include="*.py"
```
Pass Criteria:
- Every status value found in code exists in the enum definition.
- No "magic strings" used for status (all reference the enum).
- No unused states in the enum (dead code).
- State enum is the single source of truth.
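Once the values are extracted, the first and third pass criteria reduce to a two-way set difference. The following is a small sketch of that comparison; `diffStates` and the sample inputs are hypothetical, standing in for whatever the grep commands above produce.

```typescript
// Result of comparing status literals found in code against the enum's values.
interface StateDiff {
  undefinedInEnum: string[]; // used in code but missing from the enum (FAIL)
  unusedInCode: string[];    // declared in the enum but never referenced
}

function diffStates(defined: string[], used: string[]): StateDiff {
  const definedSet = new Set(defined);
  const usedSet = new Set(used);
  return {
    // indexOf(...) === i keeps only the first occurrence of each value
    undefinedInEnum: used
      .filter((s, i) => !definedSet.has(s) && used.indexOf(s) === i)
      .sort(),
    unusedInCode: defined
      .filter((s, i) => !usedSet.has(s) && defined.indexOf(s) === i)
      .sort(),
  };
}

// Sample input: values an auditor might have extracted from the codebase.
const exampleDiff = diffStates(
  ["queued", "processing", "completed", "failed", "cancelled", "retrying"],
  ["queued", "processing", "procesing", "completed", "failed"]
);
// exampleDiff.undefinedInEnum -> ["procesing"]
// exampleDiff.unusedInCode    -> ["cancelled", "retrying"]
```

A non-empty `undefinedInEnum` list fails the test outright. A non-empty `unusedInCode` list is worth a manual look, since the grep patterns may simply have missed references made through the enum.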
### TEST-SM-002: Terminal State Reachability
Objective: Verify that every non-terminal state can eventually reach a terminal state.
Steps:
1. For each non-terminal state, trace all possible paths forward.
2. Verify each path reaches a terminal state within finite steps.
3. Verify timeout/fallback mechanisms for states that depend on external input.
Pass Criteria:
- From "queued": can reach completed, failed, or cancelled.
- From "processing": can reach completed, failed, cancelled, or retrying (which loops back).
- From "retrying": can reach queued (re-enqueue), failed (max retries), or cancelled.
- No state exists that can only transition to itself.
- Timeout exists for every non-terminal state:
  - queued -> failed (if not picked up within X minutes)
  - processing -> failed (if no heartbeat within X minutes)
  - retrying -> failed (if max retries exceeded)
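The reachability criteria above can be checked mechanically with a graph search over the transition table. This is a minimal sketch, assuming the table is available as an adjacency list; `findTrappedStates` is a hypothetical helper and `jobGraph` restates the job transitions listed in the pass criteria.

```typescript
type TransitionGraph = Record<string, string[]>;

// Returns the non-terminal states from which no terminal state is reachable.
function findTrappedStates(graph: TransitionGraph, terminals: string[]): string[] {
  const trapped: string[] = [];
  for (const start of Object.keys(graph)) {
    if (terminals.includes(start)) continue;
    // Depth-first search over outgoing transitions.
    const seen = new Set<string>([start]);
    const stack = [start];
    let reachesTerminal = false;
    while (stack.length > 0 && !reachesTerminal) {
      const current = stack.pop() as string;
      for (const next of graph[current] ?? []) {
        if (terminals.includes(next)) {
          reachesTerminal = true;
          break;
        }
        if (!seen.has(next)) {
          seen.add(next);
          stack.push(next);
        }
      }
    }
    if (!reachesTerminal) trapped.push(start);
  }
  return trapped;
}

// Job transitions as described in the pass criteria.
const jobGraph: TransitionGraph = {
  queued: ["processing", "failed", "cancelled"],
  processing: ["completed", "failed", "cancelled", "retrying"],
  retrying: ["queued", "failed", "cancelled"],
  completed: [],
  failed: ["retrying"],
  cancelled: [],
};
const jobTerminals = ["completed", "failed", "cancelled"];
```

`findTrappedStates(jobGraph, jobTerminals)` returns an empty list. Adding a state such as `paused: ["paused"]` would be reported. Note that reachability only shows an exit exists; the timeouts listed above are still needed to guarantee the exit is eventually taken.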
Stuck Detection Query:
```sql
-- Find entities stuck in non-terminal states
SELECT entity_type, status, COUNT(*),
       MIN(updated_at) as oldest,
       MAX(EXTRACT(EPOCH FROM (NOW() - updated_at))) as max_age_seconds
FROM (
  SELECT 'job' as entity_type, status, updated_at FROM jobs
  WHERE status NOT IN ('completed', 'failed', 'cancelled')
  UNION ALL
  SELECT 'asset', status, updated_at FROM assets
  WHERE status NOT IN ('ready', 'error', 'deleted')
) stuck
GROUP BY entity_type, status
HAVING MAX(EXTRACT(EPOCH FROM (NOW() - updated_at))) > 600;
```
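The same staleness rule can also run in application code, for example inside a scheduled cleanup task. The sketch below is one possible shape: the `TrackedEntity` interface, the `findStuckEntities` helper, and the per-state limits are all assumptions, with 600 seconds chosen only to mirror the threshold in the query above.

```typescript
interface TrackedEntity {
  id: string;
  status: string;
  updatedAt: Date;
}

// Hypothetical per-state limits in seconds; real values depend on the workload.
const STALE_AFTER_SECONDS: Record<string, number> = {
  queued: 600,
  processing: 600,
  retrying: 600,
};

// Returns entities whose non-terminal state is older than its limit.
// States without a configured limit are treated as terminal and skipped.
function findStuckEntities(entities: TrackedEntity[], now: Date): TrackedEntity[] {
  return entities.filter((e) => {
    const limit = STALE_AFTER_SECONDS[e.status];
    if (limit === undefined) return false;
    const ageSeconds = (now.getTime() - e.updatedAt.getTime()) / 1000;
    return ageSeconds > limit;
  });
}
```

Passing `now` as a parameter instead of calling `new Date()` inside the function keeps the check deterministic, which makes it straightforward to cover with tests.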
### TEST-SM-003: Impossible State Prevention
Objective: Verify that contradictory states cannot exist in the database.
Test Cases:
| Impossible State | How to Attempt | Expected Result |
|---|---|---|
| Job "completed" with no output | Delete output, try to mark complete | Transition rejected |
| Job "processing" with no worker assigned | Directly set status via API/DB | Validation error or worker assignment enforced |
| Asset "ready" with no file URL | Set status without file | Constraint violation |
| Two jobs "processing" for same entity | Start second job while first runs | Second blocked or first cancelled |
| Parent "completed" with children "processing" | Complete parent before children | Transition blocked until children are terminal |
| Entity "deleted" but still queryable | Soft-delete, then query | Excluded from normal queries |
Implementation Verification:
```
[ ] Database constraints enforce impossible-state rules
    - CHECK constraints on state + required fields
    - Unique partial indexes (e.g., one active job per entity)
[ ] Application-level validation before state change
[ ] Post-transition assertions (verify invariants after every state change)
```
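A post-transition assertion can be as simple as a function that scans records and returns every violated rule. The sketch below covers three rows of the table above; the `JobRecord` shape and its field names are assumptions made for illustration and do not describe any particular schema.

```typescript
// Hypothetical job shape; field names are assumptions.
interface JobRecord {
  id: string;
  entityId: string;
  status: "queued" | "processing" | "completed" | "failed" | "cancelled";
  workerId: string | null;
  outputUrl: string | null;
}

// Returns a description of every violated invariant; an empty array means consistent.
function findInvariantViolations(jobs: JobRecord[]): string[] {
  const violations: string[] = [];
  const activePerEntity: Record<string, number> = {};
  for (const job of jobs) {
    if (job.status === "completed" && !job.outputUrl) {
      violations.push(`${job.id}: completed without output`);
    }
    if (job.status === "processing" && !job.workerId) {
      violations.push(`${job.id}: processing without worker`);
    }
    if (job.status === "processing") {
      activePerEntity[job.entityId] = (activePerEntity[job.entityId] ?? 0) + 1;
    }
  }
  // At most one processing job per entity.
  for (const entityId of Object.keys(activePerEntity)) {
    if (activePerEntity[entityId] > 1) {
      violations.push(`${entityId}: ${activePerEntity[entityId]} jobs processing at once`);
    }
  }
  return violations;
}
```

Application-level checks like this complement database constraints but do not replace them, since only the database sees writes that bypass the application.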
### TEST-SM-004: UI State Accuracy
Objective: Verify that the UI shows state that matches the backend.
Steps:
1. For each state, put an entity into that state (via API or DB).
2. Load the UI and verify the displayed status.
3. Check that UI elements match:
   - Status badge/label
   - Available actions (buttons enabled/disabled)
   - Progress indicators
   - Error messages
State-to-UI Mapping:
| Backend State | UI Label | Actions Available | Progress | Color |
|---|---|---|---|---|
| queued | "Waiting..." | Cancel | Indeterminate spinner | Gray |
| processing | "Generating..." | Cancel | Progress bar (if known) | Blue |
| completed | "Ready" | View, Download, Regenerate | Hidden | Green |
| failed | "Failed" | Retry, View Error | Hidden | Red |
| cancelled | "Cancelled" | Restart | Hidden | Gray |
| partial | "Partially Complete" | Retry Failed, View Completed | X/Y progress | Orange |
Pass Criteria:
- Every backend state has a corresponding UI representation.
- The UI never shows an entity state that does not exist in the backend (a UI-only "loading" indicator is a rendering detail, not an entity state).
- Actions are correctly gated by state (no "Download" button on failed items).
- Real-time or polling updates move UI state forward without manual refresh.
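One way to enforce the first and third criteria at build time is to key the UI mapping by the backend state type, so a missing entry becomes a compile error. The sketch below copies the labels, actions, and colors from the mapping table above; the `BackendState`, `UiPresentation`, and `isActionAllowed` names are assumptions.

```typescript
type BackendState = "queued" | "processing" | "completed" | "failed" | "cancelled" | "partial";

interface UiPresentation {
  label: string;
  actions: string[];
  color: string;
}

// Record<BackendState, ...> makes the compiler reject a missing or unknown key,
// so adding a backend state without a UI entry fails the build.
const UI_BY_STATE: Record<BackendState, UiPresentation> = {
  queued: { label: "Waiting...", actions: ["Cancel"], color: "Gray" },
  processing: { label: "Generating...", actions: ["Cancel"], color: "Blue" },
  completed: { label: "Ready", actions: ["View", "Download", "Regenerate"], color: "Green" },
  failed: { label: "Failed", actions: ["Retry", "View Error"], color: "Red" },
  cancelled: { label: "Cancelled", actions: ["Restart"], color: "Gray" },
  partial: { label: "Partially Complete", actions: ["Retry Failed", "View Completed"], color: "Orange" },
};

// Action gating: a button is shown only if the current state lists it.
function isActionAllowed(state: BackendState, action: string): boolean {
  return UI_BY_STATE[state].actions.includes(action);
}
```

With this shape, `isActionAllowed("failed", "Download")` is false and `isActionAllowed("completed", "Download")` is true, matching the gating rule in the pass criteria.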
### TEST-SM-005: State Transition Logging
Objective: Verify that state transitions are recorded for debugging and auditing.
Steps:
1. Walk an entity through its complete lifecycle.
2. Check that each transition is recorded.
Expected Log:
| Timestamp | Entity | Entity ID | From | To | Trigger | Actor |
|---------------------|---------|-----------|------------|------------|------------|----------|
| 2024-01-15 10:00:00 | job | job_123 | (created) | queued | user_submit| user_456 |
| 2024-01-15 10:00:05 | job | job_123 | queued | processing | worker_pick| worker_1 |
| 2024-01-15 10:01:30 | job | job_123 | processing | failed | provider_error | worker_1 |
| 2024-01-15 10:01:35 | job | job_123 | failed | retrying | auto_retry | system |
| 2024-01-15 10:01:40 | job | job_123 | retrying | queued | re_enqueue | system |
| 2024-01-15 10:01:45 | job | job_123 | queued | processing | worker_pick| worker_2 |
| 2024-01-15 10:02:50 | job | job_123 | processing | completed | success | worker_2 |
Pass Criteria:
- Every transition is logged with: timestamp, from, to, trigger, actor.
- Transition log is queryable per entity (reconstruct full lifecycle).
- No state changes happen without a log entry.
- Log entries are immutable (append-only).
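A quick way to detect unlogged state changes is to check that an entity's log forms an unbroken chain. The sketch below assumes the log rows for one entity have been loaded in insertion order and that timestamps share a single ISO 8601 format, so string comparison orders them correctly; `TransitionLogEntry` and `findLogGaps` are hypothetical names.

```typescript
interface TransitionLogEntry {
  timestamp: string;   // ISO 8601, same format for every entry (assumption)
  entityId: string;
  from: string | null; // null for the creation entry
  to: string;
  trigger: string;
  actor: string;
}

// Checks that one entity's log forms an unbroken chain: each entry's `from`
// equals the previous entry's `to`, and timestamps never go backwards.
function findLogGaps(entries: TransitionLogEntry[]): string[] {
  const problems: string[] = [];
  for (let i = 0; i < entries.length; i++) {
    const entry = entries[i];
    const expectedFrom = i === 0 ? null : entries[i - 1].to;
    if (entry.from !== expectedFrom) {
      problems.push(`entry ${i}: expected from=${expectedFrom}, got ${entry.from}`);
    }
    if (i > 0 && entry.timestamp < entries[i - 1].timestamp) {
      problems.push(`entry ${i}: timestamp earlier than previous entry`);
    }
  }
  return problems;
}
```

A gap in the chain, such as a `processing -> failed` entry directly after `(created) -> queued`, indicates that some code path changed the status without writing a log entry.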
### TEST-SM-006: Child State Consistency
Objective: Verify that parent and child entity states remain consistent.
Steps:
1. Cancel a parent entity (project/batch) while children are processing.
2. Verify all children are moved to a terminal state.
3. Start processing on a parent. Complete some children, fail others.
4. Verify parent state reflects child summary.
Rules:
```
Parent state derivation:
  - All children pending -> parent: pending
  - Any child processing -> parent: processing
  - All children completed -> parent: completed
  - All children terminal, any failed -> parent: partial (or failed if all failed)
  - Parent cancelled -> all non-terminal children: cancelled

[ ] Parent state is derived from children (not independently set)
[ ] Parent cancellation cascades to children
[ ] Child completion triggers parent state recalculation
[ ] No orphaned children (parent deleted but children remain)
```
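The derivation rules above translate fairly directly into a pure function. The sketch below is one possible reading: cases the rules leave open (an empty parent, a mix of pending and finished children, all children cancelled, completed mixed with cancelled) are resolved by assumptions marked in the comments. `deriveParentState` and `cascadeCancel` are hypothetical names.

```typescript
type ChildState = "pending" | "processing" | "completed" | "failed" | "cancelled";
type ParentState = "pending" | "processing" | "completed" | "partial" | "failed" | "cancelled";

// Derives the parent state from its children instead of storing it independently.
function deriveParentState(children: ChildState[], parentCancelled = false): ParentState {
  if (parentCancelled) return "cancelled";
  if (children.length === 0) return "pending"; // assumption: empty parent has not started
  if (children.every((c) => c === "pending")) return "pending";
  const inFlight = children.some((c) => c === "pending" || c === "processing");
  // Assumption: a mix of pending and finished children counts as processing.
  if (inFlight) return "processing";
  // From here on, every child is in a terminal state.
  if (children.every((c) => c === "completed")) return "completed";
  if (children.every((c) => c === "failed")) return "failed";
  if (children.every((c) => c === "cancelled")) return "cancelled"; // assumption
  return "partial"; // any failed; also completed mixed with cancelled (assumption)
}

// Parent cancellation: every non-terminal child becomes cancelled.
function cascadeCancel(children: ChildState[]): ChildState[] {
  return children.map((c) => (c === "pending" || c === "processing" ? "cancelled" : c));
}
```

With this function, a batch where 18 of 20 children completed and 2 failed derives to `partial` instead of `failed`, which addresses the missing "partial success" pattern from the risk table.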
## State Bug vs Code Bug: How to Distinguish
| Situation | State Bug? | Code Bug? | How to Tell |
|---|---|---|---|
| Job stuck in "processing" forever | YES | Maybe | Check: is there a timeout? Is the worker alive? |
| Job shows "completed" but output is wrong | NO | YES | State is correct; the processing logic has a defect |
| Job shows "failed" but output actually exists | YES | Maybe | State transition happened before output check |
| UI shows "processing" but backend says "completed" | YES (UI) | NO | UI polling/websocket is broken; backend is correct |
| Two jobs both "processing" for same entity | YES | YES | Missing uniqueness constraint AND missing lock |
| Entity has status "procesing" (typo) | YES | YES | String-based states without enum validation |
## State Machine Implementation Template
```typescript
// Type-safe state machine with transition validation.
// Job, db, logTransition, InvalidTransitionError, and ConcurrentModificationError
// are assumed to be defined elsewhere in the codebase.
enum JobStatus {
  Queued = 'queued',
  Processing = 'processing',
  Completed = 'completed',
  Failed = 'failed',
  Cancelled = 'cancelled',
  Retrying = 'retrying',
}

const VALID_TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  [JobStatus.Queued]:     [JobStatus.Processing, JobStatus.Failed, JobStatus.Cancelled],
  [JobStatus.Processing]: [JobStatus.Completed, JobStatus.Failed, JobStatus.Cancelled, JobStatus.Retrying],
  [JobStatus.Completed]:  [], // terminal
  [JobStatus.Failed]:     [JobStatus.Retrying],
  [JobStatus.Cancelled]:  [], // terminal
  [JobStatus.Retrying]:   [JobStatus.Queued, JobStatus.Failed, JobStatus.Cancelled],
};

// Declared async because it awaits the log write and the update.
async function transitionJob(job: Job, newStatus: JobStatus): Promise<void> {
  const allowed = VALID_TRANSITIONS[job.status];
  if (!allowed.includes(newStatus)) {
    throw new InvalidTransitionError(
      `Cannot transition job ${job.id} from ${job.status} to ${newStatus}`
    );
  }

  // Log transition. The log write and the update should share one DB
  // transaction so a failed update does not leave an orphan log entry.
  await logTransition(job.id, job.status, newStatus);

  // Update with optimistic lock: the write only matches if the status is unchanged.
  // Assumes db.jobs.update resolves to a falsy value when no row matched.
  const updated = await db.jobs.update({
    where: { id: job.id, status: job.status }, // optimistic lock
    data: { status: newStatus, updated_at: new Date() },
  });

  if (!updated) throw new ConcurrentModificationError();
}
```
## Post-Audit Checklist
```
[ ] All stateful entities identified and documented
[ ] States defined as enums (not strings)
[ ] Valid transitions defined in code (transition matrix)
[ ] Invalid transitions rejected with errors
[ ] Terminal states identified; every state can reach one
[ ] Timeout detection for non-terminal states
[ ] Parent-child state consistency enforced
[ ] UI accurately reflects backend state for all states
[ ] State transitions logged with timestamp, from, to, trigger, actor
[ ] Impossible states prevented by DB constraints and app logic
[ ] No compound boolean states; single status field per entity
[ ] Stuck entity alerting configured
```
## What Earlier Audits Miss
Standard testing verifies happy-path state transitions. This audit matters because:
- Unit tests validate individual transitions but miss reachability gaps (states that can be entered but never exited).
- Integration tests rarely test impossible state combinations (e.g., "completed" with zero outputs).
- Code reviews catch individual state assignments but miss that new states were added without updating the transition matrix.
- QA testing follows scripted flows and never puts the system into states caused by worker crashes, network partitions, or partial failures.
- Monitoring tracks error rates but not stuck-state accumulation. Jobs silently pile up in "processing" without anyone noticing.
A State Machine Audit closes these gaps by testing whether all entity states are explicitly defined, transitions are validated, and every non-terminal state can reach a terminal state under normal operation, failure conditions, and concurrent access.
## Extended Transition Matrix Examples
### Asset Status Transitions
```
FROM \ TO      | pending | generating | ready | error | deleted | regenerating |
---------------|---------|------------|-------|-------|---------|--------------|
(initial)      |    Y    |     N      |   N   |   N   |    N    |      N       |
pending        |    -    |     Y      |   N   |   Y   |    Y    |      N       |
generating     |    N    |     -      |   Y   |   Y   |    N    |      N       |
ready          |    N    |     N      |   -   |   N   |    Y    |      Y       |
error          |    N    |     Y      |   N   |   -   |    Y    |      N       |
deleted        |    N    |     N      |   N   |   N   |    -    |      N       |
regenerating   |    N    |     N      |   Y   |   Y   |    N    |      -       |

Terminal states: deleted
Resettable terminal: error (can retry -> generating)
```
### Payment Status Transitions
```
FROM \ TO      | pending | authorized | captured | refunded | failed | disputed |
---------------|---------|------------|----------|----------|--------|----------|
(initial)      |    Y    |     N      |    N     |    N     |   N    |    N     |
pending        |    -    |     Y      |    N     |    N     |   Y    |    N     |
authorized     |    N    |     -      |    Y     |    Y     |   Y    |    N     |
captured       |    N    |     N      |    -     |    Y     |   N    |    Y     |
refunded       |    N    |     N      |    N     |    -     |   N    |    N     |
failed         |    N    |     N      |    N     |    N     |   -    |    N     |
disputed       |    N    |     N      |    Y     |    Y     |   N    |    -     |

Terminal states: refunded, failed
```
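The "Terminal states" lines under each matrix can be derived from the matrix itself instead of maintained by hand. The sketch below encodes the two matrices above as adjacency lists and computes both the terminal states (rows with no outgoing transition) and any states unreachable from the initial state. `terminalStatesOf` and `unreachableStates` are hypothetical helpers.

```typescript
type TransitionTable = Record<string, string[]>;

// States with no outgoing transitions.
function terminalStatesOf(table: TransitionTable): string[] {
  return Object.keys(table).filter((s) => table[s].length === 0).sort();
}

// States that can never be entered when starting from `initial` (breadth-first search).
function unreachableStates(table: TransitionTable, initial: string): string[] {
  const seen = new Set<string>([initial]);
  const queue = [initial];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const next of table[current] ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return Object.keys(table).filter((s) => !seen.has(s)).sort();
}

const assetTable: TransitionTable = {
  pending: ["generating", "error", "deleted"],
  generating: ["ready", "error"],
  ready: ["deleted", "regenerating"],
  error: ["generating", "deleted"],
  deleted: [],
  regenerating: ["ready", "error"],
};

const paymentTable: TransitionTable = {
  pending: ["authorized", "failed"],
  authorized: ["captured", "refunded", "failed"],
  captured: ["refunded", "disputed"],
  refunded: [],
  failed: [],
  disputed: ["captured", "refunded"],
};
```

For these tables, `terminalStatesOf` yields `deleted` for assets and `failed` and `refunded` for payments, and `unreachableStates(..., "pending")` is empty for both. A state that shows up as unreachable is usually either dead code or a sign that a transition row was never added.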
## Automation Opportunities
| Test | Automatable? | Method |
|---|---|---|
| TEST-SM-001: Enumerate states | YES | Static analysis: grep for status assignments, compare against enum |
| TEST-SM-002: Terminal reachability | YES | Graph traversal test on transition matrix; assert no dead-end non-terminal states |
| TEST-SM-003: Impossible states | PARTIAL | DB constraint validation + scheduled SQL integrity checks |
| TEST-SM-004: UI state accuracy | MANUAL | Visual inspection of each state rendering |
| TEST-SM-005: Transition logging | YES | Integration test: walk lifecycle, assert all transitions logged |
| TEST-SM-006: Child consistency | YES | Integration test: cancel parent, assert children terminal |
```bash
# Automated state enumeration check

# Find all unique status values used in code
grep -rohn "status.*=.*['\"]\\([a-z_]*\\)['\"]" src/ | \
  sed "s/.*['\"]\\([a-z_]*\\)['\"].*/\\1/" | sort -u > used_states.txt

# Compare against defined enum
grep -o "[A-Za-z]* = '[a-z_]*'" src/types/status.ts | \
  sed "s/.* = '\\(.*\\)'/\\1/" | sort -u > defined_states.txt

diff defined_states.txt used_states.txt
# Lines prefixed with ">" are used but not defined = undocumented state (FAIL)
# Lines prefixed with "<" are defined but never used = possible dead enum value
```
## Reusable Audit Report Template
```markdown
# State Machine Audit Report

## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________

## Stateful Entities Identified
| Entity | State Field | States Found | States Documented | Enum? |
|--------|------------|-------------|-------------------|-------|
| Job | status | ___ | ___ | [ ] |
| Asset | status | ___ | ___ | [ ] |

## Transition Matrix Verification
| Entity | Valid Transitions Defined? | Enforced in Code? | Logged? |
|--------|--------------------------|-------------------|---------|
| Job | PASS/FAIL | PASS/FAIL | PASS/FAIL |
| Asset | PASS/FAIL | PASS/FAIL | PASS/FAIL |

## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-SM-001 | State enumeration | PASS/FAIL | ___ states found vs ___ defined |
| TEST-SM-002 | Terminal reachability | PASS/FAIL | Unreachable terminals: ___ |
| TEST-SM-003 | Impossible states | PASS/FAIL | ___ impossible states achievable |
| TEST-SM-004 | UI accuracy | PASS/FAIL | ___ states misrepresented |
| TEST-SM-005 | Transition logging | PASS/FAIL | ___ transitions unlogged |
| TEST-SM-006 | Child consistency | PASS/FAIL | Orphaned children after cancel: ___ |

## Score: PASS / PARTIAL / FAIL
```
## Priority Targeting
Run this audit FIRST if:
- Operators regularly fix stuck jobs by updating the database directly
- The UI shows contradictory information ("Completed" with no output)
- Status values are stored as strings without validation
- New states have been added without updating transition rules
- Parent entities can be in a state that contradicts their children