Skip to main content
Film & TelevisionProduction Audit531 lines

Observability & Debuggability Audit

Quick Summary34 lines
Verify that production failures can be diagnosed quickly and completely. When something goes wrong at 2am, the on-call engineer needs to answer: What failed? Why? Which user? Which request? What was the input? What did the external service return? How long did each step take?

## Key Points

1. Trigger a generation/processing workflow via the UI.
2. Note the time of the request.
3. Search logs for the request.
4. Follow the request through every step.
- [ ] Every step is present in logs.
- [ ] The same correlation ID links all steps.
- [ ] An engineer can reconstruct the entire flow from logs alone.
- [ ] Total time from logs matches actual elapsed time.
- Gaps in the trace (steps with no log entry).
- Correlation ID missing from any step.
- Cannot determine which external call was made or what it returned.
- Duration not captured for any step.

## Quick Example

```json
{"batch_id": "batch-123", "item_id": "asset-456", "item_name": "hero.png",
 "status": "failed", "error": "Provider rejected: content policy violation",
 "provider_error_code": "content_filter", "attempt": 2, "request_id": "abc-123"}
```

```
[ ] timestamp (ISO 8601)
[ ] level (ERROR, WARN, INFO, DEBUG)
[ ] message (human-readable description)
[ ] service (which service/worker emitted this)
[ ] request_id / correlation_id
```
skilldb get production-audit-skills/observability-debuggability-auditFull skill: 531 lines
Paste into your CLAUDE.md or agent config

Observability & Debuggability Audit

Purpose

Verify that production failures can be diagnosed quickly and completely. When something goes wrong at 2am, the on-call engineer needs to answer: What failed? Why? Which user? Which request? What was the input? What did the external service return? How long did each step take?

If the logs cannot answer these questions, the system is operating blind.


Scope

AreaWhat We Test
Structured loggingAre logs machine-parseable with consistent fields?
Correlation IDsCan a single request be traced across all services and logs?
Error detailDo errors include enough context to diagnose without reproducing?
Per-step timingIs duration captured for each significant operation?
External call loggingAre provider requests and responses captured?
AlertingDo critical failures trigger alerts?
DashboardsCan system health be assessed at a glance?
Audit trailCan user actions be reconstructed?

Risk Pattern Table

PatternWhat It HitsRiskSymptom
Unstructured log messagesDebuggingHIGHCannot search/filter logs programmatically; grep through raw text
No correlation IDDebuggingCRITICALCannot connect request -> job -> external call -> response
"Unknown error" messagesUX, DebuggingHIGHUser sees generic error; engineer has no starting point
Missing external call loggingDebuggingHIGH"Provider failed" but no request body, response, or status code
No per-step timingPerformanceMEDIUM"Job took 60s" but cannot tell which step was slow
Swallowed exceptionsReliabilityHIGHcatch(e) {} -- error occurs but is invisible
Log level misuseOperationsMEDIUMErrors logged as INFO; warnings logged as DEBUG; noise drowns signal
No entity identifiers in logsDebuggingHIGH"Job failed" but which job? For which user? Which project?
Missing input loggingDebuggingMEDIUMCannot reproduce failure because input parameters are not recorded
No health check endpointOperationsMEDIUMCannot determine if service is alive vs degraded vs down

Concrete Test Cases

TEST-OB-001: End-to-End Request Tracing

Objective: Follow a single user action from API request through all processing steps to completion, using only logs.

Steps:

  1. Trigger a generation/processing workflow via the UI.
  2. Note the time of the request.
  3. Search logs for the request.
  4. Follow the request through every step.

What to Trace:

Step 1: HTTP Request received
  [ ] Request ID / Correlation ID present
  [ ] User ID present
  [ ] Endpoint and method logged
  [ ] Request body logged (or key parameters)
  [ ] Timestamp

Step 2: Business logic
  [ ] Entity IDs present (project_id, asset_id, etc.)
  [ ] Same correlation ID as step 1
  [ ] Decision points logged ("generating 20 items", "using provider X")

Step 3: Job enqueued
  [ ] Job ID present
  [ ] Same correlation ID
  [ ] Queue name and priority logged
  [ ] Job parameters logged

Step 4: Job picked up by worker
  [ ] Same job ID and correlation ID
  [ ] Worker identifier present
  [ ] Processing start time logged

Step 5: External API call
  [ ] Provider name and endpoint logged
  [ ] Request parameters logged (sanitized - no secrets)
  [ ] Response status code logged
  [ ] Response time (duration) logged
  [ ] Response body logged (or key fields)
  [ ] Same correlation ID

Step 6: Result storage
  [ ] Storage operation logged (file path, size)
  [ ] Duration logged
  [ ] Same correlation ID

Step 7: Completion
  [ ] Final status logged (success/failure)
  [ ] Total duration logged
  [ ] Same correlation ID throughout

Pass Criteria:

  • Every step is present in logs.
  • The same correlation ID links all steps.
  • An engineer can reconstruct the entire flow from logs alone.
  • Total time from logs matches actual elapsed time.

Fail Criteria:

  • Gaps in the trace (steps with no log entry).
  • Correlation ID missing from any step.
  • Cannot determine which external call was made or what it returned.
  • Duration not captured for any step.

TEST-OB-002: Correlation ID Propagation

Objective: Verify that correlation IDs are present on every log line and propagated across service boundaries.

Steps:

  1. Check: is a correlation ID generated at the API gateway/entry point?
  2. Check: is it passed to background jobs?
  3. Check: is it included in external API calls (as header)?
  4. Check: is it included in webhook responses?
  5. Search logs for a known correlation ID. Verify all related entries appear.

Implementation Check:

[ ] Correlation ID generated on every incoming request (middleware)
[ ] Stored in request context / async local storage
[ ] Automatically included in every log entry
[ ] Passed to background jobs (in job payload)
[ ] Included in external API calls (X-Request-ID header)
[ ] Returned in API responses (for client-side correlation)
[ ] Indexed in log storage for fast retrieval

Propagation Diagram:

Client Request (generates or receives request_id)
  -> API Server (logs with request_id)
    -> Database queries (tagged with request_id in comments)
    -> Enqueue job (job.metadata.request_id = request_id)
      -> Worker (logs with request_id from job metadata)
        -> External API (sends X-Request-ID: request_id)
        -> Storage (logs with request_id)
    -> Response (includes X-Request-ID header)

TEST-OB-003: Error Message Quality

Objective: Verify that error messages are specific, actionable, and include diagnostic context.

Steps:

  1. Trigger various failure modes.
  2. For each, check: the user-facing message, the log entry, and any alert.

Error Quality Matrix:

Error ScenarioBAD MessageGOOD Message
Provider API timeout"Unknown error""AI provider timed out after 30s for asset abc-123. Request ID: xyz-789"
Invalid user input"Bad request""Field 'prompt' exceeds 4000 character limit (received: 4523)"
Permission denied"Error""User user_456 lacks 'write' permission on project proj_789"
Rate limited"Please try again""Rate limit exceeded: 10/min for /api/generate. Retry after: 45s"
Storage failure"Upload failed""Failed to upload to gs://bucket/path: 503 Service Unavailable (attempt 2/3)"
DB constraint violation"Server error""Duplicate asset name 'hero.png' in project proj_789. Use a unique name."

Pass Criteria:

  • No error message is "Unknown error", "Something went wrong", or empty.
  • Every error log includes: entity IDs, user ID, operation attempted, specific failure reason.
  • User-facing errors are helpful without exposing internals (no stack traces to users).
  • Log-level errors include full diagnostic context (stack traces, request bodies, response bodies).

TEST-OB-004: Per-Step Duration Logging

Objective: Verify that each significant operation step has timing information in logs.

Steps:

  1. Run a multi-step workflow.
  2. Extract all timing entries from logs.
  3. Verify each step has duration.

Expected Timing Log:

{"step": "validate_input", "duration_ms": 5, "request_id": "abc-123"}
{"step": "check_permissions", "duration_ms": 12, "request_id": "abc-123"}
{"step": "enqueue_job", "duration_ms": 8, "request_id": "abc-123"}
{"step": "call_ai_provider", "duration_ms": 8500, "request_id": "abc-123", "provider": "openai", "model": "gpt-4"}
{"step": "upload_to_storage", "duration_ms": 1200, "request_id": "abc-123", "file_size_bytes": 524288}
{"step": "update_database", "duration_ms": 15, "request_id": "abc-123"}
{"step": "total_processing", "duration_ms": 9740, "request_id": "abc-123"}

Pass Criteria:

  • Every step that takes > 100ms has duration logging.
  • External calls always have duration logging regardless of speed.
  • Total duration is logged at operation completion.
  • Duration is in a consistent unit (milliseconds recommended).
  • Slow operations (> p95) are logged at WARN level.

TEST-OB-005: External Call Detail Logging

Objective: Verify that all external API calls are logged with sufficient detail for debugging.

Steps:

  1. Identify all external services called (AI providers, storage, payment, email, etc.).
  2. For each, verify logging of: request, response, duration, errors.

External Call Log Checklist:

For EVERY external call:
[ ] Provider name logged
[ ] Endpoint/method logged
[ ] Request parameters logged (key fields, not full binary)
[ ] Request size logged (if relevant)
[ ] Response status code logged
[ ] Response body logged (key fields, or full if small)
[ ] Response time logged (milliseconds)
[ ] Retry attempt number logged (if retrying)
[ ] Error details logged on failure (full error, not just "failed")
[ ] Correlation ID included

Sensitive data handling:
[ ] API keys are NOT logged (masked or omitted)
[ ] User PII is handled per policy (masked or omitted)
[ ] Binary content is NOT logged (log size/hash instead)
[ ] Secrets in request bodies are redacted

TEST-OB-006: Log Level Correctness

Objective: Verify that log levels are used consistently and correctly.

Standards:

ERROR: Something broke and needs attention. An operation failed.
  Examples: External API 500, DB write failed, job crashed
  NOT: User input validation failure, 404 on missing resource

WARN: Something unexpected that may need attention.
  Examples: Retry triggered, slow query detected, rate limit approaching
  NOT: Normal business logic branches

INFO: Normal operations worth recording.
  Examples: Request received, job started, job completed, user logged in
  NOT: Every function call, every DB query

DEBUG: Detailed information for development.
  Examples: Full request/response bodies, intermediate calculations
  NOT: Enabled in production by default

Steps:

  1. Review log output for a normal workflow. Verify no ERRORs for expected behavior.
  2. Trigger a failure. Verify ERROR level used with full context.
  3. Check production log volume. Verify INFO is not overwhelming.

Pass Criteria:

  • Production logs at INFO level do not exceed 100 lines/minute during idle.
  • Every ERROR log represents a genuine failure requiring investigation.
  • WARN logs are actionable (not just noise).
  • DEBUG logs are disabled in production (or behind a flag).

TEST-OB-007: Health Check and Readiness

Objective: Verify that service health can be assessed programmatically.

Steps:

  1. Hit the health check endpoint: GET /health or GET /healthz.
  2. Verify it checks actual dependencies (DB, cache, queue).
  3. Simulate a dependency failure. Verify health check reflects it.

Health Check Response Template:

{
  "status": "healthy",
  "version": "1.2.3",
  "uptime_seconds": 86400,
  "checks": {
    "database": {"status": "healthy", "latency_ms": 5},
    "redis": {"status": "healthy", "latency_ms": 2},
    "storage": {"status": "healthy", "latency_ms": 45},
    "queue": {"status": "healthy", "depth": 12}
  }
}

Pass Criteria:

  • Health endpoint exists and responds within 5 seconds.
  • Checks real dependencies (not just returns 200 always).
  • Returns structured JSON with per-dependency status.
  • Unhealthy dependency changes overall status.
  • Load balancer / orchestrator configured to use health endpoint.

TEST-OB-008: Failure Item Identification

Objective: Verify that when items in a batch fail, the specific failing items are identified.

Steps:

  1. Start a batch operation on 20 items.
  2. Cause 3 items to fail (bad input, provider rejection, etc.).
  3. Check logs and UI.

Pass Criteria:

  • Failed items are identified by name/ID in logs.
  • Failure reason is specific per item (not just "3 items failed").
  • Successful items are listed separately.
  • User can see which items failed and why.
  • Retry targets only the failed items.

Log Format:

{"batch_id": "batch-123", "item_id": "asset-456", "item_name": "hero.png",
 "status": "failed", "error": "Provider rejected: content policy violation",
 "provider_error_code": "content_filter", "attempt": 2, "request_id": "abc-123"}

Structured Logging Template

{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "INFO",
  "message": "Asset generation completed",
  "service": "worker",
  "request_id": "req-abc-123",
  "job_id": "job-def-456",
  "user_id": "user-789",
  "project_id": "proj-012",
  "asset_id": "asset-345",
  "provider": "openai",
  "model": "dall-e-3",
  "duration_ms": 8500,
  "step": "generate_asset",
  "metadata": {
    "input_tokens": 150,
    "output_size_bytes": 524288,
    "retry_count": 0
  }
}

Required Fields on EVERY Log Entry:

[ ] timestamp (ISO 8601)
[ ] level (ERROR, WARN, INFO, DEBUG)
[ ] message (human-readable description)
[ ] service (which service/worker emitted this)
[ ] request_id / correlation_id

Required Fields on Operation Logs:

[ ] user_id (who triggered this)
[ ] entity IDs (what was operated on)
[ ] duration_ms (how long it took)
[ ] status (success/failure/partial)

Alerting Checklist

Critical (page on-call):
[ ] Service down (health check failing > 2 minutes)
[ ] Error rate > 5% of requests
[ ] External provider failure rate > 10%
[ ] Queue depth growing without processing (stuck)
[ ] Zero successful jobs in past 15 minutes (during business hours)

Warning (notify team channel):
[ ] p95 latency > 2x baseline
[ ] Disk/memory usage > 80%
[ ] Background job failure rate > 5%
[ ] Certificate expiry within 14 days
[ ] Unusual traffic pattern (10x normal)

Dashboard metrics:
[ ] Request rate (RPM)
[ ] Error rate (% and count)
[ ] Latency (p50, p95, p99)
[ ] Queue depth and processing rate
[ ] External API response time and error rate
[ ] Active users / sessions
[ ] Job completion rate

Post-Audit Checklist

[ ] Correlation ID generated on every request and propagated to all steps
[ ] All logs are structured JSON (not plain text)
[ ] Every log entry has timestamp, level, service, request_id
[ ] Error logs include: entity IDs, user ID, full error details, stack trace
[ ] No "Unknown error" messages in user-facing responses
[ ] External API calls logged with: request, response status, duration
[ ] Per-step timing logged for all operations > 100ms
[ ] Health check endpoint exists and tests real dependencies
[ ] Failed batch items identified by name/ID with specific error
[ ] Log levels used correctly (ERROR = real failures only)
[ ] Alerting configured for critical failure modes
[ ] Sensitive data (keys, PII) redacted from logs
[ ] Log retention configured (30+ days for debugging)
[ ] Logs are searchable/filterable by correlation ID

What Earlier Audits Miss

Standard testing verifies that features work. This audit matters because:

  • Unit tests catch logic errors but never verify that errors produce useful log output.
  • Integration tests verify outcomes but not that the diagnostic trail exists to debug when outcomes are wrong.
  • Code reviews check error handling logic but rarely verify that errors include entity IDs, request context, and provider response details.
  • QA testing verifies user-facing error messages but never checks what the engineering team sees in logs.
  • Security audits verify that sensitive data is protected but not that enough non-sensitive data is logged for debugging.

This would be called an Observability & Debuggability Audit -- specifically testing whether production failures can be diagnosed quickly and completely under real incident conditions using only logs, metrics, and alerts.


Automation Opportunities

TestAutomatable?Method
TEST-OB-001: E2E tracingPARTIALTrigger workflow, search logs for correlation ID, assert all steps present
TEST-OB-002: Correlation IDYESMiddleware test: assert every response includes X-Request-ID header
TEST-OB-003: Error qualityPARTIALTrigger known errors, assert log messages match quality template
TEST-OB-004: Per-step timingYESTrigger workflow, parse logs, assert duration fields present
TEST-OB-005: External call loggingYESMock external call, assert log includes request/response/duration
TEST-OB-006: Log level correctnessPARTIALTrigger normal + error flows, assert no ERRORs on happy path
TEST-OB-007: Health checkYESHit /health endpoint, assert structured response with dependency status
TEST-OB-008: Failure identificationYESFail items in batch, assert failed items identified by ID in logs
# Automated correlation ID propagation test
RESPONSE=$(curl -s -D - /api/projects)
REQUEST_ID=$(echo "$RESPONSE" | grep -i "x-request-id" | awk '{print $2}' | tr -d '\r')
if [ -z "$REQUEST_ID" ]; then
  echo "FAIL: No X-Request-ID in response headers"
else
  echo "PASS: Request ID = $REQUEST_ID"
  # Verify it appears in logs
  LOG_COUNT=$(grep -c "$REQUEST_ID" /var/log/app/*.log 2>/dev/null || echo "0")
  echo "Log entries with this ID: $LOG_COUNT"
fi

# Automated health check verification
HEALTH=$(curl -s /health)
echo "$HEALTH" | jq -e '.checks.database.status' > /dev/null && echo "PASS: DB check present" || echo "FAIL: No DB check"
echo "$HEALTH" | jq -e '.checks.redis.status' > /dev/null && echo "PASS: Redis check present" || echo "FAIL: No Redis check"

Reusable Audit Report Template

# Observability & Debuggability Audit Report

## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________

## Log Infrastructure
| Aspect | Status | Details |
|--------|--------|---------|
| Structured logging | yes/no | Format: JSON/text |
| Correlation IDs | yes/no | Generated at: ___ |
| Log retention | ___ days | Storage: ___ |
| Log search | yes/no | Tool: ___ |

## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-OB-001 | E2E tracing | PASS/FAIL | ___ steps traceable out of ___ |
| TEST-OB-002 | Correlation ID | PASS/FAIL | Propagated to: API/jobs/external/all |
| TEST-OB-003 | Error quality | PASS/FAIL | ___ errors are generic |
| TEST-OB-004 | Per-step timing | PASS/FAIL | ___ steps have duration logging |
| TEST-OB-005 | External call detail | PASS/FAIL | ___ providers missing request/response logging |
| TEST-OB-006 | Log levels | PASS/FAIL | ___ ERRORs on happy path |
| TEST-OB-007 | Health check | PASS/FAIL | Dependencies checked: ___ |
| TEST-OB-008 | Failure identification | PASS/FAIL | Failed items identified by ID: yes/no |

## Alerting Coverage
| Failure Mode | Alert Exists? | Tested? | Response Time |
|-------------|-------------|---------|---------------|
| Service down | | | |
| Error rate spike | | | |
| External provider failure | | | |
| Queue depth growing | | | |

## Score: PASS / PARTIAL / FAIL

Priority Targeting

Run this audit FIRST if:

  • Debugging production issues requires SSH into servers and grepping raw logs
  • The phrase "I can't tell what happened" comes up regularly
  • Error messages shown to users are generic
  • External service failures take hours to diagnose
  • No alerting is configured for critical failures
  • Log storage is unstructured text files

Install this skill directly: skilldb add production-audit-skills

Get CLI access →