Observability & Debuggability Audit
Verify that production failures can be diagnosed quickly and completely. When something goes wrong at 2am, the on-call engineer needs to answer: What failed? Why? Which user? Which request? What was the input? What did the external service return? How long did each step take?
## Key Points
1. Trigger a generation/processing workflow via the UI.
2. Note the time of the request.
3. Search logs for the request.
4. Follow the request through every step.
- [ ] Every step is present in logs.
- [ ] The same correlation ID links all steps.
- [ ] An engineer can reconstruct the entire flow from logs alone.
- [ ] Total time from logs matches actual elapsed time.
- Gaps in the trace (steps with no log entry).
- Correlation ID missing from any step.
- Cannot determine which external call was made or what it returned.
- Duration not captured for any step.
## Quick Example
```json
{"batch_id": "batch-123", "item_id": "asset-456", "item_name": "hero.png",
"status": "failed", "error": "Provider rejected: content policy violation",
"provider_error_code": "content_filter", "attempt": 2, "request_id": "abc-123"}
```
```
[ ] timestamp (ISO 8601)
[ ] level (ERROR, WARN, INFO, DEBUG)
[ ] message (human-readable description)
[ ] service (which service/worker emitted this)
[ ] request_id / correlation_id
```skilldb get production-audit-skills/observability-debuggability-auditFull skill: 531 linesObservability & Debuggability Audit
Purpose
Verify that production failures can be diagnosed quickly and completely. When something goes wrong at 2am, the on-call engineer needs to answer: What failed? Why? Which user? Which request? What was the input? What did the external service return? How long did each step take?
If the logs cannot answer these questions, the system is operating blind.
Scope
| Area | What We Test |
|---|---|
| Structured logging | Are logs machine-parseable with consistent fields? |
| Correlation IDs | Can a single request be traced across all services and logs? |
| Error detail | Do errors include enough context to diagnose without reproducing? |
| Per-step timing | Is duration captured for each significant operation? |
| External call logging | Are provider requests and responses captured? |
| Alerting | Do critical failures trigger alerts? |
| Dashboards | Can system health be assessed at a glance? |
| Audit trail | Can user actions be reconstructed? |
Risk Pattern Table
| Pattern | What It Hits | Risk | Symptom |
|---|---|---|---|
| Unstructured log messages | Debugging | HIGH | Cannot search/filter logs programmatically; grep through raw text |
| No correlation ID | Debugging | CRITICAL | Cannot connect request -> job -> external call -> response |
| "Unknown error" messages | UX, Debugging | HIGH | User sees generic error; engineer has no starting point |
| Missing external call logging | Debugging | HIGH | "Provider failed" but no request body, response, or status code |
| No per-step timing | Performance | MEDIUM | "Job took 60s" but cannot tell which step was slow |
| Swallowed exceptions | Reliability | HIGH | catch(e) {} -- error occurs but is invisible |
| Log level misuse | Operations | MEDIUM | Errors logged as INFO; warnings logged as DEBUG; noise drowns signal |
| No entity identifiers in logs | Debugging | HIGH | "Job failed" but which job? For which user? Which project? |
| Missing input logging | Debugging | MEDIUM | Cannot reproduce failure because input parameters are not recorded |
| No health check endpoint | Operations | MEDIUM | Cannot determine if service is alive vs degraded vs down |
Concrete Test Cases
TEST-OB-001: End-to-End Request Tracing
Objective: Follow a single user action from API request through all processing steps to completion, using only logs.
Steps:
- Trigger a generation/processing workflow via the UI.
- Note the time of the request.
- Search logs for the request.
- Follow the request through every step.
What to Trace:
Step 1: HTTP Request received
[ ] Request ID / Correlation ID present
[ ] User ID present
[ ] Endpoint and method logged
[ ] Request body logged (or key parameters)
[ ] Timestamp
Step 2: Business logic
[ ] Entity IDs present (project_id, asset_id, etc.)
[ ] Same correlation ID as step 1
[ ] Decision points logged ("generating 20 items", "using provider X")
Step 3: Job enqueued
[ ] Job ID present
[ ] Same correlation ID
[ ] Queue name and priority logged
[ ] Job parameters logged
Step 4: Job picked up by worker
[ ] Same job ID and correlation ID
[ ] Worker identifier present
[ ] Processing start time logged
Step 5: External API call
[ ] Provider name and endpoint logged
[ ] Request parameters logged (sanitized - no secrets)
[ ] Response status code logged
[ ] Response time (duration) logged
[ ] Response body logged (or key fields)
[ ] Same correlation ID
Step 6: Result storage
[ ] Storage operation logged (file path, size)
[ ] Duration logged
[ ] Same correlation ID
Step 7: Completion
[ ] Final status logged (success/failure)
[ ] Total duration logged
[ ] Same correlation ID throughout
Pass Criteria:
- Every step is present in logs.
- The same correlation ID links all steps.
- An engineer can reconstruct the entire flow from logs alone.
- Total time from logs matches actual elapsed time.
Fail Criteria:
- Gaps in the trace (steps with no log entry).
- Correlation ID missing from any step.
- Cannot determine which external call was made or what it returned.
- Duration not captured for any step.
TEST-OB-002: Correlation ID Propagation
Objective: Verify that correlation IDs are present on every log line and propagated across service boundaries.
Steps:
- Check: is a correlation ID generated at the API gateway/entry point?
- Check: is it passed to background jobs?
- Check: is it included in external API calls (as header)?
- Check: is it included in webhook responses?
- Search logs for a known correlation ID. Verify all related entries appear.
Implementation Check:
[ ] Correlation ID generated on every incoming request (middleware)
[ ] Stored in request context / async local storage
[ ] Automatically included in every log entry
[ ] Passed to background jobs (in job payload)
[ ] Included in external API calls (X-Request-ID header)
[ ] Returned in API responses (for client-side correlation)
[ ] Indexed in log storage for fast retrieval
Propagation Diagram:
Client Request (generates or receives request_id)
-> API Server (logs with request_id)
-> Database queries (tagged with request_id in comments)
-> Enqueue job (job.metadata.request_id = request_id)
-> Worker (logs with request_id from job metadata)
-> External API (sends X-Request-ID: request_id)
-> Storage (logs with request_id)
-> Response (includes X-Request-ID header)
TEST-OB-003: Error Message Quality
Objective: Verify that error messages are specific, actionable, and include diagnostic context.
Steps:
- Trigger various failure modes.
- For each, check: the user-facing message, the log entry, and any alert.
Error Quality Matrix:
| Error Scenario | BAD Message | GOOD Message |
|---|---|---|
| Provider API timeout | "Unknown error" | "AI provider timed out after 30s for asset abc-123. Request ID: xyz-789" |
| Invalid user input | "Bad request" | "Field 'prompt' exceeds 4000 character limit (received: 4523)" |
| Permission denied | "Error" | "User user_456 lacks 'write' permission on project proj_789" |
| Rate limited | "Please try again" | "Rate limit exceeded: 10/min for /api/generate. Retry after: 45s" |
| Storage failure | "Upload failed" | "Failed to upload to gs://bucket/path: 503 Service Unavailable (attempt 2/3)" |
| DB constraint violation | "Server error" | "Duplicate asset name 'hero.png' in project proj_789. Use a unique name." |
Pass Criteria:
- No error message is "Unknown error", "Something went wrong", or empty.
- Every error log includes: entity IDs, user ID, operation attempted, specific failure reason.
- User-facing errors are helpful without exposing internals (no stack traces to users).
- Log-level errors include full diagnostic context (stack traces, request bodies, response bodies).
TEST-OB-004: Per-Step Duration Logging
Objective: Verify that each significant operation step has timing information in logs.
Steps:
- Run a multi-step workflow.
- Extract all timing entries from logs.
- Verify each step has duration.
Expected Timing Log:
{"step": "validate_input", "duration_ms": 5, "request_id": "abc-123"}
{"step": "check_permissions", "duration_ms": 12, "request_id": "abc-123"}
{"step": "enqueue_job", "duration_ms": 8, "request_id": "abc-123"}
{"step": "call_ai_provider", "duration_ms": 8500, "request_id": "abc-123", "provider": "openai", "model": "gpt-4"}
{"step": "upload_to_storage", "duration_ms": 1200, "request_id": "abc-123", "file_size_bytes": 524288}
{"step": "update_database", "duration_ms": 15, "request_id": "abc-123"}
{"step": "total_processing", "duration_ms": 9740, "request_id": "abc-123"}
Pass Criteria:
- Every step that takes > 100ms has duration logging.
- External calls always have duration logging regardless of speed.
- Total duration is logged at operation completion.
- Duration is in a consistent unit (milliseconds recommended).
- Slow operations (> p95) are logged at WARN level.
TEST-OB-005: External Call Detail Logging
Objective: Verify that all external API calls are logged with sufficient detail for debugging.
Steps:
- Identify all external services called (AI providers, storage, payment, email, etc.).
- For each, verify logging of: request, response, duration, errors.
External Call Log Checklist:
For EVERY external call:
[ ] Provider name logged
[ ] Endpoint/method logged
[ ] Request parameters logged (key fields, not full binary)
[ ] Request size logged (if relevant)
[ ] Response status code logged
[ ] Response body logged (key fields, or full if small)
[ ] Response time logged (milliseconds)
[ ] Retry attempt number logged (if retrying)
[ ] Error details logged on failure (full error, not just "failed")
[ ] Correlation ID included
Sensitive data handling:
[ ] API keys are NOT logged (masked or omitted)
[ ] User PII is handled per policy (masked or omitted)
[ ] Binary content is NOT logged (log size/hash instead)
[ ] Secrets in request bodies are redacted
TEST-OB-006: Log Level Correctness
Objective: Verify that log levels are used consistently and correctly.
Standards:
ERROR: Something broke and needs attention. An operation failed.
Examples: External API 500, DB write failed, job crashed
NOT: User input validation failure, 404 on missing resource
WARN: Something unexpected that may need attention.
Examples: Retry triggered, slow query detected, rate limit approaching
NOT: Normal business logic branches
INFO: Normal operations worth recording.
Examples: Request received, job started, job completed, user logged in
NOT: Every function call, every DB query
DEBUG: Detailed information for development.
Examples: Full request/response bodies, intermediate calculations
NOT: Enabled in production by default
Steps:
- Review log output for a normal workflow. Verify no ERRORs for expected behavior.
- Trigger a failure. Verify ERROR level used with full context.
- Check production log volume. Verify INFO is not overwhelming.
Pass Criteria:
- Production logs at INFO level do not exceed 100 lines/minute during idle.
- Every ERROR log represents a genuine failure requiring investigation.
- WARN logs are actionable (not just noise).
- DEBUG logs are disabled in production (or behind a flag).
TEST-OB-007: Health Check and Readiness
Objective: Verify that service health can be assessed programmatically.
Steps:
- Hit the health check endpoint: GET /health or GET /healthz.
- Verify it checks actual dependencies (DB, cache, queue).
- Simulate a dependency failure. Verify health check reflects it.
Health Check Response Template:
{
"status": "healthy",
"version": "1.2.3",
"uptime_seconds": 86400,
"checks": {
"database": {"status": "healthy", "latency_ms": 5},
"redis": {"status": "healthy", "latency_ms": 2},
"storage": {"status": "healthy", "latency_ms": 45},
"queue": {"status": "healthy", "depth": 12}
}
}
Pass Criteria:
- Health endpoint exists and responds within 5 seconds.
- Checks real dependencies (not just returns 200 always).
- Returns structured JSON with per-dependency status.
- Unhealthy dependency changes overall status.
- Load balancer / orchestrator configured to use health endpoint.
TEST-OB-008: Failure Item Identification
Objective: Verify that when items in a batch fail, the specific failing items are identified.
Steps:
- Start a batch operation on 20 items.
- Cause 3 items to fail (bad input, provider rejection, etc.).
- Check logs and UI.
Pass Criteria:
- Failed items are identified by name/ID in logs.
- Failure reason is specific per item (not just "3 items failed").
- Successful items are listed separately.
- User can see which items failed and why.
- Retry targets only the failed items.
Log Format:
{"batch_id": "batch-123", "item_id": "asset-456", "item_name": "hero.png",
"status": "failed", "error": "Provider rejected: content policy violation",
"provider_error_code": "content_filter", "attempt": 2, "request_id": "abc-123"}
Structured Logging Template
{
"timestamp": "2024-01-15T10:30:00.123Z",
"level": "INFO",
"message": "Asset generation completed",
"service": "worker",
"request_id": "req-abc-123",
"job_id": "job-def-456",
"user_id": "user-789",
"project_id": "proj-012",
"asset_id": "asset-345",
"provider": "openai",
"model": "dall-e-3",
"duration_ms": 8500,
"step": "generate_asset",
"metadata": {
"input_tokens": 150,
"output_size_bytes": 524288,
"retry_count": 0
}
}
Required Fields on EVERY Log Entry:
[ ] timestamp (ISO 8601)
[ ] level (ERROR, WARN, INFO, DEBUG)
[ ] message (human-readable description)
[ ] service (which service/worker emitted this)
[ ] request_id / correlation_id
Required Fields on Operation Logs:
[ ] user_id (who triggered this)
[ ] entity IDs (what was operated on)
[ ] duration_ms (how long it took)
[ ] status (success/failure/partial)
Alerting Checklist
Critical (page on-call):
[ ] Service down (health check failing > 2 minutes)
[ ] Error rate > 5% of requests
[ ] External provider failure rate > 10%
[ ] Queue depth growing without processing (stuck)
[ ] Zero successful jobs in past 15 minutes (during business hours)
Warning (notify team channel):
[ ] p95 latency > 2x baseline
[ ] Disk/memory usage > 80%
[ ] Background job failure rate > 5%
[ ] Certificate expiry within 14 days
[ ] Unusual traffic pattern (10x normal)
Dashboard metrics:
[ ] Request rate (RPM)
[ ] Error rate (% and count)
[ ] Latency (p50, p95, p99)
[ ] Queue depth and processing rate
[ ] External API response time and error rate
[ ] Active users / sessions
[ ] Job completion rate
Post-Audit Checklist
[ ] Correlation ID generated on every request and propagated to all steps
[ ] All logs are structured JSON (not plain text)
[ ] Every log entry has timestamp, level, service, request_id
[ ] Error logs include: entity IDs, user ID, full error details, stack trace
[ ] No "Unknown error" messages in user-facing responses
[ ] External API calls logged with: request, response status, duration
[ ] Per-step timing logged for all operations > 100ms
[ ] Health check endpoint exists and tests real dependencies
[ ] Failed batch items identified by name/ID with specific error
[ ] Log levels used correctly (ERROR = real failures only)
[ ] Alerting configured for critical failure modes
[ ] Sensitive data (keys, PII) redacted from logs
[ ] Log retention configured (30+ days for debugging)
[ ] Logs are searchable/filterable by correlation ID
What Earlier Audits Miss
Standard testing verifies that features work. This audit matters because:
- Unit tests catch logic errors but never verify that errors produce useful log output.
- Integration tests verify outcomes but not that the diagnostic trail exists to debug when outcomes are wrong.
- Code reviews check error handling logic but rarely verify that errors include entity IDs, request context, and provider response details.
- QA testing verifies user-facing error messages but never checks what the engineering team sees in logs.
- Security audits verify that sensitive data is protected but not that enough non-sensitive data is logged for debugging.
This would be called an Observability & Debuggability Audit -- specifically testing whether production failures can be diagnosed quickly and completely under real incident conditions using only logs, metrics, and alerts.
Automation Opportunities
| Test | Automatable? | Method |
|---|---|---|
| TEST-OB-001: E2E tracing | PARTIAL | Trigger workflow, search logs for correlation ID, assert all steps present |
| TEST-OB-002: Correlation ID | YES | Middleware test: assert every response includes X-Request-ID header |
| TEST-OB-003: Error quality | PARTIAL | Trigger known errors, assert log messages match quality template |
| TEST-OB-004: Per-step timing | YES | Trigger workflow, parse logs, assert duration fields present |
| TEST-OB-005: External call logging | YES | Mock external call, assert log includes request/response/duration |
| TEST-OB-006: Log level correctness | PARTIAL | Trigger normal + error flows, assert no ERRORs on happy path |
| TEST-OB-007: Health check | YES | Hit /health endpoint, assert structured response with dependency status |
| TEST-OB-008: Failure identification | YES | Fail items in batch, assert failed items identified by ID in logs |
# Automated correlation ID propagation test
RESPONSE=$(curl -s -D - /api/projects)
REQUEST_ID=$(echo "$RESPONSE" | grep -i "x-request-id" | awk '{print $2}' | tr -d '\r')
if [ -z "$REQUEST_ID" ]; then
echo "FAIL: No X-Request-ID in response headers"
else
echo "PASS: Request ID = $REQUEST_ID"
# Verify it appears in logs
LOG_COUNT=$(grep -c "$REQUEST_ID" /var/log/app/*.log 2>/dev/null || echo "0")
echo "Log entries with this ID: $LOG_COUNT"
fi
# Automated health check verification
HEALTH=$(curl -s /health)
echo "$HEALTH" | jq -e '.checks.database.status' > /dev/null && echo "PASS: DB check present" || echo "FAIL: No DB check"
echo "$HEALTH" | jq -e '.checks.redis.status' > /dev/null && echo "PASS: Redis check present" || echo "FAIL: No Redis check"
Reusable Audit Report Template
# Observability & Debuggability Audit Report
## System: _______________
## Date: YYYY-MM-DD
## Auditor: _______________
## Log Infrastructure
| Aspect | Status | Details |
|--------|--------|---------|
| Structured logging | yes/no | Format: JSON/text |
| Correlation IDs | yes/no | Generated at: ___ |
| Log retention | ___ days | Storage: ___ |
| Log search | yes/no | Tool: ___ |
## Test Results
| Test ID | Description | Result | Evidence |
|---------|-------------|--------|----------|
| TEST-OB-001 | E2E tracing | PASS/FAIL | ___ steps traceable out of ___ |
| TEST-OB-002 | Correlation ID | PASS/FAIL | Propagated to: API/jobs/external/all |
| TEST-OB-003 | Error quality | PASS/FAIL | ___ errors are generic |
| TEST-OB-004 | Per-step timing | PASS/FAIL | ___ steps have duration logging |
| TEST-OB-005 | External call detail | PASS/FAIL | ___ providers missing request/response logging |
| TEST-OB-006 | Log levels | PASS/FAIL | ___ ERRORs on happy path |
| TEST-OB-007 | Health check | PASS/FAIL | Dependencies checked: ___ |
| TEST-OB-008 | Failure identification | PASS/FAIL | Failed items identified by ID: yes/no |
## Alerting Coverage
| Failure Mode | Alert Exists? | Tested? | Response Time |
|-------------|-------------|---------|---------------|
| Service down | | | |
| Error rate spike | | | |
| External provider failure | | | |
| Queue depth growing | | | |
## Score: PASS / PARTIAL / FAIL
Priority Targeting
Run this audit FIRST if:
- Debugging production issues requires SSH into servers and grepping raw logs
- The phrase "I can't tell what happened" comes up regularly
- Error messages shown to users are generic
- External service failures take hours to diagnose
- No alerting is configured for critical failures
- Log storage is unstructured text files
Install this skill directly: skilldb add production-audit-skills