Observability & Debuggability Audit
## Quick Summary
Verify that production failures can be diagnosed quickly and completely. When something goes wrong at 2am, the on-call engineer needs to answer: What failed? Why? Which user? Which request? What was the input? What did the external service return? How long did each step take?
## Key Points
1. Trigger a generation/processing workflow via the UI.
2. Note the time of the request.
3. Search logs for the request.
4. Follow the request through every step.
Pass criteria:
- [ ] Every step is present in logs.
- [ ] The same correlation ID links all steps.
- [ ] An engineer can reconstruct the entire flow from logs alone.
- [ ] Total time from logs matches actual elapsed time.
Red flags:
- Gaps in the trace (steps with no log entry).
- Correlation ID missing from one or more steps.
- Cannot determine which external call was made or what it returned.
- Duration missing for one or more steps.
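The checks above can be partly automated once the log entries for one request have been collected. The following is a minimal sketch, not a definitive tool: it assumes each entry is a dict with hypothetical `request_id`, `step`, and `duration_ms` fields, that steps run sequentially (so their durations add up to the elapsed time), and that the caller supplies the list of expected step names and the elapsed time it measured itself.

```python
def audit_trace(entries, expected_steps, elapsed_ms, tolerance_ms=50.0):
    """Return a list of problems found in the log entries of a single request.

    Field names (request_id, step, duration_ms) are assumptions made for
    this sketch; adapt them to whatever the real log schema uses.
    """
    entries = list(entries)
    problems = []

    # Gaps in the trace: an expected step with no log entry.
    seen_steps = {e.get("step") for e in entries}
    for step in expected_steps:
        if step not in seen_steps:
            problems.append(f"gap: no log entry for step '{step}'")

    # Every entry must carry the same correlation ID.
    ids = {e.get("request_id") for e in entries}
    if None in ids:
        problems.append("correlation ID missing from at least one entry")
    if len(ids - {None}) > 1:
        problems.append("more than one correlation ID in the trace")

    # Every step must record how long it took.
    untimed = [e.get("step") for e in entries if e.get("duration_ms") is None]
    for step in untimed:
        problems.append(f"duration missing for step '{step}'")

    # Logged time should match the elapsed time, assuming sequential steps.
    if not untimed:
        total = sum(e["duration_ms"] for e in entries)
        if abs(total - elapsed_ms) > tolerance_ms:
            problems.append(
                f"logged time {total} ms differs from elapsed {elapsed_ms} ms"
            )

    return problems
```

An empty result means the trace passed these mechanical checks. It does not show that an engineer could reconstruct the flow from the logs alone, which still needs a human read-through.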
## Quick Example
```json
{"batch_id": "batch-123", "item_id": "asset-456", "item_name": "hero.png",
"status": "failed", "error": "Provider rejected: content policy violation",
"provider_error_code": "content_filter", "attempt": 2, "request_id": "abc-123"}
```
```
[ ] timestamp (ISO 8601)
[ ] level (ERROR, WARN, INFO, DEBUG)
[ ] message (human-readable description)
[ ] service (which service/worker emitted this)
[ ] request_id / correlation_id
```
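One way to get every field in the checklist onto every log line is a custom formatter for Python's standard `logging` module. This is a sketch under assumptions: the class name `JsonFormatter`, the helper `make_logger`, the logger name, and the service name `asset-worker` are all hypothetical, and the correlation ID is passed per call through `extra`. Note that the standard library names the warning level `WARNING`, not `WARN`.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the checklist fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            # ISO 8601 timestamp in UTC, taken from the record's creation time.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            # Set via extra={"request_id": ...}; None exposes a missing ID.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


def make_logger(service, stream):
    """Build a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(f"audit_example.{service}")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep the example output out of the root logger
    return logger


# Usage: write one entry to an in-memory buffer and parse it back.
buffer = io.StringIO()
logger = make_logger("asset-worker", buffer)
logger.error(
    "Provider rejected: content policy violation",
    extra={"request_id": "abc-123"},
)
entry = json.loads(buffer.getvalue())
```

Emitting `request_id` as `null` when it is absent, rather than dropping the key, makes a missing correlation ID easy to search for during an audit.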