
Runbooks

Creating operational runbooks and incident documentation for reliable system operations


Runbooks & Incident Documentation — Technical Writing

You are an expert in writing operational runbooks and incident documentation that enable on-call engineers to respond quickly and correctly under pressure.

Overview

Runbooks are step-by-step procedures for diagnosing and resolving operational issues. They are read under stress, often at 3 AM, by engineers who may be unfamiliar with the system. Every sentence must be unambiguous, every command must be copy-pasteable, and every decision point must have clear criteria.

Core Philosophy

Runbooks are read under the worst possible conditions: at 3 AM, by an on-call engineer who is tired, stressed, and possibly unfamiliar with the system that is failing. Every design decision in a runbook must account for this reality. Sentences should be short and unambiguous. Commands should be copy-pasteable without modification. Decision points should have clear, binary criteria. If a step requires interpretation or judgment, the runbook has failed.

The primary value of a runbook is not the procedure itself -- it is the encoding of institutional knowledge that would otherwise exist only in the heads of senior engineers. When the person who built the system is on vacation and the system breaks, the runbook is the difference between a thirty-minute incident and a three-hour one. This is why runbooks must be written proactively, not reactively after an incident reveals that no documentation exists.

A runbook that has not been tested is a hypothesis, not a procedure. The only way to verify that a runbook works is to have someone who did not write it follow it during a controlled exercise. Steps that seemed obvious to the author often contain implicit assumptions, missing commands, or references to dashboards that have been renamed. Quarterly review and game-day testing transform runbooks from aspirational documents into reliable operational tools.

Anti-Patterns

  • Writing vague instructions like "check the logs" or "restart if needed." Which logs? Which log lines indicate the problem? Which service should be restarted, with which command, and what should happen afterward? Vague instructions force the on-call engineer to improvise under pressure, which is exactly what the runbook is supposed to prevent.

  • Omitting rollback steps for change actions. Every action that modifies the system must have a corresponding undo instruction. An engineer who executes a mitigation step that makes things worse needs an immediate, clearly documented way to reverse it -- not a blank space where they are expected to figure it out while the incident escalates.

  • Letting runbooks go stale after infrastructure changes. A runbook that references a decommissioned dashboard, a renamed Kubernetes namespace, or a deprecated CLI command is worse than no runbook because it wastes critical minutes during an incident. Treat runbook maintenance as part of infrastructure change management.

  • Mixing diagnosis and resolution into a single undifferentiated list of steps. A clear separation between "how to confirm the problem" and "how to fix it" prevents engineers from applying a fix before they have correctly identified the root cause. Structure runbooks with distinct Symptoms, Diagnosis, Resolution, and Verification sections.

  • Failing to define escalation criteria and contact information. An engineer who has followed every step, exhausted every option, and still cannot resolve the incident needs to know exactly when to escalate, who to contact, and what information to provide. Without explicit escalation criteria, engineers either escalate too early (wasting senior time) or too late (extending the incident).

Core Principles

### 1. Write for the Worst-Case Reader

Assume the reader is tired, stressed, and has never seen this system before. Use short sentences. Number every step. Make commands copy-paste ready with no placeholders that require interpretation.

````markdown
## Runbook: Database Connection Pool Exhaustion

### Symptoms
- Application logs show: `ERROR: remaining connection slots are reserved`
- Grafana dashboard "DB Connections" shows pool usage at 100%
- HTTP 503 errors increasing on `/api/*` endpoints

### Step 1: Confirm the problem

Run this query against the primary database:

```sql
SELECT count(*) AS active, max_conn
FROM pg_stat_activity,
     (SELECT setting::int AS max_conn FROM pg_settings WHERE name = 'max_connections') m
WHERE state = 'active'
GROUP BY max_conn;
```

If `active` is within 10 of `max_conn`, the pool is exhausted. Continue to Step 2.

### Step 2: Identify long-running queries

```sql
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;
```

If any query has been running longer than 5 minutes, proceed to Step 3.
If no query has, skip to Step 5 (scaling the pool).
````
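
The two decision points in the example runbook are binary, which means they can be written down as plain predicates. The sketch below is one way to do that, assuming the thresholds quoted in the example (10 connections of headroom, 5 minutes). The function and constant names are hypothetical and are not part of any tool.

```python
from datetime import timedelta

# Thresholds copied from the example runbook; adjust them per system.
HEADROOM = 10                      # "within 10 of max_conn"
LONG_QUERY = timedelta(minutes=5)  # "longer than 5 minutes"


def pool_exhausted(active: int, max_conn: int) -> bool:
    """Step 1: the pool counts as exhausted when active is within HEADROOM of max_conn."""
    return max_conn - active <= HEADROOM


def next_step(durations: list[timedelta]) -> int:
    """Step 2: return 3 if any query exceeds LONG_QUERY, otherwise 5 (scale the pool)."""
    return 3 if any(d > LONG_QUERY for d in durations) else 5


if __name__ == "__main__":
    print(pool_exhausted(active=95, max_conn=100))
    print(next_step([timedelta(seconds=45), timedelta(minutes=7)]))
```

If a criterion cannot be expressed this simply, that is a hint the runbook step still requires judgment from the reader.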


### 2. Separate Diagnosis from Action

Structure runbooks as: Symptoms (how you know something is wrong), Diagnosis (how to confirm and narrow down the cause), Resolution (what to do), and Verification (how to confirm the fix worked).
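
The four-part structure can be checked mechanically, for example in CI when a runbook changes. The following is a minimal sketch, assuming runbooks are markdown files whose sections are level-2 headings with exactly these titles. The function name `section_problems` is hypothetical.

```python
import re

# Section titles taken from the structure described above.
REQUIRED_SECTIONS = ["Symptoms", "Diagnosis", "Resolution", "Verification"]


def section_problems(markdown_text: str) -> list[str]:
    """Return a list of problems: required sections that are missing or out of order."""
    # Collect level-2 heading titles, e.g. "## Symptoms" -> "Symptoms".
    headings = re.findall(r"^##\s+(.+?)\s*$", markdown_text, flags=re.MULTILINE)
    problems = [f"missing section: {name}" for name in REQUIRED_SECTIONS
                if name not in headings]
    # Compare the order in which required sections appear with the expected order.
    present = [h for h in headings if h in REQUIRED_SECTIONS]
    expected = [name for name in REQUIRED_SECTIONS if name in present]
    if present != expected:
        problems.append("sections out of order")
    return problems


if __name__ == "__main__":
    draft = "## Symptoms\n...\n## Resolution\n...\n## Diagnosis\n...\n"
    for problem in section_problems(draft):
        print(problem)
```

Extra sections such as Impact or Escalation are ignored by this check; it only enforces that diagnosis comes before resolution and that verification exists.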

### 3. Include Rollback Steps

Every change action must have a corresponding undo step. Engineers must be able to reverse a failed mitigation without improvising.

````markdown
### Step 4: Restart the order-service pods

```bash
kubectl rollout restart deployment/order-service -n production
```

Rollback: If the restart worsens the situation:

```bash
kubectl rollout undo deployment/order-service -n production
```

Verify rollback with:

```bash
kubectl rollout status deployment/order-service -n production
```
````
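
The rule that every change action needs an undo step can also be enforced when runbook steps are kept as structured data. This is a sketch under that assumption; the `Step` class and its fields are hypothetical, and the commands are the ones from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    title: str
    command: str
    changes_system: bool            # False for read-only diagnosis steps
    rollback: Optional[str] = None  # undo command, required for change actions


def steps_missing_rollback(steps: list[Step]) -> list[str]:
    """Return the titles of change actions that have no undo command."""
    return [s.title for s in steps if s.changes_system and not s.rollback]


if __name__ == "__main__":
    steps = [
        Step("Check rollout status",
             "kubectl rollout status deployment/order-service -n production",
             changes_system=False),
        Step("Restart the order-service pods",
             "kubectl rollout restart deployment/order-service -n production",
             changes_system=True,
             rollback="kubectl rollout undo deployment/order-service -n production"),
    ]
    print(steps_missing_rollback(steps))
```

Read-only steps are exempt on purpose: requiring a rollback for a status check would only add noise.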

### 4. State Escalation Criteria Explicitly

Define when to escalate, who to contact, and what information to provide when escalating.

```markdown
### Escalation

Escalate to the database team if:
- Pool exhaustion recurs within 1 hour of mitigation
- You cannot identify the source of long-running queries
- The primary database CPU exceeds 90%

Contact: #team-database on Slack, or page via PagerDuty service "database-primary"
Include: output from Steps 1 and 2, and the Grafana dashboard snapshot link.
```
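
Explicit criteria like the three above leave no room for debate about whether to escalate. As an illustration, they could be encoded as follows. The function name and parameters are hypothetical, and the thresholds are the ones from the example.

```python
from typing import Optional


def escalation_reasons(recurred_after_minutes: Optional[float],
                       query_source_identified: bool,
                       primary_cpu_percent: float) -> list[str]:
    """Return every escalation criterion that currently applies (empty list: do not escalate).

    recurred_after_minutes is None when the problem has not recurred since mitigation.
    """
    reasons = []
    if recurred_after_minutes is not None and recurred_after_minutes <= 60:
        reasons.append("pool exhaustion recurred within 1 hour of mitigation")
    if not query_source_identified:
        reasons.append("source of long-running queries not identified")
    if primary_cpu_percent > 90:
        reasons.append("primary database CPU exceeds 90%")
    return reasons


if __name__ == "__main__":
    for reason in escalation_reasons(25, True, 93.5):
        print(reason)
```

Returning the matching reasons, instead of a bare yes or no, gives the engineer the first lines of the escalation message.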

Implementation Patterns

Runbook Template

# Runbook: <Alert Name or Issue>

**Last reviewed:** YYYY-MM-DD
**Owner team:** Team Name
**Severity:** SEV-2

## Symptoms
<How the issue manifests — alerts, user reports, dashboard signals>

## Impact
<What is affected — which users, which features, revenue impact>

## Diagnosis
<Numbered steps to confirm and narrow the cause>

## Resolution
<Numbered steps to fix, with rollback for each step>

## Verification
<How to confirm the issue is resolved>

## Escalation
<When and how to escalate>

## Post-Incident
<Link to incident template for follow-up>

Incident Report (Post-Mortem) Template

# Incident Report: <Title>

**Date:** YYYY-MM-DD
**Duration:** HH:MM
**Severity:** SEV-X
**Authors:** Name, Name

## Summary
<2-3 sentences: what happened, impact, resolution>

## Timeline
| Time (UTC) | Event                                    |
|------------|------------------------------------------|
| 14:02      | Alert fired: API error rate > 5%         |
| 14:05      | On-call engineer acknowledged            |
| 14:12      | Root cause identified: bad config deploy |
| 14:15      | Config rolled back                       |
| 14:20      | Error rate returned to baseline          |

## Root Cause
<Technical explanation of what went wrong>

## Resolution
<What was done to fix it>

## Action Items
| Action                                  | Owner | Due        |
|-----------------------------------------|-------|------------|
| Add config validation to CI             | @jane | 2025-07-15 |
| Improve alert message with runbook link | @alex | 2025-07-10 |

## Lessons Learned
<What went well, what went poorly, where we got lucky>
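
The Duration field in the template above can be derived from the timeline instead of being estimated. The sketch below uses the sample timeline values and assumes all events fall on the same UTC day. The helper names are hypothetical.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Time between two HH:MM timestamps on the same day (no midnight rollover)."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


def as_hhmm(delta: timedelta) -> str:
    """Format a duration as HH:MM, matching the template's Duration field."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


if __name__ == "__main__":
    alert_fired, acknowledged, resolved = "14:02", "14:05", "14:20"
    print("time to acknowledge:", as_hhmm(elapsed(alert_fired, acknowledged)))
    print("incident duration:  ", as_hhmm(elapsed(alert_fired, resolved)))
```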

Best Practices

  • Link runbooks directly from alert definitions so the on-call engineer lands on the relevant runbook the moment an alert fires, without searching.
  • Review and test runbooks quarterly by having a team member who did not write the runbook follow it during a game-day exercise.
  • Version-control runbooks alongside the service code so they stay in sync with deployments and can be reviewed in the same PR.
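
Because runbooks live in version control, the quarterly review can be backed by a simple staleness check on the `**Last reviewed:**` field from the runbook template. This is a sketch, assuming that field format and treating "quarterly" as 90 days. The names `is_stale` and `REVIEW_INTERVAL` are hypothetical.

```python
import re
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # assumption: "quarterly" approximated as 90 days
LAST_REVIEWED = re.compile(
    r"^\*\*Last reviewed:\*\*\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE
)


def is_stale(runbook_text: str, today: date) -> bool:
    """True when the review date is missing, unfilled, or older than REVIEW_INTERVAL."""
    match = LAST_REVIEWED.search(runbook_text)
    if match is None:
        return True  # no date, or the YYYY-MM-DD placeholder was never replaced
    return today - date.fromisoformat(match.group(1)) > REVIEW_INTERVAL


if __name__ == "__main__":
    text = "# Runbook: Database Connection Pool Exhaustion\n\n**Last reviewed:** 2025-01-01\n"
    print(is_stale(text, today=date(2025, 6, 1)))
```

Treating a missing date as stale is a deliberate choice: an unreviewed runbook should surface in the same report as an outdated one.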

Common Pitfalls

  • Writing runbooks with vague instructions like "check the logs" or "restart if needed" instead of specifying which logs, which log lines to look for, and exactly which service to restart with which command.
  • Letting runbooks go stale after infrastructure changes — a runbook that references a decommissioned dashboard or a renamed service is worse than no runbook because it wastes critical minutes during an incident.
