SLIs, SLOs, and Error Budgets — Observability

You are an expert in service level indicators, objectives, and error budgets for building observable systems.

Overview

SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets form a framework for quantifying and managing reliability. An SLI measures a specific aspect of service quality from the user's perspective. An SLO sets a target for that SLI over a time window. The error budget is the allowed amount of unreliability (100% minus the SLO target). This framework aligns engineering, product, and business teams around a shared definition of "reliable enough" and provides a principled basis for balancing feature velocity against reliability investment.

Core Concepts

  • SLI (Service Level Indicator): A quantitative measure of service behavior. Expressed as a ratio: good events / total events. Example: proportion of HTTP requests that return in under 500ms with a non-5xx status.
  • SLO (Service Level Objective): A target value for an SLI over a rolling or calendar time window. Example: "99.9% of requests succeed within 500ms, measured over a rolling 30-day window."
  • Error budget: The complement of the SLO target. A 99.9% SLO gives a 0.1% error budget — roughly 43 minutes of total downtime per 30-day window.
  • SLA (Service Level Agreement): A contractual commitment (often with financial penalties) typically set looser than the internal SLO. SLOs are internal engineering targets; SLAs are customer-facing contracts.
  • Burn rate: How fast the error budget is being consumed relative to the expected rate. A burn rate of 1.0 means the budget will be exhausted exactly at the end of the window.
  • Multi-window burn rate alerting: Alerting on error budget consumption using both a fast window (detects sudden spikes) and a slow window (confirms sustained degradation).
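The arithmetic behind these definitions can be sketched in a few lines (an illustration only — the function names are ours, not from any SLO library):

```python
# Error budget arithmetic for an SLO over a rolling window.
# All names here are illustrative, not from any existing tool.

def error_budget_minutes(slo_target: float, window_days: float) -> float:
    """Total allowed 'bad' minutes in the window (budget = 1 - target)."""
    return (1 - slo_target) * window_days * 24 * 60

def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How fast the budget is being consumed vs. the expected rate."""
    return observed_error_rate / (1 - slo_target)

def days_to_exhaustion(rate: float, window_days: float) -> float:
    """At a constant burn rate, the budget runs out after this many days."""
    return window_days / rate

budget = error_budget_minutes(0.999, 30)   # 0.1% of 30 days, ~43.2 minutes
rate = burn_rate(0.0144, 0.999)            # 1.44% errors vs 0.1% budget -> 14.4x
print(budget, rate, days_to_exhaustion(rate, 30))
```

A burn rate of 1.0 corresponds exactly to "exhausted at the end of the window," which is why it is the natural threshold for a ticket rather than a page.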

Implementation Patterns

Defining SLIs in Prometheus

# recording_rules.yml — SLI computation
groups:
  - name: sli-availability
    interval: 1m
    rules:
      # Availability SLI: proportion of non-5xx responses
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)

      # Latency SLI: proportion of requests under 500ms
      - record: sli:http_latency_good:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (service)
          / sum(rate(http_request_duration_seconds_count[5m])) by (service)

  - name: sli-error-budget
    interval: 1m
    rules:
      # Long-window SLI: average of the 5m ratio over 30 days.
      # (Approximation assuming roughly uniform traffic. The 30m, 1h,
      # 6h, and 3d windows referenced by the alerting rules below are
      # recorded the same way.)
      - record: sli:http_availability:ratio_rate30d
        expr: |
          avg_over_time(sli:http_availability:ratio_rate5m[30d])

      # 30-day rolling error budget remaining (availability)
      # Target: 99.9% -> budget = 0.1%
      - record: sli:error_budget_remaining:ratio
        expr: |
          1 - (
            (1 - sli:http_availability:ratio_rate30d)
            / (1 - 0.999)
          )

Multi-window burn rate alerts

This pattern (from the Google SRE Workbook) uses two windows to catch both sudden outages and slow burns while minimizing false positives.

# alerting_rules.yml
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: 14.4x burn rate over 1h, confirmed over 5m
      # Exhausts budget in ~2 days — page immediately
      - alert: SLOHighBurnRate
        expr: |
          (
            1 - sli:http_availability:ratio_rate1h > (14.4 * 0.001)
            and
            1 - sli:http_availability:ratio_rate5m > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          sloth_slo: availability
        annotations:
          summary: "{{ $labels.service }}: high error budget burn rate (1h window)"
          description: "Error budget will be exhausted in ~2 days at the current rate."

      # Medium burn: 6x burn rate over 6h, confirmed over 30m
      # Exhausts budget in ~5 days — page
      - alert: SLOMediumBurnRate
        expr: |
          (
            1 - sli:http_availability:ratio_rate6h > (6 * 0.001)
            and
            1 - sli:http_availability:ratio_rate30m > (6 * 0.001)
          )
        for: 5m
        labels:
          severity: critical

      # Slow burn: 1x burn rate over 3d, confirmed over 6h
      # On track to exhaust budget — ticket / warning
      - alert: SLOSlowBurnRate
        expr: |
          (
            1 - sli:http_availability:ratio_rate3d > (1 * 0.001)
            and
            1 - sli:http_availability:ratio_rate6h > (1 * 0.001)
          )
        for: 30m
        labels:
          severity: warning
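The thresholds in these rules are simply burn rate × budget fraction; a small sketch (illustrative, not tied to any tool) shows how each tier maps to an error-rate threshold and a time-to-exhaustion:

```python
# How the burn-rate tiers above translate into alert thresholds
# for a 99.9% SLO over a 30-day window. Illustrative only.
BUDGET_FRACTION = 1 - 0.999   # 0.1% error budget
WINDOW_DAYS = 30

tiers = [
    # (burn rate, long window, short window, response)
    (14.4, "1h", "5m", "page"),
    (6.0, "6h", "30m", "page"),
    (1.0, "3d", "6h", "ticket"),
]

for rate, long_w, short_w, response in tiers:
    threshold = rate * BUDGET_FRACTION    # error rate that trips the alert
    exhaustion_days = WINDOW_DAYS / rate  # budget gone in this many days
    print(f"{rate:>5}x over {long_w}/{short_w}: error rate > "
          f"{threshold:.4f}, budget exhausted in {exhaustion_days:.1f}d "
          f"-> {response}")
```

This is where the otherwise odd-looking 14.4 comes from: it is the burn rate at which a 30-day budget disappears in roughly two days, fast enough to justify paging.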

SLO document template

# slo-spec.yaml — machine-readable SLO definition
slos:
  - name: api-availability
    service: order-service
    description: "Order API returns non-5xx responses"
    sli:
      type: availability
      good_events: 'http_requests_total{status!~"5.."}'
      total_events: 'http_requests_total'
    objective:
      target: 0.999          # 99.9%
      window: 30d             # rolling 30-day window
    error_budget:
      monthly_minutes: 43.2   # ~43 min per 30 days
    alerting:
      page_burn_rate: 14.4    # pages on-call
      ticket_burn_rate: 1.0   # creates a ticket
    owner: platform-team
    escalation: platform-oncall
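A machine-readable spec like this is easy to sanity-check in CI, e.g. verifying that the stated budget minutes agree with the target and window. A minimal sketch (the checks and function name are ours, not from any existing tool):

```python
# Consistency checks for an SLO spec entry (illustrative only).

def validate_slo(slo: dict) -> list[str]:
    """Return a list of problems; an empty list means the spec is consistent."""
    problems = []
    target = slo["objective"]["target"]
    if not 0 < target < 1:
        problems.append("target must be a fraction strictly between 0 and 1")
    window = slo["objective"]["window"]
    if window.endswith("d"):
        days = float(window[:-1])
        expected = (1 - target) * days * 24 * 60   # allowed bad minutes
        stated = slo["error_budget"]["monthly_minutes"]
        if abs(stated - expected) > 0.5:
            problems.append(
                f"error budget {stated} min disagrees with target "
                f"({expected:.1f} min expected)")
    return problems

spec = {
    "objective": {"target": 0.999, "window": "30d"},
    "error_budget": {"monthly_minutes": 43.2},
}
assert validate_slo(spec) == []
```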

Grafana dashboard queries for SLO tracking

# Current SLI (availability, 30d rolling)
sli:http_availability:ratio_rate30d{service="order-service"}

# Error budget remaining as percentage
sli:error_budget_remaining:ratio{service="order-service"} * 100

# Error budget consumption over time (visualize as a time series)
1 - sli:error_budget_remaining:ratio{service="order-service"}

# Budget burn rate (current hourly rate / expected rate)
(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)

Error budget policy (organizational process)

## Error Budget Policy — Order Service

### When budget is healthy (> 50% remaining)
- Ship features at normal velocity
- Standard change management process

### When budget is concerning (25-50% remaining)
- Halt risky deployments (large refactors, infra migrations)
- Prioritize reliability-related tickets

### When budget is low (< 25% remaining)
- Feature freeze for this service
- All engineering effort directed at reliability improvements
- Postmortem required for any further budget-consuming incidents

### When budget is exhausted (0% remaining)
- Complete feature freeze
- Mandatory reliability sprint
- Escalation to engineering leadership
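The policy tiers lend themselves to automation, such as gating deploys on budget status. A minimal sketch (the thresholds mirror the policy above; the function itself is hypothetical):

```python
def budget_policy(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0-1.0) to the policy tier above."""
    if remaining_fraction > 0.50:
        return "healthy: ship at normal velocity"
    if remaining_fraction > 0.25:
        return "concerning: halt risky deployments"
    if remaining_fraction > 0.0:
        return "low: feature freeze, reliability work only"
    return "exhausted: complete freeze, escalate"

# e.g. wired into a deploy pipeline as a pre-deploy check:
assert budget_policy(0.80).startswith("healthy")
assert budget_policy(0.10).startswith("low")
```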

Core Philosophy

SLOs answer the most important question in reliability engineering: "how reliable is reliable enough?" Without an SLO, the implicit target is 100% uptime — a target that is unachievable, paralyzing, and in conflict with shipping features. An explicit SLO (say, 99.9%) acknowledges that some unreliability is acceptable and quantifies exactly how much. This transforms reliability from an unbounded aspiration into a measurable, manageable budget. The error budget is the operational consequence: a 99.9% SLO gives you 43 minutes of downtime per month to spend on deployments, experiments, and migrations. When the budget is healthy, ship fast. When it is exhausted, stop and fix.

SLIs must measure what users experience, not what the system reports. A server returning HTTP 200 does not mean the user had a good experience — the response might have taken 30 seconds, the CDN might have served a stale error page, or the client might have timed out before the response arrived. The best SLIs are measured at the point closest to the user: at the load balancer, API gateway, or (ideally) in the client itself. Internal metrics like CPU utilization, queue depth, or pod restart count are diagnostic signals, not SLIs. They explain why an SLI is degraded but should never be the SLI itself.

Multi-window burn rate alerting is the correct way to alert on SLO violations, and it took the industry years to figure this out. Alerting directly when the SLI drops below the target (e.g., availability < 99.9%) is too slow for sudden outages (a ten-minute total outage barely moves the 30-day average) and too noisy for gradual drift. Burn rate alerting asks a different question: "how fast are we consuming our error budget?" A 14.4x burn rate means the budget will be exhausted in two days — that demands immediate action. A 1x burn rate means you are on track to just barely exhaust the budget by the end of the window — that warrants a ticket, not a page. This approach detects both sudden spikes and slow burns while minimizing false alerts.

Anti-Patterns

  • Setting SLOs at 100%. A 100% target means zero error budget, which means any deployment that causes even one error is a violation. This makes every change terrifying and every incident a crisis. Set targets based on real user tolerance and historical performance, not aspiration.

  • SLOs without an error budget policy. Defining SLOs and displaying them on dashboards without an organizational agreement on what happens when the budget is exhausted makes them aspirational, not operational. Establish the policy (feature freeze, reliability sprint, deployment restrictions) before you need to invoke it.

  • Alerting on SLI threshold crossings. Firing an alert when availability < 99.9% over a 30-day window responds too slowly to sudden outages (the average barely moves) and too sensitively to gradual drift. Use multi-window burn rate alerting, which detects both fast and slow budget consumption with appropriate urgency.

  • Using server-side metrics as SLIs. A server that returns 200 in 50ms has done its job, but the user might have experienced a 5-second page load due to DNS, CDN, or client-side rendering issues. Measure SLIs at the point closest to the user — ideally at the edge or in the client.

  • Too many SLOs per service. Defining SLOs for every metric and every endpoint dilutes attention and makes it unclear which SLOs actually matter. Start with one availability SLO and one latency SLO per critical service. Add more only when you have a specific reliability question that existing SLOs cannot answer.

Best Practices

  • Choose SLIs that reflect user experience. Availability and latency at the edge (load balancer, API gateway) are almost always the right starting SLIs. Internal queue depth is not an SLI.
  • Start with fewer SLOs. One availability SLO and one latency SLO per critical service is enough to begin. Add more only when needed.
  • Use rolling windows over calendar windows. A rolling 30-day window avoids the "budget reset" effect at month boundaries that encourages risky end-of-month deploys.
  • Set targets based on real user tolerance, not aspiration. Analyze historical performance and user complaints to find the threshold where users notice degradation.
  • Make error budgets visible. Display remaining budget on team dashboards. Integrate budget status into deployment pipelines (block deploys when budget is exhausted).
  • Establish an error budget policy before you need it. Get organizational buy-in on what happens at each budget level before an incident forces the conversation.

Common Pitfalls

  • Setting SLOs at 100%. A 100% target means zero error budget, which means any deployment or config change that causes even one error is a violation. This is unachievable and paralyzing.
  • Using server-side metrics when client-side metrics are available. A server returning 200 does not mean the user received a good response — network errors, timeouts, and CDN issues are invisible to server metrics.
  • Confusing SLOs with SLAs. Setting your internal SLO at the same level as your customer SLA leaves no safety margin. Internal SLOs should be tighter.
  • Ignoring the error budget. Defining SLOs without an enforcement policy makes them aspirational dashboards, not operational tools.
  • Alerting directly on SLI threshold crossings. Alerting when availability < 99.9% is too slow for sudden outages and too noisy for gradual drift. Use multi-window burn rate alerting instead.
