SLIs, SLOs, and Error Budgets — Observability
SLI, SLO, and error budget patterns for defining and managing service reliability targets
You are an expert in service level indicators, objectives, and error budgets for building observable systems.
Overview
SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets form a framework for quantifying and managing reliability. An SLI measures a specific aspect of service quality from the user's perspective. An SLO sets a target for that SLI over a time window. The error budget is the allowed amount of unreliability (100% minus the SLO target). This framework aligns engineering, product, and business teams around a shared definition of "reliable enough" and provides a principled basis for balancing feature velocity against reliability investment.
Core Concepts
- SLI (Service Level Indicator): A quantitative measure of service behavior, expressed as a ratio: good events / total events. Example: the proportion of HTTP requests that return in under 500ms with a non-5xx status.
- SLO (Service Level Objective): A target value for an SLI over a rolling or calendar time window. Example: "99.9% of requests succeed within 500ms, measured over a rolling 30-day window."
- Error budget: The complement of the SLO target. A 99.9% SLO gives a 0.1% error budget — roughly 43 minutes of total downtime per 30-day window.
- SLA (Service Level Agreement): A contractual commitment (often with financial penalties) typically set looser than the internal SLO. SLOs are internal engineering targets; SLAs are customer-facing contracts.
- Burn rate: How fast the error budget is being consumed relative to the expected rate. A burn rate of 1.0 means the budget will be exhausted exactly at the end of the window.
- Multi-window burn rate alerting: Alerting on error budget consumption using both a fast window (detects sudden spikes) and a slow window (confirms sustained degradation).
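The arithmetic behind these definitions is simple enough to sketch directly; the numbers below are chosen to match the examples above (99.9% target, 30-day window):

```python
# Error budget and burn rate arithmetic for a 99.9% SLO over 30 days.
WINDOW_DAYS = 30
SLO_TARGET = 0.999

error_budget = 1 - SLO_TARGET                          # 0.1% of events may fail
budget_minutes = WINDOW_DAYS * 24 * 60 * error_budget  # ~43.2 minutes of downtime

# Burn rate: observed error rate relative to the budgeted rate.
observed_error_rate = 0.0144                           # 1.44% of requests failing
burn_rate = observed_error_rate / error_budget         # ~14.4x

# At this rate the budget is exhausted in WINDOW_DAYS / burn_rate days.
days_to_exhaustion = WINDOW_DAYS / burn_rate           # ~2.1 days
```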
Implementation Patterns
Defining SLIs in Prometheus
```yaml
# recording_rules.yml — SLI computation
groups:
  - name: sli-availability
    interval: 1m
    rules:
      # Availability SLI: proportion of non-5xx responses
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
      # Latency SLI: proportion of requests under 500ms
      - record: sli:http_latency_good:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (service)
          / sum(rate(http_request_duration_seconds_count[5m])) by (service)
  - name: sli-error-budget
    interval: 1m
    rules:
      # 30-day rolling error budget remaining (availability)
      # Target: 99.9% -> budget = 0.1%
      # Assumes sli:http_availability:ratio_rate30d is recorded over a [30d]
      # range, analogous to the 5m rule above.
      - record: sli:error_budget_remaining:ratio
        expr: |
          1 - (
            (1 - sli:http_availability:ratio_rate30d)
            / (1 - 0.999)
          )
```
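The budget-remaining formula in that rule can be sanity-checked in isolation. A minimal sketch, using the same 99.9% target as the rule:

```python
SLO_TARGET = 0.999

def budget_remaining(availability: float) -> float:
    """Error budget remaining: 1.0 = untouched, 0.0 = spent, negative = overspent."""
    return 1 - (1 - availability) / (1 - SLO_TARGET)
```

Perfect availability leaves the full budget (1.0), availability exactly at the target leaves none (0.0), and 99.95% availability leaves half the budget (0.5).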
Multi-window burn rate alerts
This pattern (from the Google SRE Workbook) uses two windows to catch both sudden outages and slow burns while minimizing false positives.
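One way to see why these particular (burn rate, window) pairs are chosen: each corresponds to a fixed fraction of the 30-day budget consumed before the alert fires. A sketch of that arithmetic:

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window

def budget_consumed(burn_rate: float, alert_window_hours: float) -> float:
    """Fraction of the total error budget consumed over the alert window
    if errors arrive at the given burn rate."""
    return burn_rate * alert_window_hours / WINDOW_HOURS

fast   = budget_consumed(14.4, 1)   # 14.4x over 1h -> 2% of budget
medium = budget_consumed(6.0, 6)    # 6x over 6h    -> 5% of budget
slow   = budget_consumed(1.0, 72)   # 1x over 3d    -> 10% of budget
```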
```yaml
# alerting_rules.yml
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: 14.4x burn rate over 1h, confirmed over 5m
      # Exhausts budget in ~2 days — page immediately
      - alert: SLOHighBurnRate
        expr: |
          (
            1 - sli:http_availability:ratio_rate1h > (14.4 * 0.001)
            and
            1 - sli:http_availability:ratio_rate5m > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          sloth_slo: availability
        annotations:
          summary: "{{ $labels.service }}: high error budget burn rate (1h window)"
          description: "Error budget will be exhausted in ~2 days at the current rate."
      # Medium burn: 6x burn rate over 6h, confirmed over 30m
      # Exhausts budget in ~5 days — page
      - alert: SLOMediumBurnRate
        expr: |
          (
            1 - sli:http_availability:ratio_rate6h > (6 * 0.001)
            and
            1 - sli:http_availability:ratio_rate30m > (6 * 0.001)
          )
        for: 5m
        labels:
          severity: critical
      # Slow burn: 1x burn rate over 3d, confirmed over 6h
      # On track to exhaust budget — ticket / warning
      - alert: SLOSlowBurnRate
        expr: |
          (
            1 - sli:http_availability:ratio_rate3d > (1 * 0.001)
            and
            1 - sli:http_availability:ratio_rate6h > (1 * 0.001)
          )
        for: 30m
        labels:
          severity: warning
```
SLO document template
```yaml
# slo-spec.yaml — machine-readable SLO definition
slos:
  - name: api-availability
    service: order-service
    description: "Order API returns non-5xx responses"
    sli:
      type: availability
      good_events: 'http_requests_total{status!~"5.."}'
      total_events: 'http_requests_total'
    objective:
      target: 0.999           # 99.9%
      window: 30d             # rolling 30-day window
    error_budget:
      monthly_minutes: 43.2   # ~43 min per 30 days
    alerting:
      page_burn_rate: 14.4    # pages on-call
      ticket_burn_rate: 1.0   # creates a ticket
    owner: platform-team
    escalation: platform-oncall
```
Grafana dashboard queries for SLO tracking
```promql
# Current SLI (availability, 30d rolling)
sli:http_availability:ratio_rate30d{service="order-service"}

# Error budget remaining as percentage
sli:error_budget_remaining:ratio{service="order-service"} * 100

# Error budget consumption over time (visualize as a time series)
1 - sli:error_budget_remaining:ratio{service="order-service"}

# Budget burn rate (current hourly rate / expected rate)
(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)
```
Error budget policy (organizational process)
```markdown
## Error Budget Policy — Order Service

### When budget is healthy (> 50% remaining)
- Ship features at normal velocity
- Standard change management process

### When budget is concerning (25-50% remaining)
- Halt risky deployments (large refactors, infra migrations)
- Prioritize reliability-related tickets

### When budget is low (< 25% remaining)
- Feature freeze for this service
- All engineering effort directed at reliability improvements
- Postmortem required for any further budget-consuming incidents

### When budget is exhausted (0% remaining)
- Complete feature freeze
- Mandatory reliability sprint
- Escalation to engineering leadership
```
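Tooling (deploy gates, dashboards, bots) can encode these tiers directly. A minimal sketch, assuming remaining budget is expressed as a 0.0–1.0 fraction:

```python
def policy_tier(budget_remaining: float) -> str:
    """Map remaining error budget fraction to the policy tiers above."""
    if budget_remaining <= 0.0:
        return "exhausted"    # complete feature freeze, reliability sprint
    if budget_remaining < 0.25:
        return "low"          # feature freeze for this service
    if budget_remaining <= 0.50:
        return "concerning"   # halt risky deployments
    return "healthy"          # ship at normal velocity
```

A deployment pipeline could, for example, refuse risky deploys whenever the tier is anything other than "healthy".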
Core Philosophy
SLOs answer the most important question in reliability engineering: "how reliable is reliable enough?" Without an SLO, the implicit target is 100% uptime — a target that is unachievable, paralyzing, and in conflict with shipping features. An explicit SLO (say, 99.9%) acknowledges that some unreliability is acceptable and quantifies exactly how much. This transforms reliability from an unbounded aspiration into a measurable, manageable budget. The error budget is the operational consequence: a 99.9% SLO gives you 43 minutes of downtime per month to spend on deployments, experiments, and migrations. When the budget is healthy, ship fast. When it is exhausted, stop and fix.
SLIs must measure what users experience, not what the system reports. A server returning HTTP 200 does not mean the user had a good experience — the response might have taken 30 seconds, the CDN might have served a stale error page, or the client might have timed out before the response arrived. The best SLIs are measured at the point closest to the user: at the load balancer, API gateway, or (ideally) in the client itself. Internal metrics like CPU utilization, queue depth, or pod restart count are diagnostic signals, not SLIs. They explain why an SLI is degraded but should never be the SLI itself.
Multi-window burn rate alerting is the correct way to alert on SLO violations, and it took the industry years to figure this out. Alerting directly when the SLI drops below the target (e.g., availability < 99.9%) is too slow for sudden outages (a ten-minute total outage barely moves the 30-day average) and too noisy for gradual drift. Burn rate alerting asks a different question: "how fast are we consuming our error budget?" A 14.4x burn rate means the budget will be exhausted in two days — that demands immediate action. A 1x burn rate means you are on track to just barely exhaust the budget by the end of the window — that warrants a ticket, not a page. This approach detects both sudden spikes and slow burns while minimizing false alerts.
Anti-Patterns
- Setting SLOs at 100%. A 100% target means zero error budget, which means any deployment that causes even one error is a violation. This makes every change terrifying and every incident a crisis. Set targets based on real user tolerance and historical performance, not aspiration.
- SLOs without an error budget policy. Defining SLOs and displaying them on dashboards without an organizational agreement on what happens when the budget is exhausted makes them aspirational, not operational. Establish the policy (feature freeze, reliability sprint, deployment restrictions) before you need to invoke it.
- Alerting on SLI threshold crossings. Firing an alert when `availability < 99.9%` over a 30-day window responds too slowly to sudden outages (the average barely moves) and too sensitively to gradual drift. Use multi-window burn rate alerting, which detects both fast and slow budget consumption with appropriate urgency.
- Using server-side metrics as SLIs. A server that returns 200 in 50ms has done its job, but the user might have experienced a 5-second page load due to DNS, CDN, or client-side rendering issues. Measure SLIs at the point closest to the user — ideally at the edge or in the client.
- Too many SLOs per service. Defining SLOs for every metric and every endpoint dilutes attention and makes it unclear which SLOs actually matter. Start with one availability SLO and one latency SLO per critical service. Add more only when you have a specific reliability question that existing SLOs cannot answer.
Best Practices
- Choose SLIs that reflect user experience. Availability and latency at the edge (load balancer, API gateway) are almost always the right starting SLIs. Internal queue depth is not an SLI.
- Start with fewer SLOs. One availability SLO and one latency SLO per critical service is enough to begin. Add more only when needed.
- Use rolling windows over calendar windows. A rolling 30-day window avoids the "budget reset" effect at month boundaries that encourages risky end-of-month deploys.
- Set targets based on real user tolerance, not aspiration. Analyze historical performance and user complaints to find the threshold where users notice degradation.
- Make error budgets visible. Display remaining budget on team dashboards. Integrate budget status into deployment pipelines (block deploys when budget is exhausted).
- Establish an error budget policy before you need it. Get organizational buy-in on what happens at each budget level before an incident forces the conversation.
Common Pitfalls
- Setting SLOs at 100%. A 100% target means zero error budget, which means any deployment or config change that causes even one error is a violation. This is unachievable and paralyzing.
- Using server-side metrics when client-side metrics are available. A server returning 200 does not mean the user received a good response — network errors, timeouts, and CDN issues are invisible to server metrics.
- Confusing SLOs with SLAs. Setting your internal SLO at the same level as your customer SLA leaves no safety margin. Internal SLOs should be tighter.
- Ignoring the error budget. Defining SLOs without an enforcement policy makes them aspirational dashboards, not operational tools.
- Alerting directly on SLI threshold crossings. Alerting when `availability < 99.9%` is too slow for sudden outages and too noisy for gradual drift. Use multi-window burn rate alerting instead.
Related Skills
Alerting Strategies
On-call alerting strategies for actionable, low-noise alert systems that reduce fatigue and improve response times
Distributed Tracing
OpenTelemetry distributed tracing patterns for end-to-end request visibility across microservices
Health Checks
Health check endpoint patterns for liveness, readiness, and startup probes in distributed services
Incident Response
Incident response and postmortem patterns for structured handling, communication, and learning from production incidents
Log Aggregation
Centralized log aggregation patterns for collecting, indexing, and querying logs across distributed systems
Metrics Collection
Prometheus and Grafana metrics collection patterns for monitoring application and infrastructure health