# Health Checks
Health check endpoint patterns for liveness, readiness, and startup probes in distributed services
You are an expert in health check patterns for building observable systems.
## Overview
Health checks are lightweight endpoints that report whether a service is alive, ready to accept traffic, and whether its dependencies are reachable. They are the foundation of automated recovery (container orchestrators restart unhealthy instances), load balancer routing (remove unhealthy backends), and deployment safety (readiness gates). A missing or poorly designed health check leads to traffic being sent to broken instances or healthy instances being needlessly restarted.
## Core Concepts

- **Liveness probe**: Answers "is this process stuck or deadlocked?" A failing liveness probe triggers a container restart. Should check only the process itself, not external dependencies.
- **Readiness probe**: Answers "can this instance serve traffic right now?" A failing readiness probe removes the instance from the load balancer but does not restart it. Should verify that required dependencies (database, cache) are reachable.
- **Startup probe**: Answers "has this process finished initializing?" Used for slow-starting applications to prevent liveness probes from killing a container that is still loading data or warming caches.
- **Dependency check**: A sub-check within the readiness probe that verifies connectivity to a specific downstream system (database, message broker, external API).
- **Health check aggregation**: Combining individual component checks into an overall status with detail (healthy, degraded, unhealthy).
## Implementation Patterns
### Python — FastAPI health endpoints

```python
from fastapi import FastAPI, Response
from enum import Enum
import asyncpg
import aioredis  # in redis-py >= 4.2, use: from redis import asyncio as aioredis
import time

app = FastAPI()

class Status(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

# Liveness: only checks if the process is responsive
@app.get("/healthz")
async def liveness():
    return {"status": "ok"}

# Readiness: checks critical dependencies
@app.get("/readyz")
async def readiness(response: Response):
    checks = {}
    overall = Status.HEALTHY

    # Database check (critical: failure marks the instance unhealthy)
    try:
        start = time.perf_counter()
        conn = await asyncpg.connect(dsn="postgresql://localhost/mydb", timeout=2)
        await conn.fetchval("SELECT 1")
        await conn.close()
        checks["database"] = {
            "status": Status.HEALTHY,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        }
    except Exception as e:
        checks["database"] = {"status": Status.UNHEALTHY, "error": str(e)}
        overall = Status.UNHEALTHY

    # Redis check (optional: failure only degrades the instance)
    try:
        start = time.perf_counter()
        redis = aioredis.from_url("redis://localhost", socket_timeout=2)
        await redis.ping()
        await redis.close()
        checks["redis"] = {
            "status": Status.HEALTHY,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        }
    except Exception as e:
        checks["redis"] = {"status": Status.DEGRADED, "error": str(e)}
        if overall == Status.HEALTHY:
            overall = Status.DEGRADED

    if overall == Status.UNHEALTHY:
        response.status_code = 503
    elif overall == Status.DEGRADED:
        response.status_code = 200  # still serve traffic, but signal degradation
    return {"status": overall, "checks": checks}
```
### Go — standard library health server

```go
package main

import (
    "context"
    "database/sql"
    "encoding/json"
    "log"
    "net/http"
    "time"

    _ "github.com/lib/pq"
)

type HealthResponse struct {
    Status string                 `json:"status"`
    Checks map[string]CheckResult `json:"checks,omitempty"`
}

type CheckResult struct {
    Status    string  `json:"status"`
    LatencyMs float64 `json:"latency_ms,omitempty"`
    Error     string  `json:"error,omitempty"`
}

var db *sql.DB

func livenessHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(HealthResponse{Status: "ok"})
}

func readinessHandler(w http.ResponseWriter, r *http.Request) {
    checks := make(map[string]CheckResult)
    overall := "healthy"

    ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
    defer cancel()

    start := time.Now()
    if err := db.PingContext(ctx); err != nil {
        checks["database"] = CheckResult{Status: "unhealthy", Error: err.Error()}
        overall = "unhealthy"
    } else {
        checks["database"] = CheckResult{
            Status: "healthy",
            // microsecond resolution so sub-millisecond pings don't read as 0
            LatencyMs: float64(time.Since(start).Microseconds()) / 1000.0,
        }
    }

    w.Header().Set("Content-Type", "application/json")
    if overall == "unhealthy" {
        w.WriteHeader(http.StatusServiceUnavailable)
    }
    json.NewEncoder(w).Encode(HealthResponse{Status: overall, Checks: checks})
}

func main() {
    var err error
    db, err = sql.Open("postgres", "postgresql://localhost/mydb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    mux := http.NewServeMux()
    mux.HandleFunc("/healthz", livenessHandler)
    mux.HandleFunc("/readyz", readinessHandler)
    log.Fatal(http.ListenAndServe(":8081", mux))
}
```
### Kubernetes probe configuration

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  selector:          # selector/labels are required for a valid Deployment
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: order-service:1.4.2
          ports:
            - containerPort: 8080
            - containerPort: 8081  # health port, separate from app traffic
          # Startup probe: give slow-starting apps up to 5 minutes (30 x 10s)
          startupProbe:
            httpGet:
              path: /healthz
              port: 8081
            failureThreshold: 30
            periodSeconds: 10
          # Liveness probe: restart if the process is stuck
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 0
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          # Readiness probe: remove from service if dependencies are down
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
            successThreshold: 1
```
### Load balancer health check (AWS ALB example)

```hcl
# Terraform
resource "aws_lb_target_group" "api" {
  name     = "api-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    path                = "/readyz"
    port                = "8081"
    protocol            = "HTTP"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
    matcher             = "200"
  }
}
```
## Core Philosophy
Health checks are the contract between your application and the infrastructure that manages it. A liveness check says "I am not stuck — do not restart me." A readiness check says "I can serve traffic right now — send me requests." These are fundamentally different questions with fundamentally different consequences for getting the answer wrong. A liveness check that is too sensitive restarts healthy containers during transient dependency issues, causing cascading failures. A readiness check that is too lenient sends traffic to instances that cannot process it, causing user-facing errors. Getting this contract right is one of the highest-leverage reliability investments you can make.
The most important principle in health check design is that liveness checks must be trivial and dependency-free. A liveness check that queries the database will trigger container restarts when the database is slow — exactly the moment when restarting containers makes the situation worse by adding connection churn to an already overwhelmed database. Liveness should verify only that the process is responsive (can it handle an HTTP request? is the event loop running?). Dependency health belongs exclusively in the readiness check, where a failure removes the instance from the load balancer without destroying it.
Health checks should be informative, not just binary. A readiness endpoint that returns a bare 200 or 503 tells the orchestrator what to do but tells operators nothing about why. A structured JSON response with individual dependency statuses, latency measurements, and version information turns the health endpoint into a lightweight diagnostic tool. When an instance is failing readiness, the response itself should tell you whether the database is unreachable, the cache is slow, or a downstream API is returning errors — without requiring anyone to SSH into the container or search through logs.
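The diagnostic payload described above can be sketched as a small helper around the FastAPI example's `checks` dict; a minimal sketch, where the `version` and `uptime_s` fields and the `APP_VERSION` env var name are illustrative additions, not part of the original examples:

```python
import os
import time

# Recorded once at process start so the payload can report uptime
# (illustrative; any module-level timestamp works).
STARTED_AT = time.time()

def build_health_payload(overall: str, checks: dict) -> dict:
    """Assemble a diagnostic readiness payload. Fields beyond
    'status' and 'checks' are illustrative, not a fixed contract."""
    return {
        "status": overall,
        "checks": checks,
        "version": os.environ.get("APP_VERSION", "unknown"),
        "uptime_s": round(time.time() - STARTED_AT, 1),
    }
```

Returning this from the readiness handler means a failing instance's own response tells operators which dependency broke and what build is running.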
## Anti-Patterns

- **Dependency checks in the liveness probe.** When the database goes down and every pod's liveness check fails because it queries the database, the orchestrator restarts all pods simultaneously. The pods come back, still cannot reach the database, and enter a crash loop. This turns a database issue into a complete application outage.
- **No startup probe for slow services.** Java applications, services that load ML models, or services that warm large caches can take minutes to initialize. Without a startup probe, the liveness probe's tight failure threshold kills the container before it finishes starting, creating an infinite restart loop.
- **Binary health responses.** Returning only 200 or 503 without any body content means operators must consult logs and dashboards to understand why a service is unhealthy. Return structured JSON with individual check results and latency measurements so the health endpoint itself serves as a diagnostic tool.
- **Health checks behind authentication middleware.** If the health endpoint requires a valid auth token and the auth service is down, health checks fail even though the service itself is functional. Serve health endpoints on a separate port or path that bypasses authentication and rate limiting.
- **Hard-failing readiness on optional dependencies.** Marking the service as unhealthy because a non-critical dependency (recommendations engine, analytics service) is down removes the service from the load balancer entirely, even though it could still serve its core function. Classify dependencies as critical or optional and report optional failures as "degraded," not "unhealthy."
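The critical-versus-optional classification can be sketched as a small aggregation step; the `CRITICAL` set and check names are illustrative:

```python
# Illustrative: the dependencies the service cannot serve without.
CRITICAL = {"database"}

def aggregate(checks: dict) -> str:
    """Reduce per-dependency results to an overall status.
    A failing critical dependency -> "unhealthy" (503, remove from
    the load balancer); a failing optional dependency -> "degraded"
    (200, keep serving)."""
    overall = "healthy"
    for name, result in checks.items():
        if result["status"] == "healthy":
            continue
        if name in CRITICAL:
            return "unhealthy"
        overall = "degraded"
    return overall
```

This keeps the decision in one place: the readiness handler maps "unhealthy" to 503 and everything else to 200.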
## Best Practices

- **Separate liveness from readiness.** Liveness should be trivial (process is alive). Readiness should check dependencies. Mixing them causes unnecessary restarts when a dependency is temporarily down.
- **Use a dedicated health port.** Serve health checks on a separate port (e.g., 8081) so they are not affected by application middleware, authentication, or rate limiting.
- **Keep liveness checks fast and dependency-free.** A liveness check that queries the database will cause cascading restarts when the database is slow, making a bad situation worse.
- **Include dependency latency in readiness responses.** Reporting `latency_ms` for each check helps operators spot degradation before it becomes an outage.
- **Use startup probes for slow services.** Without a startup probe, a liveness probe with a tight threshold will kill containers that are still initializing.
- **Cache dependency check results for a short TTL.** If readiness is probed every 5 seconds but the database check takes 2 seconds, cache the result for 3-5 seconds to avoid overloading dependencies with health check queries.
- **Return structured JSON.** Include individual check statuses and overall status so monitoring tools and dashboards can parse the response programmatically.
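The caching practice can be sketched as a small TTL wrapper; a minimal sketch assuming synchronous check functions (names and the default TTL are illustrative):

```python
import time

class CachedCheck:
    """Wrap an expensive dependency check so probes arriving within
    `ttl` seconds reuse the last result instead of re-querying."""

    def __init__(self, check_fn, ttl: float = 4.0):
        self.check_fn = check_fn
        self.ttl = ttl
        self._result = None
        self._checked_at = 0.0

    def __call__(self):
        now = time.monotonic()
        # Re-run the underlying check only when the cached result expired.
        if self._result is None or now - self._checked_at >= self.ttl:
            self._result = self.check_fn()
            self._checked_at = now
        return self._result
```

With a 5-second probe interval and a 4-second TTL, the dependency sees roughly one health query per interval per instance regardless of how many probers hit the endpoint.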
## Common Pitfalls

- **Checking dependencies in the liveness probe.** When the database goes down, every pod fails its liveness check and gets restarted simultaneously. The pods come back, still cannot reach the database, and enter a crash loop. The liveness probe should only check the process itself.
- **Health checks that require authentication.** If the auth service is down, health checks fail, and the service gets removed from the load balancer even though it could serve cached or anonymous requests.
- **Timeouts longer than probe intervals.** If the probe timeout is 10 seconds but the interval is 5 seconds, multiple probes overlap, wasting resources. Keep `timeout < interval`.
- **Not using a startup probe.** Java services or services that load large caches can take minutes to start. Without a startup probe, the liveness probe kills them during initialization.
- **Hard-failing readiness on optional dependencies.** If the recommendation engine is down, the order service can still process orders — it just shows fewer recommendations. Mark optional dependencies as "degraded," not "unhealthy."