
Health Checks

Health check endpoint patterns for liveness, readiness, and startup probes in distributed services


Health Checks — Observability

You are an expert in health check patterns for building observable systems.

Overview

Health checks are lightweight endpoints that report whether a service is alive, ready to accept traffic, and whether its dependencies are reachable. They are the foundation of automated recovery (container orchestrators restart unhealthy instances), load balancer routing (remove unhealthy backends), and deployment safety (readiness gates). A missing or poorly designed health check leads to traffic being sent to broken instances or healthy instances being needlessly restarted.

Core Concepts

  • Liveness probe: Answers "is this process stuck or deadlocked?" A failing liveness probe triggers a container restart. Should check only the process itself, not external dependencies.
  • Readiness probe: Answers "can this instance serve traffic right now?" A failing readiness probe removes the instance from the load balancer but does not restart it. Should verify that required dependencies (database, cache) are reachable.
  • Startup probe: Answers "has this process finished initializing?" Used for slow-starting applications to prevent liveness probes from killing a container that is still loading data or warming caches.
  • Dependency check: A sub-check within the readiness probe that verifies connectivity to a specific downstream system (database, message broker, external API).
  • Health check aggregation: Combining individual component checks into a single overall status (healthy, degraded, unhealthy) while preserving per-check detail.

Implementation Patterns

Python — FastAPI health endpoints

from fastapi import FastAPI, Response
from enum import Enum
import asyncpg
import redis.asyncio as aioredis  # aioredis was merged into redis-py; same API
import time

app = FastAPI()

class Status(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

# Liveness: only checks if the process is responsive
@app.get("/healthz")
async def liveness():
    return {"status": "ok"}

# Readiness: checks critical dependencies
@app.get("/readyz")
async def readiness(response: Response):
    checks = {}
    overall = Status.HEALTHY

    # Database check
    try:
        start = time.perf_counter()
        conn = await asyncpg.connect(dsn="postgresql://localhost/mydb", timeout=2)
        await conn.fetchval("SELECT 1")
        await conn.close()
        checks["database"] = {
            "status": Status.HEALTHY,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        }
    except Exception as e:
        checks["database"] = {"status": Status.UNHEALTHY, "error": str(e)}
        overall = Status.UNHEALTHY

    # Redis check (optional dependency: a failure only degrades readiness)
    try:
        start = time.perf_counter()
        redis = aioredis.from_url("redis://localhost", socket_timeout=2)
        await redis.ping()
        await redis.close()
        checks["redis"] = {
            "status": Status.HEALTHY,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        }
    except Exception as e:
        checks["redis"] = {"status": Status.DEGRADED, "error": str(e)}
        if overall == Status.HEALTHY:
            overall = Status.DEGRADED

    if overall == Status.UNHEALTHY:
        response.status_code = 503
    elif overall == Status.DEGRADED:
        response.status_code = 200  # still serve traffic, but signal degradation

    return {"status": overall, "checks": checks}

Go — standard library health server

package main

import (
    "context"
    "database/sql"
    "encoding/json"
    "net/http"
    "time"

    _ "github.com/lib/pq"
)

type HealthResponse struct {
    Status string                 `json:"status"`
    Checks map[string]CheckResult `json:"checks,omitempty"`
}

type CheckResult struct {
    Status    string  `json:"status"`
    LatencyMs float64 `json:"latency_ms,omitempty"`
    Error     string  `json:"error,omitempty"`
}

var db *sql.DB

func livenessHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(HealthResponse{Status: "ok"})
}

func readinessHandler(w http.ResponseWriter, r *http.Request) {
    checks := make(map[string]CheckResult)
    overall := "healthy"

    ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
    defer cancel()

    start := time.Now()
    if err := db.PingContext(ctx); err != nil {
        checks["database"] = CheckResult{Status: "unhealthy", Error: err.Error()}
        overall = "unhealthy"
    } else {
        checks["database"] = CheckResult{
            Status:    "healthy",
            LatencyMs: float64(time.Since(start).Milliseconds()),
        }
    }

    w.Header().Set("Content-Type", "application/json")
    if overall == "unhealthy" {
        w.WriteHeader(http.StatusServiceUnavailable)
    }
    json.NewEncoder(w).Encode(HealthResponse{Status: overall, Checks: checks})
}

func main() {
    var err error
    db, err = sql.Open("postgres", "postgresql://localhost/mydb?sslmode=disable")
    if err != nil {
        panic(err)
    }

    mux := http.NewServeMux()
    mux.HandleFunc("/healthz", livenessHandler)
    mux.HandleFunc("/readyz", readinessHandler)
    if err := http.ListenAndServe(":8081", mux); err != nil {
        panic(err)
    }
}

Kubernetes probe configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      containers:
        - name: order-service
          image: order-service:1.4.2
          ports:
            - containerPort: 8080
            - containerPort: 8081  # health port, separate from app traffic

          # Startup probe: give slow-starting apps up to 5 minutes
          startupProbe:
            httpGet:
              path: /healthz
              port: 8081
            failureThreshold: 30
            periodSeconds: 10

          # Liveness probe: restart if the process is stuck
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 0
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3

          # Readiness probe: remove from service if dependencies are down
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
            successThreshold: 1

Load balancer health check (AWS ALB example)

# Terraform
resource "aws_lb_target_group" "api" {
  name     = "api-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    path                = "/readyz"
    port                = "8081"
    protocol            = "HTTP"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
    matcher             = "200"
  }
}

Core Philosophy

Health checks are the contract between your application and the infrastructure that manages it. A liveness check says "I am not stuck — do not restart me." A readiness check says "I can serve traffic right now — send me requests." These are fundamentally different questions with fundamentally different consequences for getting the answer wrong. A liveness check that is too sensitive restarts healthy containers during transient dependency issues, causing cascading failures. A readiness check that is too lenient sends traffic to instances that cannot process it, causing user-facing errors. Getting this contract right is one of the highest-leverage reliability investments you can make.

The most important principle in health check design is that liveness checks must be trivial and dependency-free. A liveness check that queries the database will trigger container restarts when the database is slow — exactly the moment when restarting containers makes the situation worse by adding connection churn to an already overwhelmed database. Liveness should verify only that the process is responsive (can it handle an HTTP request? is the event loop running?). Dependency health belongs exclusively in the readiness check, where a failure removes the instance from the load balancer without destroying it.
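One way to make a liveness check slightly stronger than a bare 200 while staying dependency-free is to measure event-loop lag. This is an illustrative sketch, not part of the reference implementation above; the 250 ms threshold is an arbitrary assumption to tune per service.

```python
import asyncio
import time

async def event_loop_lag(sleep_s: float = 0.01) -> float:
    """Return how far behind schedule the event loop resumed us.

    A healthy loop resumes an asyncio.sleep() almost exactly on time;
    a loop starved by blocking work resumes it late.
    """
    start = time.perf_counter()
    await asyncio.sleep(sleep_s)
    return max(0.0, (time.perf_counter() - start) - sleep_s)

async def liveness_with_lag(threshold_s: float = 0.25) -> dict:
    """Liveness body: fail only if the process's own event loop is stuck."""
    lag = await event_loop_lag()
    return {
        "status": "ok" if lag < threshold_s else "stuck",
        "event_loop_lag_ms": round(lag * 1000, 1),
    }
```

Because the check never leaves the process, it cannot fail during a dependency outage, which is exactly the property a liveness probe needs.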

Health checks should be informative, not just binary. A readiness endpoint that returns a bare 200 or 503 tells the orchestrator what to do but tells operators nothing about why. A structured JSON response with individual dependency statuses, latency measurements, and version information turns the health endpoint into a lightweight diagnostic tool. When an instance is failing readiness, the response itself should tell you whether the database is unreachable, the cache is slow, or a downstream API is returning errors — without requiring anyone to SSH into the container or search through logs.
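As a sketch of what "informative" buys you, a monitoring-side consumer can parse the readiness body programmatically. The field names (`status`, `checks`, `latency_ms`) match the FastAPI example above; the 200 ms latency budget is an arbitrary assumption.

```python
import json

def slow_or_failing(readiness_body: str, latency_budget_ms: float = 200.0) -> list[str]:
    """Return the names of checks that are unhealthy or over the latency budget."""
    payload = json.loads(readiness_body)
    flagged = []
    for name, result in payload.get("checks", {}).items():
        if result.get("status") != "healthy":
            flagged.append(name)
        elif result.get("latency_ms", 0.0) > latency_budget_ms:
            flagged.append(name)
    return flagged

# A sample body in the shape produced by the /readyz endpoint above
body = json.dumps({
    "status": "degraded",
    "checks": {
        "database": {"status": "healthy", "latency_ms": 12.4},
        "redis": {"status": "degraded", "error": "timeout"},
    },
})
```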

Anti-Patterns

  • Dependency checks in the liveness probe. When the database goes down and every pod's liveness check fails because it queries the database, the orchestrator restarts all pods simultaneously. The pods come back, still cannot reach the database, and enter a crash loop. This turns a database issue into a complete application outage.

  • No startup probe for slow services. Java applications, services that load ML models, or services that warm large caches can take minutes to initialize. Without a startup probe, the liveness probe's tight failure threshold kills the container before it finishes starting, creating an infinite restart loop.

  • Binary health responses. Returning only 200 or 503 without any body content means operators must consult logs and dashboards to understand why a service is unhealthy. Return structured JSON with individual check results and latency measurements so the health endpoint itself serves as a diagnostic tool.

  • Health checks behind authentication middleware. If the health endpoint requires a valid auth token and the auth service is down, health checks fail even though the service itself is functional. Serve health endpoints on a separate port or path that bypasses authentication and rate limiting.

  • Hard-failing readiness on optional dependencies. Marking the service as unhealthy because a non-critical dependency (recommendations engine, analytics service) is down removes the service from the load balancer entirely, even though it could still serve its core function. Classify dependencies as critical or optional and report optional failures as "degraded," not "unhealthy."

Best Practices

  • Separate liveness from readiness. Liveness should be trivial (process is alive). Readiness should check dependencies. Mixing them causes unnecessary restarts when a dependency is temporarily down.
  • Use a dedicated health port. Serve health checks on a separate port (e.g., 8081) so they are not affected by application middleware, authentication, or rate limiting.
  • Keep liveness checks fast and dependency-free. A liveness check that queries the database will cause cascading restarts when the database is slow, making a bad situation worse.
  • Include dependency latency in readiness responses. Reporting latency_ms for each check helps operators spot degradation before it becomes an outage.
  • Use startup probes for slow services. Without a startup probe, a liveness probe with a tight threshold will kill containers that are still initializing.
  • Cache dependency check results for a short TTL. If readiness is probed every 5 seconds but the database check takes 2 seconds, cache the result for 3-5 seconds to avoid overloading dependencies with health check queries.
  • Return structured JSON. Include individual check statuses and overall status so monitoring tools and dashboards can parse the response programmatically.
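The TTL-caching practice above can be sketched as a small wrapper (illustrative; the 4-second default TTL is an assumption to be tuned against the probe interval):

```python
import time
from typing import Any, Callable

class CachedCheck:
    """Wrap a dependency check so frequent probes reuse a recent result
    instead of re-querying the dependency on every probe."""

    def __init__(self, check_fn: Callable[[], Any], ttl_seconds: float = 4.0):
        self.check_fn = check_fn
        self.ttl = ttl_seconds
        self._result: Any = None
        self._expires = 0.0  # monotonic deadline; 0 forces the first call

    def __call__(self) -> Any:
        now = time.monotonic()
        if now >= self._expires:
            self._result = self.check_fn()
            self._expires = now + self.ttl
        return self._result
```

Wrap each dependency check once at startup and call the wrapper from the readiness handler; successive probes within the TTL never touch the dependency.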

Common Pitfalls

  • Checking dependencies in the liveness probe. When the database goes down, every pod fails its liveness check and gets restarted simultaneously. The pods come back, still cannot reach the database, and enter a crash loop. The liveness probe should only check the process itself.
  • Health checks that require authentication. If the auth service is down, health checks fail, and the service gets removed from the load balancer even though it could serve cached or anonymous requests.
  • Timeouts longer than probe intervals. If the probe timeout is 10 seconds but the interval is 5 seconds, a new probe fires before the previous one finishes, so probes pile up and waste resources. Keep timeout < interval.
  • Not using a startup probe. Java services or services that load large caches can take minutes to start. Without a startup probe, the liveness probe kills them during initialization.
  • Hard-failing readiness on optional dependencies. If the recommendation engine is down, the order service can still process orders — it just shows fewer recommendations. Mark optional dependencies as "degraded," not "unhealthy."
