
Circuit Breaker

Circuit breaker and resilience patterns for building fault-tolerant distributed systems


Circuit Breaker & Resilience Patterns — System Design

You are an expert in Circuit Breaker and Resilience Patterns for designing scalable distributed systems.

Core Philosophy

Resilience in distributed systems is not about preventing failures — it is about controlling their blast radius. Every network call is a potential failure point, and the difference between a minor blip and a full outage is whether the system is designed to degrade gracefully or collapse under pressure.

The circuit breaker pattern embodies a fundamental shift in mindset: instead of optimistically retrying until something works, assume that a failing dependency will stay failed for a while and protect the rest of the system from that reality. Fail fast, shed load, and give the downstream service room to recover rather than piling on more requests.

Resilience is a layered discipline. No single pattern is sufficient. Timeouts prevent indefinite waits. Retries with backoff handle transient errors. Bulkheads isolate blast radius. Circuit breakers stop sustained cascading failure. Fallbacks preserve the user experience. These patterns compose — each covers a failure mode the others miss.

Overview

In distributed systems, failures are inevitable. A downstream service may become slow or unresponsive, and without protection, callers will exhaust their resources waiting. The circuit breaker pattern detects failures and prevents cascading outages by short-circuiting calls to unhealthy services. Combined with retries, timeouts, bulkheads, and fallbacks, it forms a comprehensive resilience strategy.

Core Concepts

Circuit Breaker States

                    failure threshold
    [CLOSED] --------------------------> [OPEN]
       ^                                  |  ^
       |                  timeout expires |  | probe failure
       |                                  v  |
       +-------- probe success -------- [HALF-OPEN]
  • Closed: Requests flow normally. Failures are counted. When the failure count exceeds a threshold within a time window, the circuit opens.
  • Open: Requests are immediately rejected (fail-fast) without calling the downstream service. After a timeout period, the circuit transitions to half-open.
  • Half-Open: A limited number of probe requests are allowed through. If they succeed, the circuit closes. If they fail, the circuit reopens.
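As a minimal sketch of the state machine above (all names and thresholds here are illustrative, not from any particular library):

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, open_timeout=30.0, probe_limit=2):
        self.failure_threshold = failure_threshold  # failures before opening
        self.open_timeout = open_timeout            # seconds to stay open
        self.probe_limit = probe_limit              # probe successes needed to close
        self.state = self.CLOSED
        self.failures = 0
        self.probe_successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == self.OPEN:
            if time.monotonic() - self.opened_at >= self.open_timeout:
                self.state = self.HALF_OPEN         # timeout expired: allow probes
                self.probe_successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == self.HALF_OPEN:
            self._open()                            # probe failed: reopen
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def _on_success(self):
        if self.state == self.HALF_OPEN:
            self.probe_successes += 1
            if self.probe_successes >= self.probe_limit:
                self.state = self.CLOSED            # probes succeeded: close
                self.failures = 0
        else:
            self.failures = 0                       # healthy traffic resets the count

    def _open(self):
        self.state = self.OPEN
        self.opened_at = time.monotonic()
```

A production breaker would typically use a sliding window of outcomes rather than a raw counter, and would need to be thread-safe; this sketch only shows the transitions.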

Resilience Patterns Family

  • Timeout: Set a maximum wait time for every external call. Prevents threads from blocking indefinitely.
  • Retry with Backoff: Retry transient failures with exponential backoff and jitter to avoid thundering herds.
  • Bulkhead: Isolate resources (thread pools, connection pools) per dependency so that one slow service cannot consume all resources.
  • Fallback: When a call fails or the circuit is open, return a degraded response (cached data, default values, or a user-friendly error).
  • Load Shedding: When the system is overloaded, proactively reject low-priority requests to preserve capacity for critical ones.
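Retry with backoff is easy to get subtly wrong, so here is one common variant ("full jitter") as an illustrative sketch; the base and cap values are assumptions, not recommendations from this document:

```python
import random

def backoff_delays(attempts, base=0.1, cap=10.0, rng=random.random):
    """Yield one sleep duration per retry attempt (full-jitter backoff)."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))   # exponential growth, capped
        yield rng() * ceiling                       # uniform in [0, ceiling)

def call_with_retry(fn, attempts=4, sleep=lambda s: None):
    """Call fn, retrying on failure with jittered backoff between attempts."""
    last_exc = None
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:        # real code: catch transient errors only
            last_exc = exc
            if attempt < attempts - 1:
                sleep(next(delays))     # spread retries to avoid a thundering herd
    raise last_exc
```

In real use you would pass `sleep=time.sleep`; injecting the sleep and RNG keeps the logic testable. The jitter is the important part: without it, all clients that failed together retry together.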

Implementation Patterns

Library-Based Circuit Breakers

Use libraries like Resilience4j (Java), Polly (.NET), or Hystrix (legacy; Netflix placed it in maintenance mode and recommends Resilience4j). These wrap outgoing calls with circuit breaker logic, configurable thresholds, and metrics.

CircuitBreaker config:
  failure_rate_threshold: 50%
  slow_call_threshold: 80%
  wait_duration_in_open_state: 30s
  permitted_calls_in_half_open: 5
  sliding_window_size: 100 requests

Service Mesh Circuit Breaking

Istio and Linkerd implement circuit breaking at the proxy level. No application code changes needed. The sidecar proxy tracks error rates and opens the circuit based on configured thresholds. Works across languages and frameworks.
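As a sketch, an Istio DestinationRule along these lines applies proxy-level limits and outlier detection (the host name and every threshold below are illustrative, not tuned values):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-circuit-breaker
spec:
  host: reviews.default.svc.cluster.local
  trafficPolicy:
    connectionPool:              # bulkhead-style connection limits
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
    outlierDetection:            # Istio's circuit-breaking mechanism
      consecutive5xxErrors: 5    # eject a host after 5 consecutive 5xx
      interval: 10s              # how often hosts are evaluated
      baseEjectionTime: 30s      # roughly the "open" duration
      maxEjectionPercent: 50     # never eject more than half the pool
```

Note that Envoy-based outlier detection ejects individual unhealthy hosts from the load-balancing pool rather than tracking a single closed/open/half-open state, but the effect is the same: traffic stops flowing to endpoints that are failing.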

Retry Budget

Instead of configuring retries per call, set a retry budget for the entire service: "no more than 10% of requests should be retries." This prevents retry amplification where each layer retries, multiplying the total load on the failing service.
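A retry budget can be sketched as a simple counter that gates retries against total request volume (the names and the minimum-retry floor here are hypothetical; real implementations track a decaying or sliding window rather than all-time counts):

```python
class RetryBudget:
    """Allow retries only while they stay under a fraction of request volume."""

    def __init__(self, ratio=0.10, min_retries=10):
        self.ratio = ratio              # e.g. 10% of requests may be retries
        self.min_retries = min_retries  # floor so low-traffic services can retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        budget = max(self.min_retries, self.requests * self.ratio)
        return self.retries < budget

    def record_retry(self):
        self.retries += 1
```

Because the budget is service-wide, a failing dependency sees at most `ratio` extra load from this service no matter how many call sites retry.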

Graceful Degradation

Design the system to function in a degraded mode when dependencies fail:

  • Product page shows cached reviews when the Reviews Service is down.
  • Search returns recent popular results when the Search Service is slow.
  • Dashboard shows "data unavailable" for one widget instead of failing entirely.
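The cached-reviews example above can be sketched as a small wrapper that refreshes a cache on success and serves stale data on failure (`fetch_reviews` and the cache shape are hypothetical):

```python
def with_fallback(fn, cache, key, default=None):
    """Wrap fn so failures return the last good value instead of raising."""
    def wrapped(*args, **kwargs):
        try:
            value = fn(*args, **kwargs)
            cache[key] = value                  # refresh fallback on every success
            return value
        except Exception:
            return cache.get(key, default)      # degraded, but not broken
    return wrapped
```

In practice the fallback path should also be distinguishable to callers (e.g. a staleness flag or metric), so degraded mode is visible in monitoring rather than silent.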

Health Check and Readiness Probes

Kubernetes liveness and readiness probes detect unhealthy instances. A service that has its circuit breaker open for too many downstream dependencies can mark itself as not-ready, allowing the load balancer to shift traffic to healthier instances.
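The not-ready rule above amounts to a small predicate over the instance's breaker states (the threshold and the state-map shape are illustrative):

```python
def is_ready(breaker_states, max_open=1):
    """Report ready only while at most max_open circuit breakers are open."""
    open_count = sum(1 for state in breaker_states.values() if state == "open")
    return open_count <= max_open
```

A readiness endpoint would return HTTP 200 or 503 based on this check; liveness should generally not depend on downstream health, or Kubernetes will restart instances that are merely shielding themselves from a broken dependency.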

Trade-offs

  Factor                  With Circuit Breaker                       Without Circuit Breaker
  Cascade prevention      Yes (fail-fast)                            No; failures propagate
  Resource usage          Protected (bulkheads, timeouts)            Threads/connections exhausted
  Latency during outage   Immediate failure (fast)                   Hangs until timeout
  Complexity              Configuration and tuning needed            Simpler code
  False positives         Possible (circuit opens too aggressively)  N/A

Circuit breakers are essential for any service that calls other services over the network. The configuration tuning effort is far outweighed by the protection against cascading failures.

Best Practices

  • Combine circuit breakers with timeouts and retries as a layered defense: timeout prevents indefinite waits, retries handle transient errors, and the circuit breaker prevents sustained calls to a broken dependency.
  • Set circuit breaker thresholds based on measured baseline error rates and latencies, not arbitrary numbers; a 1% error rate threshold makes no sense if the service normally has 0.5% errors.
  • Always provide a fallback behavior when the circuit is open — even if it is just a clear error message, it is better than a hanging request or a raw exception propagating to the user.

Common Pitfalls

  • Retrying without backoff and jitter, which creates a thundering herd that overwhelms the recovering service and prevents it from stabilizing.
  • Setting the half-open probe count too high, which floods a fragile recovering service with traffic and causes the circuit to reopen immediately, trapping it in a failure loop.

Anti-Patterns

  • Retry Storm Amplification: Every layer in the call chain retries independently — the gateway retries 3 times, each calling a service that retries 3 times, resulting in 9x the load on the failing dependency. Use retry budgets to cap total retry traffic.

  • Circuit Breaker Without Fallback: Opening the circuit and returning a raw 503 error to the user. The circuit breaker protects the backend, but without a fallback (cached data, default response, graceful degradation) the user experience still breaks.

  • One-Size-Fits-All Thresholds: Applying identical circuit breaker settings to all dependencies regardless of their error profiles. A payment service and a recommendation service have very different acceptable failure rates and recovery times.

  • Ignoring Partial Failures: Treating any downstream error as a reason to open the circuit, including expected client errors (4xx). Circuit breakers should trigger on server errors and timeouts, not on validation failures.
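The distinction in this bullet reduces to an error classifier that the breaker consults before counting a failure (a sketch; the parameter names are illustrative):

```python
def counts_as_failure(status_code=None, timed_out=False):
    """Only timeouts, connection failures, and 5xx should trip the breaker."""
    if timed_out:
        return True
    if status_code is None:
        return True                     # connection-level failure, no response
    return 500 <= status_code <= 599    # 4xx are the caller's problem, not the service's
```

Counting 4xx responses would let a single misbehaving client open the circuit for everyone.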

  • Testing Only the Happy Path: Never exercising the circuit open or half-open states in staging or chaos testing. The first time the circuit breaker fires in production should not be the first time anyone sees how the system behaves in that state.
