Circuit Breaker
Circuit breaker and resilience patterns for building fault-tolerant distributed systems
You are an expert in Circuit Breaker and Resilience Patterns for designing scalable distributed systems.
Core Philosophy
Resilience in distributed systems is not about preventing failures — it is about controlling their blast radius. Every network call is a potential failure point, and the difference between a minor blip and a full outage is whether the system is designed to degrade gracefully or collapse under pressure.
The circuit breaker pattern embodies a fundamental shift in mindset: instead of optimistically retrying until something works, assume that a failing dependency will stay failed for a while and protect the rest of the system from that reality. Fail fast, shed load, and give the downstream service room to recover rather than piling on more requests.
Resilience is a layered discipline. No single pattern is sufficient. Timeouts prevent indefinite waits. Retries with backoff handle transient errors. Bulkheads isolate blast radius. Circuit breakers stop sustained cascading failure. Fallbacks preserve the user experience. These patterns compose — each covers a failure mode the others miss.
Overview
In distributed systems, failures are inevitable. A downstream service may become slow or unresponsive, and without protection, callers will exhaust their resources waiting. The circuit breaker pattern detects failures and prevents cascading outages by short-circuiting calls to unhealthy services. Combined with retries, timeouts, bulkheads, and fallbacks, it forms a comprehensive resilience strategy.
Core Concepts
Circuit Breaker States
                        failure threshold
    [CLOSED] ---------------------------------> [OPEN]
       ^                                         |   ^
       |                                         |   |
       |                         timeout expires |   | failure
       | success                                 v   |
       +----------------------------------- [HALF-OPEN]
- Closed: Requests flow normally. Failures are counted. When the failure count exceeds a threshold within a time window, the circuit opens.
- Open: Requests are immediately rejected (fail-fast) without calling the downstream service. After a timeout period, the circuit transitions to half-open.
- Half-Open: A limited number of probe requests are allowed through. If they succeed, the circuit closes. If they fail, the circuit reopens.
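The state machine above can be captured in a few dozen lines. The following is an illustrative Python sketch, not a library API: the class and parameter names (`CircuitBreaker`, `failure_threshold`, `open_timeout`, `probe_limit`) are chosen for clarity, and it counts consecutive failures rather than the sliding-window failure rate most production libraries use.

```python
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

class CircuitBreaker:
    """Minimal circuit breaker: counts consecutive failures, fails fast
    while open, and lets a limited number of probes through half-open."""

    def __init__(self, failure_threshold=5, open_timeout=30.0, probe_limit=1):
        self.failure_threshold = failure_threshold  # failures before opening
        self.open_timeout = open_timeout            # seconds to stay open
        self.probe_limit = probe_limit              # probes allowed half-open
        self.state = CLOSED
        self.failures = 0
        self.opened_at = 0.0
        self.probes = 0

    def call(self, fn, *args, **kwargs):
        if self.state == OPEN:
            if time.monotonic() - self.opened_at >= self.open_timeout:
                self.state, self.probes = HALF_OPEN, 0  # start probing
            else:
                raise RuntimeError("circuit open: failing fast")
        if self.state == HALF_OPEN and self.probes >= self.probe_limit:
            raise RuntimeError("circuit half-open: probe limit reached")
        if self.state == HALF_OPEN:
            self.probes += 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.state, self.failures = CLOSED, 0

    def _on_failure(self):
        if self.state == HALF_OPEN:
            self._open()  # a failed probe reopens the circuit immediately
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def _open(self):
        self.state = OPEN
        self.opened_at = time.monotonic()
        self.failures = 0
```

Note that a failed half-open probe reopens the circuit at once, while a success resets it to closed, matching the diagram above.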
Resilience Patterns Family
- Timeout: Set a maximum wait time for every external call. Prevents threads from blocking indefinitely.
- Retry with Backoff: Retry transient failures with exponential backoff and jitter to avoid thundering herds.
- Bulkhead: Isolate resources (thread pools, connection pools) per dependency so that one slow service cannot consume all resources.
- Fallback: When a call fails or the circuit is open, return a degraded response (cached data, default values, or a user-friendly error).
- Load Shedding: When the system is overloaded, proactively reject low-priority requests to preserve capacity for critical ones.
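As a concrete illustration of the retry pattern above, here is a hypothetical Python helper using exponential backoff with "full jitter": each wait is drawn uniformly between zero and the capped exponential delay, so retries from many clients spread out instead of arriving in synchronized waves. The function name and defaults are illustrative.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))  # full jitter in [0, cap]
```

In real code the `except Exception` should be narrowed to transient errors (timeouts, connection resets); retrying a 400-class validation error only wastes capacity.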
Implementation Patterns
Library-Based Circuit Breakers
Use libraries like Resilience4j (Java), Polly (.NET), or Hystrix (legacy). These wrap outgoing calls with circuit breaker logic, configurable thresholds, and metrics.
CircuitBreaker config:
failure_rate_threshold: 50%
slow_call_threshold: 80%
wait_duration_in_open_state: 30s
permitted_calls_in_half_open: 5
sliding_window_size: 100 requests
Service Mesh Circuit Breaking
Istio and Linkerd implement circuit breaking at the proxy level. No application code changes needed. The sidecar proxy tracks error rates and opens the circuit based on configured thresholds. Works across languages and frameworks.
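As a sketch of what this looks like in practice, the following Istio DestinationRule (the service name and the numbers are hypothetical) combines connection-pool limits with outlier detection, which is Istio's circuit-breaking mechanism for ejecting failing instances from the load-balancing pool:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-circuit-breaker        # hypothetical name
spec:
  host: reviews.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # bulkhead: cap queued requests
    outlierDetection:
      consecutive5xxErrors: 5          # errors before ejecting an instance
      interval: 10s                    # how often hosts are analyzed
      baseEjectionTime: 30s            # how long an ejected host is skipped
      maxEjectionPercent: 50           # never eject more than half the pool
```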
Retry Budget
Instead of configuring retries per call, set a retry budget for the entire service: "no more than 10% of requests should be retries." This prevents retry amplification where each layer retries, multiplying the total load on the failing service.
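The idea can be sketched with a shared counter of requests versus retries. This is an illustrative simplification (class and parameter names are assumptions); production implementations typically track the ratio over a decaying time window rather than over all time.

```python
import threading

class RetryBudget:
    """Service-wide retry budget: grant retries only while they stay
    under a fixed fraction of total requests (e.g. 10%)."""

    def __init__(self, ratio=0.1, min_retries=10):
        self.ratio = ratio              # max retries as fraction of requests
        self.min_retries = min_retries  # floor so low traffic can still retry
        self.requests = 0
        self.retries = 0
        self._lock = threading.Lock()

    def record_request(self):
        with self._lock:
            self.requests += 1

    def try_acquire_retry(self):
        """Return True if a retry is within budget, and charge for it."""
        with self._lock:
            allowed = max(self.min_retries, self.requests * self.ratio)
            if self.retries < allowed:
                self.retries += 1
                return True
            return False
```

Every call site checks `try_acquire_retry()` before retrying; once the budget is spent, failures propagate instead of multiplying load on the failing dependency.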
Graceful Degradation
Design the system to function in a degraded mode when dependencies fail:
- Product page shows cached reviews when the Reviews Service is down.
- Search returns recent popular results when the Search Service is slow.
- Dashboard shows "data unavailable" for one widget instead of failing entirely.
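The first example above can be sketched as a fallback chain: try the live call, fall back to cached data, and degrade to an explicit placeholder only when both fail. The function and flag names here are illustrative, with `fetch` and `cache` standing in for a reviews client and a local cache.

```python
def get_reviews(product_id, fetch, cache):
    """Live reviews when possible; cached reviews when the service is
    down; an explicit 'unavailable' placeholder as the last resort."""
    try:
        reviews = fetch(product_id)
        cache[product_id] = reviews  # refresh the cache on success
        return {"reviews": reviews, "degraded": False}
    except Exception:
        if product_id in cache:
            return {"reviews": cache[product_id], "degraded": True}
        return {"reviews": [], "degraded": True,
                "message": "reviews unavailable"}
```

Marking the response as `degraded` lets the UI show a subtle notice (e.g. "reviews may be out of date") instead of silently serving stale data.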
Health Check and Readiness Probes
Kubernetes liveness and readiness probes detect unhealthy instances. A service that has its circuit breaker open for too many downstream dependencies can mark itself as not-ready, allowing the load balancer to shift traffic to healthier instances.
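A minimal probe configuration for this setup might look like the fragment below; the paths, port, and timings are hypothetical, with `/ready` assumed to be an application endpoint that returns non-200 when too many downstream circuits are open.

```yaml
# Hypothetical container spec fragment. When /ready starts failing,
# Kubernetes marks the pod not-ready and the Service stops routing
# traffic to it until the downstream circuits recover.
readinessProbe:
  httpGet:
    path: /ready          # app-defined endpoint, illustrative
    port: 8080
  periodSeconds: 5
  failureThreshold: 3     # 3 consecutive failures mark the pod not-ready
livenessProbe:
  httpGet:
    path: /healthz        # liveness should NOT depend on downstreams
    port: 8080
  periodSeconds: 10
```

Note the asymmetry: readiness may reflect downstream health, but liveness should not, or a downstream outage will cause Kubernetes to restart healthy pods.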
Trade-offs
| Factor | With Circuit Breaker | Without Circuit Breaker |
|---|---|---|
| Cascade prevention | Yes — fail-fast | No — failures propagate |
| Resource usage | Protected (bulkheads, timeouts) | Threads/connections exhausted |
| Latency during outage | Immediate failure (fast) | Hangs until timeout |
| Complexity | Configuration and tuning needed | Simpler code |
| False positives | Possible — circuit opens too aggressively | N/A |
Circuit breakers are essential for any service that calls other services over the network. The configuration tuning effort is far outweighed by the protection against cascading failures.
Best Practices
- Combine circuit breakers with timeouts and retries as a layered defense: timeout prevents indefinite waits, retries handle transient errors, and the circuit breaker prevents sustained calls to a broken dependency.
- Set circuit breaker thresholds based on measured baseline error rates and latencies, not arbitrary numbers; a 1% error rate threshold makes no sense if the service normally has 0.5% errors.
- Always provide a fallback behavior when the circuit is open — even if it is just a clear error message, it is better than a hanging request or a raw exception propagating to the user.
Common Pitfalls
- Retrying without backoff and jitter, which creates a thundering herd that overwhelms the recovering service and prevents it from stabilizing.
- Setting the half-open probe count too high, which floods a fragile recovering service with traffic and causes the circuit to reopen immediately, trapping it in a failure loop.
Anti-Patterns
- Retry Storm Amplification: Every layer in the call chain retries independently — the gateway retries 3 times, each calling a service that retries 3 times, resulting in 9x the load on the failing dependency. Use retry budgets to cap total retry traffic.
- Circuit Breaker Without Fallback: Opening the circuit and returning a raw 503 error to the user. The circuit breaker protects the backend, but without a fallback (cached data, default response, graceful degradation) the user experience still breaks.
- One-Size-Fits-All Thresholds: Applying identical circuit breaker settings to all dependencies regardless of their error profiles. A payment service and a recommendation service have very different acceptable failure rates and recovery times.
- Ignoring Partial Failures: Treating any downstream error as a reason to open the circuit, including expected client errors (4xx). Circuit breakers should trigger on server errors and timeouts, not on validation failures.
- Testing Only the Happy Path: Never exercising the circuit open or half-open states in staging or chaos testing. The first time the circuit breaker fires in production should not be the first time anyone sees how the system behaves in that state.
Related Skills
- API Gateway Design: API gateway and Backend-for-Frontend (BFF) patterns for managing client-service communication
- Database Scaling: Database scaling patterns including sharding, replication, and read replicas for high-throughput systems
- Distributed Caching: Distributed caching strategies for reducing latency and database load in large-scale systems
- Event-Driven: Event-driven architecture and CQRS patterns for reactive, decoupled distributed systems
- Message Queues: Message queue patterns including pub/sub, fan-out, and reliable delivery for asynchronous communication
- Microservices: Microservices architecture patterns for building independently deployable, loosely coupled services