# Rate Limiting Design
Rate limiting and throttling strategies for protecting distributed systems at scale
You are an expert in Rate Limiting and Throttling for designing scalable distributed systems.
## Core Philosophy
Rate limiting is a system's immune response — it protects shared resources from overconsumption, whether caused by traffic spikes, misbehaving clients, or deliberate abuse. Without it, a single bad actor or a sudden viral event can consume all available capacity, degrading the experience for every other user. Rate limiting ensures fairness by making resource consumption explicit and bounded.
The key design tension is between strictness and usability. Limits that are too tight frustrate legitimate users and force unnecessary retry logic. Limits that are too loose fail to protect the system when it matters most. The solution is layered limits — generous per-day quotas for normal usage, stricter per-second limits to prevent bursts, and adaptive tightening when the system is under stress.
Rate limiting should be transparent, not adversarial. Clients need clear signals about their limits, their remaining budget, and when they can retry. A well-implemented rate limiter with proper response headers and error messages turns a frustrating "access denied" into a predictable, programmable contract between the system and its consumers.
## Overview
Rate limiting controls how many requests a client or service can make within a time window. It protects backend services from overload, ensures fair resource allocation, prevents abuse, and is a critical layer in any production system handling significant traffic.
## Core Concepts
### Rate Limiting Layers
```
[Client] --> [CDN/Edge] --> [API Gateway] --> [Service] --> [Database]
             rate limit     rate limit        local limit   connection pool
             (L7 edge)      (per-client)      (concurrency) (backpressure)
```
Rate limits can be applied at every layer: edge/CDN, load balancer, API gateway, application code, and even at the database connection pool level.
### Common Algorithms
- **Token Bucket**: A bucket holds tokens that refill at a fixed rate. Each request consumes a token. Allows bursts up to bucket capacity while enforcing an average rate.
- **Leaky Bucket**: Requests enter a queue that drains at a constant rate. Smooths out bursts but can add latency.
- **Fixed Window**: Count requests in fixed time windows (e.g., per minute). Simple but allows bursts at window boundaries.
- **Sliding Window Log**: Track timestamps of each request; count those within the sliding window. Accurate but memory-intensive.
- **Sliding Window Counter**: Hybrid of fixed window and sliding log. Weights the previous window's count proportionally. Good balance of accuracy and efficiency.
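To make the token-bucket mechanics concrete, here is a minimal in-process sketch (the class and method names are illustrative, not from any particular library):

```python
import time

class TokenBucket:
    """Token bucket: refills at `rate` tokens/second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full so an initial burst is allowed
        self.clock = clock              # injectable clock, useful for testing
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill based on elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A burst of up to `capacity` requests passes immediately; after that, requests are admitted at the refill rate, which is exactly the burst-plus-average behavior described above.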
### Distributed Rate Limiting
When running multiple service instances, rate limit state must be shared. Common approaches:
- Centralized store (Redis) with atomic increment operations.
- Approximate local counters with periodic synchronization.
- Consistent hashing to route a given client's requests to the same limiter node.
## Implementation Patterns
### Redis-Based Token Bucket
Use Redis `INCR` with `EXPIRE` for fixed-window counters, or Lua scripts for atomic token-bucket logic. Redis Cluster provides horizontal scaling for the rate limit store itself.
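A sketch of the `INCR` + `EXPIRE` fixed-window pattern, written against a redis-py-style client (the key-naming scheme and the `now` parameter are our own choices; as noted above, production code would wrap the two calls in a Lua script for atomicity):

```python
import time

def fixed_window_allow(redis, client_id, limit, window_seconds, now=None):
    """Fixed-window counter: one Redis key per (client, window number)."""
    if now is None:
        now = time.time()
    window = int(now // window_seconds)
    key = f"ratelimit:{client_id}:{window}"
    count = redis.incr(key)             # atomic increment, shared by all instances
    if count == 1:
        # First request of this window: expire the key when the window ends.
        # INCR followed by EXPIRE is two round-trips with a small race window;
        # a Lua script makes the pair atomic.
        redis.expire(key, window_seconds)
    return count <= limit
```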
### Hierarchical Rate Limits
Apply multiple limits simultaneously: per-second burst limit, per-minute sustained limit, per-day quota. A request passes only if all limits allow it. This prevents both sudden spikes and sustained abuse.
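The "passes only if all limits allow it" rule can be sketched with toy in-memory counters (a real implementation would also refund tokens already taken from earlier tiers when a later tier rejects, or check every tier before consuming):

```python
class WindowLimit:
    """Toy counting limiter for a single tier (no time handling; for illustration)."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def allow(self):
        if self.count < self.limit:
            self.count += 1
            return True
        return False

def allow_request(tiers):
    # Admit only if every tier (burst, sustained, quota) admits the request.
    # all() short-circuits, so tiers after the first rejection are not charged.
    return all(tier.allow() for tier in tiers)
```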
### Adaptive Rate Limiting
Adjust limits dynamically based on system health. When CPU or latency crosses a threshold, tighten limits. When the system recovers, relax them. This is a form of load shedding.
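One way to sketch adaptive tightening, using p99 latency as the health signal (the target and floor values here are arbitrary examples, not recommendations):

```python
def adaptive_limit(base_limit, p99_latency_ms, target_ms=200, floor=0.1):
    """Scale the rate limit down as observed latency exceeds the target."""
    if p99_latency_ms <= target_ms:
        return base_limit                        # healthy: full limit
    # Tighten proportionally to how far latency overshoots the target,
    # but never drop below `floor` of the base limit.
    factor = max(floor, target_ms / p99_latency_ms)
    return int(base_limit * factor)
```

As latency recovers toward the target, the computed limit relaxes back to `base_limit` automatically.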
### Client-Side Rate Limiting
Well-behaved clients implement local rate limiting with exponential backoff and jitter. This reduces wasted network calls and distributes retry storms over time.
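Exponential backoff with "full jitter" — a delay drawn uniformly from zero up to an exponentially growing cap — can be sketched as follows (the base and cap constants are illustrative):

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Full-jitter backoff: uniform over [0, min(cap, base * 2**attempt)] seconds."""
    # Jitter spreads simultaneous retries out in time, so clients that were
    # all rejected at once do not come back at the same instant.
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```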
### Response Headers
Communicate limits to clients via headers: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`. Return HTTP 429 (Too Many Requests) with a `Retry-After` header when limits are exceeded.
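Assembling those headers might look like this (note that `X-RateLimit-*` is a widespread convention rather than a formal standard; the helper name is our own):

```python
def rate_limit_headers(limit, remaining, reset_epoch, retry_after=None):
    """Build conventional rate-limit response headers as a dict."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),   # epoch seconds when the window resets
    }
    if retry_after is not None:
        # Sent alongside HTTP 429 so clients know exactly how long to wait.
        headers["Retry-After"] = str(retry_after)
    return headers
```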
## Trade-offs
| Factor | Strict (centralized) | Approximate (local) |
|---|---|---|
| Accuracy | Exact counts | May slightly over-allow |
| Latency | Redis round-trip per request | No extra network hop |
| Complexity | Requires shared state | Stateless per instance |
| Failure mode | Redis outage = open or closed? | Degrades gracefully |
Use strict centralized limiting for billing-sensitive or abuse-prevention scenarios. Use approximate local limiting for performance-critical paths where slight over-admission is acceptable.
## Best Practices
- Apply rate limits at the outermost layer possible (edge, API gateway) to shed load before it reaches expensive backend logic.
- Use sliding window counters as the default algorithm — they offer a good tradeoff between accuracy and resource usage without the boundary-burst problem of fixed windows.
- Always return clear rate-limit headers and a meaningful 429 response body so clients can implement proper backoff without guessing.
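The weighting arithmetic behind the sliding window counter recommended above fits in a few lines (function and parameter names are our own):

```python
def sliding_window_allow(prev_count, curr_count, elapsed_fraction, limit):
    """Sliding window counter: blend the previous window's count proportionally."""
    # If we are 30% of the way into the current window, 70% of the previous
    # window still overlaps the sliding window, so weight its count by 0.7.
    estimated = prev_count * (1 - elapsed_fraction) + curr_count
    return estimated < limit
```

This keeps only two counters per client yet avoids the boundary bursts of a plain fixed window.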
## Common Pitfalls
- Failing to plan for Redis unavailability — decide ahead of time whether to fail open (allow all) or fail closed (deny all) and document that decision.
- Setting only a per-minute limit without a per-second burst limit, allowing a client to send all allowed requests in a single burst at the start of the window.
## Anti-Patterns
- **Silent Rate Limiting**: Rejecting requests without returning rate-limit headers or meaningful error messages. Clients have no way to implement proper backoff and resort to blind retries that make the problem worse.
- **Single-Tier Limits Only**: Applying only one limit (e.g., 1000 requests per minute) without burst protection. A client can send all 1000 requests in the first second, causing the same backend stress that rate limiting was supposed to prevent.
- **Rate Limiting After Expensive Work**: Checking rate limits deep in the request pipeline after authentication, deserialization, and validation have already consumed resources. Limits should be checked as early as possible to shed load before it reaches costly processing.
- **No Fail-Open/Fail-Closed Decision**: Deploying a centralized rate limiter (Redis) without defining what happens when the limiter itself is unavailable. The system either silently allows all traffic (fail-open) or blocks everything (fail-closed), and neither is acceptable if unplanned.
- **Uniform Limits Across All Clients**: Applying the same limits to free-tier users, paying customers, and internal services. This either under-protects the system from free-tier abuse or unnecessarily throttles high-value traffic.