
Rate Limiting Design

Rate limiting and throttling strategies for protecting distributed systems at scale


Rate Limiting & Throttling — System Design

You are an expert in Rate Limiting and Throttling for designing scalable distributed systems.

Core Philosophy

Rate limiting is a system's immune response — it protects shared resources from overconsumption, whether caused by traffic spikes, misbehaving clients, or deliberate abuse. Without it, a single bad actor or a sudden viral event can consume all available capacity, degrading the experience for every other user. Rate limiting ensures fairness by making resource consumption explicit and bounded.

The key design tension is between strictness and usability. Limits that are too tight frustrate legitimate users and force unnecessary retry logic. Limits that are too loose fail to protect the system when it matters most. The solution is layered limits — generous per-day quotas for normal usage, stricter per-second limits to prevent bursts, and adaptive tightening when the system is under stress.

Rate limiting should be transparent, not adversarial. Clients need clear signals about their limits, their remaining budget, and when they can retry. A well-implemented rate limiter with proper response headers and error messages turns a frustrating "access denied" into a predictable, programmable contract between the system and its consumers.

Overview

Rate limiting controls how many requests a client or service can make within a time window. It protects backend services from overload, ensures fair resource allocation, prevents abuse, and is a critical layer in any production system handling significant traffic.

Core Concepts

Rate Limiting Layers

[Client] --> [CDN/Edge] --> [API Gateway] --> [Service] --> [Database]
              rate limit     rate limit       local limit    connection pool
              (L7 edge)     (per-client)     (concurrency)  (backpressure)

Rate limits can be applied at every layer: edge/CDN, load balancer, API gateway, application code, and even at the database connection pool level.

Common Algorithms

  • Token Bucket: A bucket holds tokens that refill at a fixed rate. Each request consumes a token. Allows bursts up to bucket capacity while enforcing an average rate.
  • Leaky Bucket: Requests enter a queue that drains at a constant rate. Smooths out bursts but can add latency.
  • Fixed Window: Count requests in fixed time windows (e.g., per minute). Simple but allows bursts at window boundaries.
  • Sliding Window Log: Track timestamps of each request; count those within the sliding window. Accurate but memory-intensive.
  • Sliding Window Counter: Hybrid of fixed window and sliding log. Weights the previous window's count proportionally. Good balance of accuracy and efficiency.
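As a concrete illustration of the last algorithm, here is a minimal in-process sliding window counter (the class and method names are illustrative, not from any particular library). It weights the previous fixed window's count by how much of that window still overlaps the sliding window:

```python
import time

class SlidingWindowCounter:
    """Approximate sliding-window rate limiter.

    estimated = previous_window_count * overlap_fraction + current_window_count
    """

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # window start time -> request count

    def allow(self, now=None):
        now = time.time() if now is None else now
        current_start = now - (now % self.window)
        previous_start = current_start - self.window

        current = self.counts.get(current_start, 0)
        previous = self.counts.get(previous_start, 0)

        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (now - current_start) / self.window
        estimated = previous * overlap + current

        if estimated >= self.limit:
            return False
        self.counts[current_start] = current + 1
        # Drop windows too old to ever contribute again.
        for start in list(self.counts):
            if start < previous_start:
                del self.counts[start]
        return True
```

In a distributed deployment the two counters would live in a shared store rather than an instance-local dict.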

Distributed Rate Limiting

When running multiple service instances, rate limit state must be shared. Common approaches:

  • Centralized store (Redis) with atomic increment operations.
  • Approximate local counters with periodic synchronization.
  • Consistent hashing to route a given client's requests to the same limiter node.

Implementation Patterns

Redis-Based Token Bucket

Use Redis INCR with EXPIRE for fixed-window counters, or Lua scripts for atomic token-bucket logic. Redis Cluster provides horizontal scaling for the rate limit store itself.
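The token-bucket logic that such a Lua script would execute atomically is small. The sketch below shows it in plain Python for clarity (the `state` dict stands in for the Redis hash a real script would read and write under a per-client key; names are illustrative):

```python
import time

def token_bucket_allow(state, rate, capacity, now=None):
    """One token-bucket check: refill for elapsed time, then try to
    consume one token. In production this read-refill-consume sequence
    runs inside a Redis Lua script (via EVAL) so it is atomic across
    service instances; here `state` is {'tokens': float, 'ts': float}.
    """
    now = time.time() if now is None else now
    tokens = state.get("tokens", float(capacity))
    ts = state.get("ts", now)

    # Refill at `rate` tokens/second, capped at bucket capacity.
    tokens = min(float(capacity), tokens + (now - ts) * rate)

    allowed = tokens >= 1.0
    if allowed:
        tokens -= 1.0
    state["tokens"] = tokens
    state["ts"] = now
    return allowed
```

The same arithmetic translates line-for-line into Lua over a Redis hash.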

Hierarchical Rate Limits

Apply multiple limits simultaneously: per-second burst limit, per-minute sustained limit, per-day quota. A request passes only if all limits allow it. This prevents both sudden spikes and sustained abuse.
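A sketch of the all-tiers-must-allow rule, using simple fixed-window counters for each tier (illustrative; note the check is split from the commit so a request rejected by one tier is not counted against the others):

```python
class FixedWindowLimiter:
    """Minimal fixed-window counter for one limit tier."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.window_start = 0.0
        self.count = 0

    def _roll(self, now):
        if now - self.window_start >= self.window:
            self.window_start = now - (now % self.window)
            self.count = 0

    def would_allow(self, now):
        self._roll(now)
        return self.count < self.limit

    def record(self, now):
        self._roll(now)
        self.count += 1

def allow_request(limiters, now):
    """Admit only if every tier (burst, sustained, quota) has room,
    then count the request against all tiers together."""
    if all(l.would_allow(now) for l in limiters):
        for l in limiters:
            l.record(now)
        return True
    return False
```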

Adaptive Rate Limiting

Adjust limits dynamically based on system health. When CPU or latency crosses a threshold, tighten limits. When the system recovers, relax them. This is a form of load shedding.
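One simple policy, sketched below: keep the configured limit while the health signal is low, and ramp it down linearly once a threshold is crossed (the thresholds, the linear ramp, and the 4x reduction factor are illustrative choices, not a standard):

```python
def adaptive_limit(base_limit, cpu_utilization, low=0.5, high=0.8):
    """Scale the configured limit down as CPU utilization rises."""
    if cpu_utilization <= low:
        return base_limit
    if cpu_utilization >= high:
        return max(1, base_limit // 4)  # floor: never shed to zero
    # Linear ramp between the low and high thresholds.
    frac = (cpu_utilization - low) / (high - low)
    return max(1, int(base_limit * (1 - 0.75 * frac)))
```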

Client-Side Rate Limiting

Well-behaved clients implement local rate limiting with exponential backoff and jitter. This reduces wasted network calls and distributes retry storms over time.
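A minimal "full jitter" backoff sketch: each retry sleeps a random duration between zero and an exponentially growing cap, so clients that were rejected together do not all retry together:

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Delay in seconds before retry number `attempt` (0-based):
    uniform over [0, min(cap, base * 2**attempt))."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

A well-behaved client would also honor a server-supplied Retry-After value when one is present, using the jittered delay only as a fallback.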

Response Headers

Communicate limits to clients via headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. Return HTTP 429 (Too Many Requests) with a Retry-After header when limits are exceeded.
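A sketch of building these headers (the `X-RateLimit-*` names are the widespread convention used above; an IETF draft proposes unprefixed `RateLimit-*` equivalents):

```python
import time

def rate_limit_headers(limit, remaining, reset_epoch):
    """Headers describing the client's budget; adds Retry-After
    when the budget is exhausted (i.e. alongside a 429 status)."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if remaining <= 0:
        # Seconds until the window resets, never less than 1.
        headers["Retry-After"] = str(max(1, reset_epoch - int(time.time())))
    return headers
```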

Trade-offs

| Factor       | Strict (centralized)           | Approximate (local)     |
|--------------|--------------------------------|-------------------------|
| Accuracy     | Exact counts                   | May slightly over-allow |
| Latency      | Redis round-trip per request   | No extra network hop    |
| Complexity   | Requires shared state          | Stateless per instance  |
| Failure mode | Redis outage = open or closed? | Degrades gracefully     |

Use strict centralized limiting for billing-sensitive or abuse-prevention scenarios. Use approximate local limiting for performance-critical paths where slight over-admission is acceptable.

Best Practices

  • Apply rate limits at the outermost layer possible (edge, API gateway) to shed load before it reaches expensive backend logic.
  • Use sliding window counters as the default algorithm — they offer a good tradeoff between accuracy and resource usage without the boundary-burst problem of fixed windows.
  • Always return clear rate-limit headers and a meaningful 429 response body so clients can implement proper backoff without guessing.

Common Pitfalls

  • Failing to plan for Redis unavailability — decide ahead of time whether to fail open (allow all) or fail closed (deny all) and document that decision.
  • Setting only a per-minute limit without a per-second burst limit, allowing a client to send all allowed requests in a single burst at the start of the window.

Anti-Patterns

  • Silent Rate Limiting: Rejecting requests without returning rate-limit headers or meaningful error messages. Clients have no way to implement proper backoff and resort to blind retries that make the problem worse.

  • Single-Tier Limits Only: Applying only one limit (e.g., 1000 requests per minute) without burst protection. A client can send all 1000 requests in the first second, causing the same backend stress that rate limiting was supposed to prevent.

  • Rate Limiting After Expensive Work: Checking rate limits deep in the request pipeline after authentication, deserialization, and validation have already consumed resources. Limits should be checked as early as possible to shed load before it reaches costly processing.

  • No Fail-Open/Fail-Closed Decision: Deploying a centralized rate limiter (Redis) without defining what happens when the limiter itself is unavailable. The system either silently allows all traffic (fail-open) or blocks everything (fail-closed), and neither is acceptable if unplanned.

  • Uniform Limits Across All Clients: Applying the same limits to free-tier users, paying customers, and internal services. This either under-protects the system from free-tier abuse or unnecessarily throttles high-value traffic.
