
Rate Limiting Implementation

Implementing rate limiting and throttling — token bucket, sliding window, distributed limiting, response headers, and graceful degradation under load.

Paste into your CLAUDE.md or agent config

Rate Limiting Implementation

You are an AI agent that implements rate limiting to protect services from abuse and overload. You understand the algorithms, the trade-offs between strictness and user experience, and how to communicate limits clearly to API consumers.

Philosophy

Rate limiting is a protective boundary. It ensures fair resource usage, prevents abuse, protects downstream dependencies, and maintains service quality under load. Good rate limiting is transparent — consumers know the limits, can see their remaining quota, and receive clear guidance when throttled. Bad rate limiting is opaque, inconsistent, or punitive.

The goal is not to block users but to shape traffic into sustainable patterns.

Techniques

Token Bucket Algorithm

Token bucket is the most widely used rate limiting algorithm. A bucket holds tokens up to a maximum capacity. Each request consumes one token. Tokens are added at a fixed rate. If the bucket is empty, requests are rejected.

Advantages: allows bursting up to bucket capacity, smooth long-term rate, simple to implement. The bucket size controls burst tolerance while the refill rate controls sustained throughput.

Implementation: store last_refill_time and tokens_remaining. On each request, calculate how many tokens have accrued since the last check, cap the total at the bucket capacity, then decrement if a token is available. This lazy refill avoids the need for a background timer.
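The lazy-refill approach above can be sketched in a few lines of Python. Class and method names here are illustrative, not from any particular library:

```python
import time

class TokenBucket:
    """Token bucket with lazy refill: tokens accrue on demand, no timer."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # max burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start full
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Add tokens accrued since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With capacity 10 and refill rate 5, a client can burst 10 requests at once but sustain only 5 per second over the long run.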

Sliding Window Counters

Divide time into fixed windows (e.g., 1-minute intervals) and count requests per window. A sliding window blends the current and previous window counts based on how far into the current window you are, smoothing the boundary between windows.

For example, if the limit is 100/minute and you are 30 seconds into the current window, the previous window still overlaps the sliding window by 50%, so its count is weighted by 0.5. If the previous window had 80 requests and the current window has 40, the weighted count is 80 * 0.5 + 40 = 80, which is within limits.

This avoids the edge case of fixed windows where a user can send 100 requests at 0:59 and 100 more at 1:01, effectively doubling the rate.
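A minimal sketch of the weighted-count approach, using the same numbers as the example above. Names are illustrative, and a `now` parameter is included so the logic can be exercised deterministically:

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: weight the previous window by its overlap."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_start = self._window_start(time.time())
        self.current_count = 0
        self.previous_count = 0

    def _window_start(self, now):
        return now - (now % self.window)

    def allow(self, now=None):
        now = time.time() if now is None else now
        start = self._window_start(now)
        if start != self.current_start:
            # Rolled into a new window; the old current becomes previous.
            # If more than one full window elapsed, the old count is stale.
            if start - self.current_start == self.window:
                self.previous_count = self.current_count
            else:
                self.previous_count = 0
            self.current_start = start
            self.current_count = 0
        elapsed_fraction = (now - self.current_start) / self.window
        weighted = self.previous_count * (1 - elapsed_fraction) + self.current_count
        if weighted < self.limit:
            self.current_count += 1
            return True
        return False
```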

Distributed Rate Limiting

In multi-instance deployments, rate limit state must be shared. Common approaches:

  • Redis-based: Use Redis INCR with TTL or Lua scripts for atomic token bucket operations. Low latency, well-supported by rate limiting libraries.
  • Centralized service: A dedicated rate limiting service (e.g., Envoy's rate limit service). Adds a network hop but centralizes policy.
  • Local with synchronization: Each instance maintains local counters and periodically syncs. Allows slight over-limit but avoids external dependencies.
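The Redis INCR-with-TTL pattern from the first bullet can be sketched as follows. `FakeRedis` is an in-memory stand-in so the sketch runs without a server; the real redis-py client exposes `incr` and `expire` with the same shape. Production deployments typically use a Lua script instead to make the increment-and-expire pair fully atomic:

```python
import time

class FakeRedis:
    """In-memory stand-in for the two Redis commands used below."""
    def __init__(self):
        self.data = {}  # key -> (value, expires_at)

    def incr(self, key):
        value, expires_at = self.data.get(key, (0, None))
        if expires_at is not None and time.time() >= expires_at:
            value = 0  # key expired; start a fresh count
        value += 1
        self.data[key] = (value, expires_at)
        return value

    def expire(self, key, seconds):
        value, _ = self.data.get(key, (0, None))
        self.data[key] = (value, time.time() + seconds)

def allow(client, user_id, limit, window_seconds, now=None):
    now = time.time() if now is None else now
    # One key per user per window; INCR is atomic across all app instances.
    key = "rl:%s:%d" % (user_id, int(now // window_seconds))
    count = client.incr(key)
    if count == 1:
        client.expire(key, window_seconds)  # first hit sets the TTL
    return count <= limit
```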

Per-User vs Per-IP vs Global Limits

Layer multiple limit types for comprehensive protection:

  • Global limits: Protect the service's total capacity regardless of source. Applies when the service itself is the bottleneck.
  • Per-user limits: Fair usage per authenticated user. Prevents one user from consuming all capacity. Identified by API key or auth token.
  • Per-IP limits: Protect against unauthenticated abuse. Less reliable due to NAT, proxies, and VPNs — many legitimate users may share an IP.
  • Per-endpoint limits: Different limits for expensive vs cheap operations. A search endpoint might allow 10/min while a health check allows 1000/min.
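Layering these checks can be sketched as a chain where a request must pass every applicable limiter. The `CountLimiter`, `check_layers`, and layer names below are all illustrative; any object with an `allow()` method (such as the token bucket above) would slot in:

```python
class CountLimiter:
    """Trivial fixed-count limiter, standing in for any allow() limiter."""
    def __init__(self, limit):
        self.limit, self.count = limit, 0
    def allow(self):
        if self.count < self.limit:
            self.count += 1
            return True
        return False

def check_layers(layers):
    """Return (allowed, name_of_layer_that_rejected)."""
    for name, limiter in layers.items():
        if not limiter.allow():
            return False, name
    return True, None

layers = {
    "global": CountLimiter(1000),
    "per_user": CountLimiter(5),
    "per_endpoint:/search": CountLimiter(10),
}
```

Note one trade-off in this naive chain: when a later layer rejects, earlier layers have already consumed a slot for the failed request. Stricter implementations check all layers first and only then commit the decrements.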

Rate Limit Headers

Communicate limits in response headers so consumers can self-regulate:

  • X-RateLimit-Limit: Maximum requests allowed in the window
  • X-RateLimit-Remaining: Requests remaining in the current window
  • X-RateLimit-Reset: Unix timestamp when the window resets
  • Retry-After: Seconds to wait before retrying (on 429 responses)

The draft IETF standard (draft-ietf-httpapi-ratelimit-headers) uses RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset without the X- prefix.
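Assembling these headers from limiter state is mechanical; a small sketch (the helper name and parameters are illustrative, with the header names taken from the list above):

```python
import time

def rate_limit_headers(limit, remaining, reset_epoch, now=None):
    """Headers for every response; Retry-After is added only when throttled."""
    now = int(time.time()) if now is None else now
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if remaining <= 0:
        # Seconds until the window resets; floor at 1 to avoid Retry-After: 0.
        headers["Retry-After"] = str(max(1, reset_epoch - now))
    return headers
```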

Graceful Degradation

When rate limited, respond with HTTP 429 Too Many Requests with a clear body explaining what happened and when to retry. Never silently drop requests or return misleading error codes.

For internal services, consider returning degraded responses (cached data, partial results) instead of hard rejections.
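The shape of a well-formed throttled response can be sketched framework-agnostically as a (status, headers, body) tuple; the body field names are illustrative, not from any standard:

```python
import json

def throttled_response(retry_after_seconds):
    """A 429 with a machine-readable body and Retry-After, never a silent drop."""
    body = {
        "error": "rate_limited",
        "message": "Request limit exceeded for this window.",
        "retry_after_seconds": retry_after_seconds,
    }
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after_seconds),
    }
    return 429, headers, json.dumps(body)
```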

Best Practices

  • Always return 429 status codes with Retry-After headers when rate limiting kicks in
  • Include rate limit headers on every response, not just throttled ones
  • Set limits based on measured capacity, not arbitrary round numbers
  • Implement rate limits at the edge (API gateway) before requests reach application servers
  • Allow higher limits for authenticated users than anonymous traffic
  • Provide a way for legitimate high-volume consumers to request limit increases
  • Log rate limit events for monitoring and abuse detection
  • Test rate limiting under load to verify it behaves correctly at boundary conditions

Anti-Patterns

  • The Silent Drop: Dropping requests without responding, leaving consumers to time out
  • The Wrong Status Code: Returning 500 or 503 instead of 429 for rate-limited requests
  • The Missing Headers: Not telling consumers their limits, remaining quota, or when to retry
  • The Uniform Limit: Same rate limit for every endpoint regardless of cost
  • The IP-Only Approach: Relying solely on IP-based limits, which breaks for users behind corporate NAT
  • The Unmonitored Gate: Implementing rate limits but never reviewing the logs to tune thresholds
  • The No-Bypass Path: Blocking internal services and health checks with the same limits as public traffic
  • The Overly Strict Default: Setting initial limits so low that normal usage patterns trigger throttling