Rate Limiting and API Throttling: Protecting Services at Scale
Implementation patterns for rate limiting and API throttling — covering algorithms, distributed rate limiting, quota management, and graceful degradation.
Rate limiting is a critical mechanism for protecting services from abuse, ensuring fair access, and maintaining system stability under load. Without it, a single client can monopolize resources, a bug can trigger a DDoS-like cascade, or a viral moment can bring down your entire platform.
Why Rate Limiting Matters
- Protect backend services from being overwhelmed
- Ensure fair usage across tenants and API consumers
- Prevent abuse from scrapers, bots, and malicious actors
- Control costs for metered infrastructure (cloud, third-party APIs)
- Enable SLA enforcement for tiered pricing plans
Rate Limiting Algorithms
1. Fixed Window Counter
The simplest approach. Count requests in fixed time windows.
```
Window: 1 minute
Limit:  100 requests
Counter resets at the top of each minute

Timeline:
12:00:00 ────────────── 12:01:00 ────────────── 12:02:00
         [ 100 allowed ]         [ 100 allowed ]
```
Problem: Boundary burst. A client could send 100 requests at 12:00:59 and 100 more at 12:01:00 — 200 requests in 2 seconds.
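A fixed window counter fits in a few lines. This is an illustrative in-memory, single-process sketch (the class and method names are not from any particular library):

```python
import time

class FixedWindowLimiter:
    """In-memory fixed-window counter (single process, illustrative)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_index = -1  # which window the counter belongs to
        self.count = 0

    def is_allowed(self):
        # Identify the current window by integer division of the clock
        window = int(time.time() // self.window_seconds)
        if window != self.window_index:
            # New window: reset the counter
            self.window_index = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Note that the reset happens lazily on the next request rather than via a timer, which is what makes the boundary burst above possible.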
2. Sliding Window Log
Track the timestamp of every request. Count requests within the sliding window.
```python
import time

# `redis` is assumed to be a connected client instance (e.g. redis.Redis())

def is_allowed(client_id, limit, window_seconds):
    now = time.time()
    key = f"rate:{client_id}"
    # Remove entries outside the window
    redis.zremrangebyscore(key, 0, now - window_seconds)
    # Count requests in window
    count = redis.zcard(key)
    if count < limit:
        redis.zadd(key, {str(now): now})
        redis.expire(key, window_seconds)
        return True
    return False
```
Pros: Exact window, no boundary burst. Cons: High memory usage — stores every timestamp.
3. Sliding Window Counter
Hybrid approach: use the previous window’s count weighted by overlap.
```
Previous window count: 84   (60% of the window still overlaps)
Current window count:  36
Estimated rate: 84 × 0.6 + 36 = 86.4
Limit: 100 → Allowed
```
Pros: Low memory, no boundary burst, constant-time lookup. Cons: Approximate (but close enough for production).
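The weighted estimate above can be sketched as follows (an illustrative single-process version; class and attribute names are invented for this example):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window using two fixed-window counts."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.curr_window = 0   # index of the current fixed window
        self.curr_count = 0
        self.prev_count = 0

    def is_allowed(self, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window)
        if window != self.curr_window:
            # Shift windows; anything older than one full window counts as zero
            self.prev_count = self.curr_count if window == self.curr_window + 1 else 0
            self.curr_count = 0
            self.curr_window = window
        # Fraction of the previous window still inside the sliding window
        overlap = 1.0 - (now % self.window) / self.window
        estimated = self.prev_count * overlap + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```

With the numbers from the example (previous count 84, current count 36, 60% overlap), the estimate comes out to 86.4 and the request is allowed.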
4. Token Bucket
A bucket holds tokens that refill at a constant rate. Each request consumes one token.
```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def consume(self, tokens=1):
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
```
Pros: Allows controlled bursts (up to bucket capacity). Smooth average rate. Cons: Slightly more complex implementation.
5. Leaky Bucket
Requests enter a queue (bucket) and are processed at a constant rate. Overflow is dropped.
Pros: Perfectly smooth output rate. Great for downstream services with fixed capacity. Cons: Adds latency (queuing). Drops requests during bursts.
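A leaky bucket can be sketched along the same lines as the token bucket above. This illustrative version drains whole requests lazily on each offer rather than running a background worker (names are invented for this example):

```python
import time
from collections import deque

class LeakyBucket:
    """Queue requests and drain them at a fixed rate; overflow is dropped."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity      # max queued requests
        self.leak_rate = leak_rate    # requests drained per second
        self.queue = deque()
        self.last_leak = time.time()

    def _leak(self):
        now = time.time()
        # Drain the whole requests that "leaked out" since the last check
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def offer(self, request):
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False  # bucket overflow: drop the request
```

A production version would also need a consumer that actually processes the queued requests at the leak rate; this sketch only models the admission decision.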
Algorithm Comparison
| Algorithm | Burst Handling | Memory | Accuracy | Complexity |
|---|---|---|---|---|
| Fixed Window | Poor (boundary) | O(1) | Approximate | Simple |
| Sliding Log | None (exact) | O(n) | Exact | Medium |
| Sliding Counter | Good | O(1) | Near-exact | Medium |
| Token Bucket | Controlled | O(1) | Good | Medium |
| Leaky Bucket | None (queued) | O(n) | Exact | Medium |
Distributed Rate Limiting
Single-instance rate limiting doesn’t work when you have multiple API gateway instances. Every instance needs to share state.
Centralized Counter (Redis)
```
API Gateway 1 ┐
API Gateway 2 ├──→ Redis (shared counter) ──→ Allow/Deny
API Gateway 3 ┘
```
Implementation:
```lua
-- Redis Lua script for atomic fixed-window rate limiting
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = tonumber(redis.call('GET', key) or '0')
if current < limit then
    local count = redis.call('INCR', key)
    if count == 1 then
        -- First request in this window: start the expiry clock.
        -- (Setting EXPIRE on every increment would keep pushing the
        -- window forward and the counter would never reset.)
        redis.call('EXPIRE', key, window)
    end
    return 1 -- allowed
end
return 0 -- denied
```
Local + Sync
Each instance maintains a local counter and periodically syncs with a central store:
- Lower latency (no network call per request)
- Eventually consistent (may over-allow during sync gaps)
- Acceptable for most use cases
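A minimal sketch of the local-counter-plus-sync pattern, assuming a `store` object exposing an atomic `incrby(key, n)` that returns the new cluster-wide total (as Redis's INCRBY does); all names here are illustrative:

```python
import threading
import time

class LocalSyncLimiter:
    """Count locally; push the delta to a shared store every sync_interval seconds."""

    def __init__(self, store, key, limit, sync_interval=1.0):
        self.store = store
        self.key = key
        self.limit = limit
        self.sync_interval = sync_interval
        self.local_count = 0       # requests seen since the last sync
        self.global_count = 0      # last known cluster-wide total
        self.last_sync = time.time()
        self.lock = threading.Lock()

    def is_allowed(self):
        with self.lock:
            if time.time() - self.last_sync >= self.sync_interval:
                # Push our local delta, pull the cluster-wide total
                self.global_count = self.store.incrby(self.key, self.local_count)
                self.local_count = 0
                self.last_sync = time.time()
            if self.global_count + self.local_count < self.limit:
                self.local_count += 1
                return True
            return False
```

Between syncs, each instance can admit up to `limit` minus the stale global count, which is exactly the over-allow window the bullets above describe.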
Response Headers
Always communicate rate limit status to clients:
```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 34
X-RateLimit-Reset: 1625097600
```
When rate limited:
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1625097600
Retry-After: 42
Content-Type: application/json

{"error": "Rate limit exceeded", "retry_after": 42}
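Building these headers in one place keeps them consistent across endpoints. A small illustrative helper (function names are invented for this sketch):

```python
import json

def rate_limit_headers(limit, remaining, reset_epoch):
    """Rate-limit headers to attach to every response."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }

def rate_limited_response(limit, reset_epoch, retry_after):
    """429 status, headers, and JSON body matching the example above."""
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(retry_after)
    headers["Content-Type"] = "application/json"
    body = json.dumps({"error": "Rate limit exceeded", "retry_after": retry_after})
    return 429, headers, body
```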
Multi-Dimensional Rate Limiting
Real-world APIs limit on multiple dimensions simultaneously:
| Dimension | Limit | Scope |
|---|---|---|
| Per API key | 1,000/min | Consumer-level |
| Per IP address | 100/min | Abuse prevention |
| Per endpoint | 50/min | Protect expensive operations |
| Per tenant | 10,000/min | SaaS multi-tenancy |
| Global | 100,000/min | System-wide protection |
All limits must pass for a request to proceed.
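The all-must-pass check can be sketched as a loop over per-dimension predicates. The lambdas below are stubs standing in for real limiters (per-key, per-IP, and so on); everything here is illustrative:

```python
def check_all_limits(request, limiters):
    """All dimensions must pass; returns (allowed, tripped_dimension)."""
    for dimension, check in limiters:
        if not check(request):
            return False, dimension  # report which limit tripped
    return True, None

# Example: per-key and per-IP checks as stubs over precomputed counts
limiters = [
    ("api_key", lambda r: r["key_count"] < 1000),
    ("ip", lambda r: r["ip_count"] < 100),
]
```

One design wrinkle worth noting: if the limiters consume tokens as a side effect, checking them sequentially means an early limiter is charged even when a later one denies the request, so real implementations may need to check first and consume afterwards, or refund on denial.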
Graceful Degradation
Instead of hard rejection (429), consider graceful degradation:
- Serve cached responses — Return stale data instead of no data
- Reduce response quality — Skip expensive computations, return simplified results
- Queue and process later — Accept the request but process it asynchronously
- Shed non-critical requests — Prioritize core functionality over nice-to-haves
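The serve-cached-responses strategy can be sketched as a small handler wrapper. Everything here is illustrative: `is_allowed` is any limiter predicate, `cache` is any dict-like store, and `process` is the real request handler:

```python
def handle(request, is_allowed, cache, process):
    """Degrade instead of hard-rejecting: serve a cached copy when over limit."""
    path = request["path"]
    if is_allowed(request["client_id"]):
        response = process(request)
        cache[path] = response          # keep a copy to serve while limited
        return 200, response
    if path in cache:
        return 200, cache[path]         # stale data beats no data
    return 429, None                    # nothing cached: fall back to rejection
```

A real version would also bound the cache's staleness (e.g. with TTLs) and likely signal degraded responses to the client via a header.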
Anti-Patterns
Client-Side Rate Limiting Only
Trusting the client to rate-limit itself is like trusting drivers to enforce their own speed limits.
No Rate Limiting on Internal APIs
Internal services can cause cascading failures just like external clients. Rate limit service-to-service calls too.
Overly Aggressive Limits
If legitimate users are regularly hitting limits, the limits are too low. Monitor 429 rates by client and adjust.
No Retry-After Header
Clients need to know when they can retry. Without Retry-After, they’ll retry immediately — making the problem worse.
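On the client side, honoring Retry-After looks roughly like this (an illustrative sketch; `send` is any callable returning a status, headers, and body):

```python
import random
import time

def request_with_backoff(send, max_attempts=5):
    """Retry on 429, waiting for Retry-After (plus jitter) before each retry."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        # Honor the server's hint; fall back to exponential backoff without it
        delay = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay + random.uniform(0, 0.1))  # jitter avoids thundering herds
    return status, body
```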
Rate Limiting Without Monitoring
If you can’t see who’s being rate-limited and why, you can’t tune the limits. Dashboard your 429 rates by client, endpoint, and time.
Production Checklist
- Rate limit headers on every response
- 429 responses include Retry-After
- Distributed rate limiting across all instances
- Multi-dimensional limits (key + IP + endpoint)
- Dashboard for 429 rates by client and endpoint
- Alerting when legitimate clients hit limits
- Graceful degradation for high-value clients
- Documentation of limits in API docs
- Rate limit bypass for internal health checks
- Load test to verify limits hold under pressure