Rate Limiting and API Throttling: Protecting Services at Scale
Implementation patterns for rate limiting and API throttling — covering algorithms, distributed rate limiting, quota management, and graceful degradation.
Rate limiting is a critical mechanism for protecting services from abuse, ensuring fair access, and maintaining system stability under load. Without it, a single client can monopolize resources, a bug can trigger a DDoS-like cascade, or a viral moment can bring down your entire platform.
Why Rate Limiting Matters
- Protect backend services from being overwhelmed
- Ensure fair usage across tenants and API consumers
- Prevent abuse from scrapers, bots, and malicious actors
- Control costs for metered infrastructure (cloud, third-party APIs)
- Enable SLA enforcement for tiered pricing plans
Rate Limiting Algorithms
1. Fixed Window Counter
The simplest approach. Count requests in fixed time windows.
```
Window: 1 minute
Limit:  100 requests
Counter resets at the top of each minute

Timeline:
12:00:00 ────────────── 12:01:00 ────────────── 12:02:00
         [ 100 allowed ]         [ 100 allowed ]
```
Problem: Boundary burst. A client could send 100 requests at 12:00:59 and 100 more at 12:01:00 — 200 requests in 2 seconds.
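A fixed window counter fits in a few lines. This is an illustrative in-memory, single-process sketch (the class and method names are not from any particular library):

```python
import time

class FixedWindowLimiter:
    """In-memory fixed-window counter (single process, illustrative)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_index = -1  # which window the counter belongs to
        self.count = 0

    def is_allowed(self):
        # Identify the current window by integer division of the clock
        window = int(time.time() // self.window_seconds)
        if window != self.window_index:
            # New window: reset the counter
            self.window_index = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Note that the reset happens lazily on the next request rather than via a timer, which is what makes the boundary burst above possible.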
2. Sliding Window Log
Track the timestamp of every request. Count requests within the sliding window.
```python
import time

# `redis` is assumed to be a connected client instance (e.g. redis.Redis())

def is_allowed(client_id, limit, window_seconds):
    now = time.time()
    key = f"rate:{client_id}"
    # Remove entries outside the window
    redis.zremrangebyscore(key, 0, now - window_seconds)
    # Count requests in window
    count = redis.zcard(key)
    if count < limit:
        redis.zadd(key, {str(now): now})
        redis.expire(key, window_seconds)
        return True
    return False
```
Pros: Exact window, no boundary burst. Cons: High memory usage — stores every timestamp.
3. Sliding Window Counter
Hybrid approach: use the previous window’s count weighted by overlap.
```
Previous window count: 84   (60% of the window still overlaps)
Current window count:  36
Estimated rate: 84 × 0.6 + 36 = 86.4
Limit: 100 → Allowed
```
Pros: Low memory, no boundary burst, constant-time lookup. Cons: Approximate (but close enough for production).
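The weighted estimate above can be sketched as follows (an illustrative single-process version; class and attribute names are invented for this example):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window using two fixed-window counts."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.curr_window = 0   # index of the current fixed window
        self.curr_count = 0
        self.prev_count = 0

    def is_allowed(self, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window)
        if window != self.curr_window:
            # Shift windows; anything older than one full window counts as zero
            self.prev_count = self.curr_count if window == self.curr_window + 1 else 0
            self.curr_count = 0
            self.curr_window = window
        # Fraction of the previous window still inside the sliding window
        overlap = 1.0 - (now % self.window) / self.window
        estimated = self.prev_count * overlap + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```

With the numbers from the example (previous count 84, current count 36, 60% overlap), the estimate comes out to 86.4 and the request is allowed.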
4. Token Bucket
A bucket holds tokens that refill at a constant rate. Each request consumes one token.
```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def consume(self, tokens=1):
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
```
Pros: Allows controlled bursts (up to bucket capacity). Smooth average rate. Cons: Slightly more complex implementation.
5. Leaky Bucket
Requests enter a queue (bucket) and are processed at a constant rate. Overflow is dropped.
Pros: Perfectly smooth output rate. Great for downstream services with fixed capacity. Cons: Adds latency (queuing). Drops requests during bursts.
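A leaky bucket can be sketched along the same lines as the token bucket above. This illustrative version drains whole requests lazily on each offer rather than running a background worker (names are invented for this example):

```python
import time
from collections import deque

class LeakyBucket:
    """Queue requests and drain them at a fixed rate; overflow is dropped."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity      # max queued requests
        self.leak_rate = leak_rate    # requests drained per second
        self.queue = deque()
        self.last_leak = time.time()

    def _leak(self):
        now = time.time()
        # Drain the whole requests that "leaked out" since the last check
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def offer(self, request):
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False  # bucket overflow: drop the request
```

A production version would also need a consumer that actually processes the queued requests at the leak rate; this sketch only models the admission decision.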
Algorithm Comparison
| Algorithm | Burst Handling | Memory | Accuracy | Complexity |
|---|---|---|---|---|
| Fixed Window | Poor (boundary) | O(1) | Approximate | Simple |
| Sliding Log | None (exact) | O(n) | Exact | Medium |
| Sliding Counter | Good | O(1) | Near-exact | Medium |
| Token Bucket | Controlled | O(1) | Good | Medium |
| Leaky Bucket | None (queued) | O(n) | Exact | Medium |
Distributed Rate Limiting
Single-instance rate limiting doesn’t work when you have multiple API gateway instances. Every instance needs to share state.
Centralized Counter (Redis)
```
API Gateway 1 ┐
API Gateway 2 ├──→ Redis (shared counter) ──→ Allow/Deny
API Gateway 3 ┘
```
Implementation:
```lua
-- Redis Lua script for atomic fixed-window rate limiting
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = tonumber(redis.call('GET', key) or '0')
if current < limit then
    local count = redis.call('INCR', key)
    if count == 1 then
        -- First request in this window: start the expiry clock.
        -- (Setting EXPIRE on every increment would keep pushing the
        -- window forward and the counter would never reset.)
        redis.call('EXPIRE', key, window)
    end
    return 1 -- allowed
end
return 0 -- denied
```
Local + Sync
Each instance maintains a local counter and periodically syncs with a central store:
- Lower latency (no network call per request)
- Eventually consistent (may over-allow during sync gaps)
- Acceptable for most use cases
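A minimal sketch of the local-counter-plus-sync pattern, assuming a `store` object exposing an atomic `incrby(key, n)` that returns the new cluster-wide total (as Redis's INCRBY does); all names here are illustrative:

```python
import threading
import time

class LocalSyncLimiter:
    """Count locally; push the delta to a shared store every sync_interval seconds."""

    def __init__(self, store, key, limit, sync_interval=1.0):
        self.store = store
        self.key = key
        self.limit = limit
        self.sync_interval = sync_interval
        self.local_count = 0       # requests seen since the last sync
        self.global_count = 0      # last known cluster-wide total
        self.last_sync = time.time()
        self.lock = threading.Lock()

    def is_allowed(self):
        with self.lock:
            if time.time() - self.last_sync >= self.sync_interval:
                # Push our local delta, pull the cluster-wide total
                self.global_count = self.store.incrby(self.key, self.local_count)
                self.local_count = 0
                self.last_sync = time.time()
            if self.global_count + self.local_count < self.limit:
                self.local_count += 1
                return True
            return False
```

Between syncs, each instance can admit up to `limit` minus the stale global count, which is exactly the over-allow window the bullets above describe.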
Response Headers
Always communicate rate limit status to clients:
```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 34
X-RateLimit-Reset: 1625097600
```
When rate limited:
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1625097600
Retry-After: 42
Content-Type: application/json

{"error": "Rate limit exceeded", "retry_after": 42}
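Building these headers in one place keeps them consistent across endpoints. A small illustrative helper (function names are invented for this sketch):

```python
import json

def rate_limit_headers(limit, remaining, reset_epoch):
    """Rate-limit headers to attach to every response."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }

def rate_limited_response(limit, reset_epoch, retry_after):
    """429 status, headers, and JSON body matching the example above."""
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(retry_after)
    headers["Content-Type"] = "application/json"
    body = json.dumps({"error": "Rate limit exceeded", "retry_after": retry_after})
    return 429, headers, body
```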
Multi-Dimensional Rate Limiting
Real-world APIs limit on multiple dimensions simultaneously:
| Dimension | Limit | Scope |
|---|---|---|
| Per API key | 1,000/min | Consumer-level |
| Per IP address | 100/min | Abuse prevention |
| Per endpoint | 50/min | Protect expensive operations |
| Per tenant | 10,000/min | SaaS multi-tenancy |
| Global | 100,000/min | System-wide protection |
All limits must pass for a request to proceed.
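The all-must-pass check can be sketched as a loop over per-dimension predicates. The lambdas below are stubs standing in for real limiters (per-key, per-IP, and so on); everything here is illustrative:

```python
def check_all_limits(request, limiters):
    """All dimensions must pass; returns (allowed, tripped_dimension)."""
    for dimension, check in limiters:
        if not check(request):
            return False, dimension  # report which limit tripped
    return True, None

# Example: per-key and per-IP checks as stubs over precomputed counts
limiters = [
    ("api_key", lambda r: r["key_count"] < 1000),
    ("ip", lambda r: r["ip_count"] < 100),
]
```

One design wrinkle worth noting: if the limiters consume tokens as a side effect, checking them sequentially means an early limiter is charged even when a later one denies the request, so real implementations may need to check first and consume afterwards, or refund on denial.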
Graceful Degradation
Instead of hard rejection (429), consider graceful degradation:
- Serve cached responses — Return stale data instead of no data
- Reduce response quality — Skip expensive computations, return simplified results
- Queue and process later — Accept the request but process it asynchronously
- Shed non-critical requests — Prioritize core functionality over nice-to-haves
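The serve-cached-responses strategy can be sketched as a small handler wrapper. Everything here is illustrative: `is_allowed` is any limiter predicate, `cache` is any dict-like store, and `process` is the real request handler:

```python
def handle(request, is_allowed, cache, process):
    """Degrade instead of hard-rejecting: serve a cached copy when over limit."""
    path = request["path"]
    if is_allowed(request["client_id"]):
        response = process(request)
        cache[path] = response          # keep a copy to serve while limited
        return 200, response
    if path in cache:
        return 200, cache[path]         # stale data beats no data
    return 429, None                    # nothing cached: fall back to rejection
```

A real version would also bound the cache's staleness (e.g. with TTLs) and likely signal degraded responses to the client via a header.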
Anti-Patterns
Client-Side Rate Limiting Only
Trusting the client to rate-limit itself is like trusting drivers to enforce their own speed limits.
No Rate Limiting on Internal APIs
Internal services can cause cascading failures just like external clients. Rate limit service-to-service calls too.
Overly Aggressive Limits
If legitimate users are regularly hitting limits, the limits are too low. Monitor 429 rates by client and adjust.
No Retry-After Header
Clients need to know when they can retry. Without Retry-After, they’ll retry immediately — making the problem worse.
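On the client side, honoring Retry-After looks roughly like this (an illustrative sketch; `send` is any callable returning a status, headers, and body):

```python
import random
import time

def request_with_backoff(send, max_attempts=5):
    """Retry on 429, waiting for Retry-After (plus jitter) before each retry."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        # Honor the server's hint; fall back to exponential backoff without it
        delay = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay + random.uniform(0, 0.1))  # jitter avoids thundering herds
    return status, body
```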
Rate Limiting Without Monitoring
If you can’t see who’s being rate-limited and why, you can’t tune the limits. Dashboard your 429 rates by client, endpoint, and time.
Production Checklist
- Rate limit headers on every response
- 429 responses include Retry-After
- Distributed rate limiting across all instances
- Multi-dimensional limits (key + IP + endpoint)
- Dashboard for 429 rates by client and endpoint
- Alerting when legitimate clients hit limits
- Graceful degradation for high-value clients
- Documentation of limits in API docs
- Rate limit bypass for internal health checks
- Load test to verify limits hold under pressure