What is the right rate limit for an API?

There is no universal answer — it depends on whether the limit is meant to prevent abuse (then it can be permissive for normal users), to enforce a billing tier (then it follows the contract), or to protect a backend resource (then it follows the resource's capacity). A common starting point for general anti-abuse on a public API is 60 requests per minute per IP, with a burst capacity of 100 requests, scaled up or down based on real traffic patterns once you have them.

Should I rate-limit by IP or by API key?

Both, for different reasons. Limiting by API key enforces each customer's quota, so one customer cannot consume capacity meant for others. Limiting by IP stops a single address from saturating the system whether or not it presents an API key. Authenticated traffic is best rate-limited by user or key. Unauthenticated traffic is best rate-limited by IP, because that is the only stable identifier you have.

How do I rate-limit fairly when many users share one IP?

Treat IP as a coarse signal and pair it with another (account, session, behavior pattern). On CGNAT and mobile carrier IPs, a strict per-IP limit will harm many users in one go. Practical approach: set the per-IP limit at a level that allows for moderate sharing (3-5x what a single user would do), then enforce a tighter per-account or per-session limit on top. For unauthenticated abuse, the IP limit catches the worst offenders without locking out everyone behind the same NAT.

What rate-limiting algorithm should I use?

Token bucket for most API rate limiting. It allows controlled bursts while enforcing an average rate, and its state (a token count and a last-refill time per key) fits in a short atomic Redis script or a similar primitive. Sliding window log for cases where exact request counts in a recent window matter (e.g., legal compliance). Leaky bucket for output-rate limiting, where smoothing matters more than allowing bursts. Fixed window is the simplest and the worst: bursts that straddle a window boundary can reach 2× the intended limit.

Rate limiting by IP — the right way (and the common mistakes)

Why IP rate-limiting is your first defense

When a request hits your service, the IP address is the only stable identifier you have before any authentication or session has happened. That makes it the obvious axis for rate-limiting anonymous traffic. Scrapers, credential stuffers, spam form submissions, and exploratory vulnerability probes often hammer your endpoints from one IP or a small number of them. Even in a distributed flood, each source hits its own cap. Limiting requests per IP per unit of time is the cheapest way to bound the damage.

The same simplicity that makes IP rate limiting attractive is also what makes the edge cases bite. IP is not a user identifier — it's a network identifier. Many users share one IP, one user has many IPs, and the IP you see in your handler may not be the IP the request actually came from. This article walks through the production considerations.

Pick the right algorithm

Token bucket — your default

Each IP has a bucket holding up to N tokens. Each request consumes one. Tokens are refilled at rate R per second, up to the cap N. If the bucket is empty when a request arrives, the request is rejected.

Properties: allows bursts of up to N requests at once while enforcing an average rate of R per second. The state is just two numbers per key (the token count and the time of the last refill). It therefore fits in a short Redis Lua script that reads, refills, and decrements atomically. Many public API rate limits (Stripe's, for example) are token buckets at their core.
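A minimal in-memory sketch of the token bucket described above follows. The class name TokenBucket, its allow() method, and the injectable clock parameter are illustrative and do not come from any library. A shared deployment would keep the same two numbers per key in Redis and update them in one atomic script.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """Per-key token bucket: holds up to `capacity` tokens, refilled at `rate` tokens/second."""

    capacity: float
    rate: float
    clock: Callable[[], float] = time.monotonic  # injectable so tests can control time
    _tokens: Dict[str, float] = field(default_factory=dict, init=False, repr=False)
    _updated: Dict[str, float] = field(default_factory=dict, init=False, repr=False)

    def allow(self, key: str, cost: float = 1.0) -> bool:
        now = self.clock()
        # A key seen for the first time starts with a full bucket.
        tokens = self._tokens.get(key, self.capacity)
        last = self._updated.get(key, now)
        # Lazy refill: add tokens for the time elapsed since the last update, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        self._updated[key] = now
        if tokens >= cost:
            self._tokens[key] = tokens - cost
            return True
        self._tokens[key] = tokens
        return False


# The FAQ's starting point: 60 requests/minute (1 token/second) with a burst of 100.
limiter = TokenBucket(capacity=100, rate=1.0)
```

Refilling lazily from the elapsed time means no background timer is needed. Each call does constant work per key.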

Sliding window — when exact counts matter

Store a timestamped list of requests in the last window. Drop entries older than now - window, then reject the request if the remaining count has already reached the limit. Memory cost is O(requests in window) per IP, and CPU cost is dominated by trimming the list. Use this when you need precise compliance (e.g., "exactly 100 requests per minute, no more") rather than a smoothed approximation.
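The following sketch implements the sliding window log for a single process, with an in-memory deque per key. SlidingWindowLog and allow() are illustrative names. Redis-backed versions commonly use a sorted set scored by timestamp for the same trim-then-count step.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLog:
    """Exact limit: at most `limit` accepted requests per key in any `window`-second span."""

    def __init__(self, limit: int, window: float, clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._log: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        log = self._log[key]
        # Trim timestamps that have fallen out of the window (oldest are at the left).
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False  # rejected requests are not recorded in this sketch
        log.append(now)
        return True
```

Whether rejected requests should also be logged is a policy choice. Logging them punishes clients that keep retrying, while skipping them (as here) counts only accepted traffic.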

Sliding window counter — the practical compromise

Keep two fixed-window counters (the current window and the previous one). Estimate the rate as the current count plus the previous count, weighted by the fraction of the previous window that still overlaps the sliding window. Cheaper than the log-based sliding window, more accurate than the naive fixed window. A common default for high-throughput limiters.
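Here is a sketch of one common formulation of the weighted estimate, previous × (1 − elapsed fraction of the current window) + current. SlidingWindowCounter and allow() are illustrative names, and state is kept in memory for a single process.

```python
import time
from typing import Callable, Dict, Tuple


class SlidingWindowCounter:
    """Approximate sliding window built from two fixed-window counters per key."""

    def __init__(self, limit: int, window: float, clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        # key -> (index of the current fixed window, count in it, count in the previous window)
        self._state: Dict[str, Tuple[int, int, int]] = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        index = int(now // self.window)
        last_index, current, previous = self._state.get(key, (index, 0, 0))
        if index == last_index + 1:
            previous, current = current, 0  # rolled into the next fixed window
        elif index > last_index + 1:
            previous, current = 0, 0  # idle for more than a full window
        # Fraction of the current fixed window already elapsed (0.0 to 1.0).
        elapsed = (now - index * self.window) / self.window
        # The previous count is weighted by how much of it still overlaps the sliding window.
        estimate = previous * (1.0 - elapsed) + current
        if estimate >= self.limit:
            self._state[key] = (index, current, previous)
            return False
        self._state[key] = (index, current + 1, previous)
        return True
```

Consider a client that used its full allowance of 10 at the end of one minute. Halfway through the next minute it gets only 5 more, because half of the earlier burst still counts. A naive fixed window would hand it all 10 at once.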

Leaky bucket — for output smoothing

Requests enter a queue; they leave at a constant rate. Excess requests overflow the bucket and are dropped. Less useful for inbound rate limiting (you usually want to allow bursts) but valuable for outbound smoothing — e.g., capping your application's calls to a third-party API at exactly N/sec to stay inside their rate limit.
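The following sketch shows the queue form of the leaky bucket used for outbound pacing. Each call is assigned the next free departure slot, spaced 1/rate seconds apart. A call is dropped when `capacity` calls are already waiting ahead of it. LeakyBucketPacer and schedule() are hypothetical names. The sketch assumes a single thread, and a multi-threaded caller would need a lock around schedule().

```python
import time
from typing import Callable, Optional


class LeakyBucketPacer:
    """Leaky bucket as a queue: calls leave at a fixed `rate` per second, and a call
    overflows (is dropped) if `capacity` calls are already queued ahead of it."""

    def __init__(self, rate: float, capacity: int, clock: Callable[[], float] = time.monotonic):
        self.interval = 1.0 / rate
        self.capacity = capacity
        self.clock = clock
        self._next_slot = float("-inf")  # earliest time the next call may leave the bucket

    def schedule(self) -> Optional[float]:
        """Return the time at which this call should be sent, or None if the bucket overflows."""
        now = self.clock()
        send_at = max(now, self._next_slot)
        # Each call already waiting occupies one interval of backlog.
        queued_ahead = (send_at - now) / self.interval
        if queued_ahead >= self.capacity:
            return None
        self._next_slot = send_at + self.interval
        return send_at
```

The caller sleeps until the returned time, for example with time.sleep(max(0.0, send_at - time.monotonic())), and then makes the third-party call. Because departures never come closer than one interval apart, the outbound rate stays at or below N/sec however bursty the callers are.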

Fixed window — what to avoid

The naive approach is to count all requests in the current minute and reset the counter on the boundary. The problem is that an attacker can hit the cap just before the boundary and again just after it. That puts 2× the intended limit through within about a second. Easy to implement, easy to exploit. Don't ship this in production for anything you care about.
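The following example makes the boundary problem concrete. FixedWindow is an illustrative class kept only to show the flaw. It receives 60 requests half a second before a minute boundary and 60 more half a second after it.

```python
from collections import defaultdict
from typing import Dict, Tuple


class FixedWindow:
    """Naive fixed-window counter; old windows are never evicted in this sketch."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self._counts: Dict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        bucket = (key, int(now // self.window))  # the counter effectively resets at each boundary
        if self._counts[bucket] >= self.limit:
            return False
        self._counts[bucket] += 1
        return True


limiter = FixedWindow(limit=60, window=60.0)
# 60 requests at t=59.5s (end of window 0), then 60 more at t=60.5s (start of window 1).
burst = [limiter.allow("198.51.100.4", 59.5) for _ in range(60)]
burst += [limiter.allow("198.51.100.4", 60.5) for _ in range(60)]
allowed_in_one_second = sum(burst)  # 120: twice the per-minute limit within one second
```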

Get the IP right — the X-Forwarded-For trap

If your application sits behind a CDN, load balancer, or reverse proxy, the peer IP your handler sees is the proxy's, not the client's. The real client IP is in headers set by the proxy:

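X-Forwarded-For: client, proxy1, proxy2 — a comma-separated chain. Each proxy appends the address of whoever connected to it, so entries run from the original client on the left to the most recent proxy on the right. Be careful: a client can send this header pre-filled, so the leftmost entries are whatever the client chose to write. Only entries appended by proxies you trust are reliable, which in practice means reading the chain from the right.
CF-Connecting-IP — Cloudflare's variant. It is a single value set by Cloudflare's edge, and it replaces any value the client sends. It is only trustworthy if your origin accepts connections exclusively from Cloudflare. A client that reaches the origin directly can set it to anything.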
True-Client-IP — Akamai and some other CDNs.
Forwarded: for=... — the RFC 7239 standard. Less common but spec-correct.

The right pattern is to configure your framework's trusted-proxy list explicitly. Trust forwarding headers only when the request came from a known proxy IP. For requests from anywhere else, ignore the headers and use the peer socket's IP. Most stacks expose this setting: Express's trust proxy setting, Rails's config.action_dispatch.trusted_proxies, and Uvicorn's --forwarded-allow-ips for FastAPI apps. Where the framework has no such setting, as with Django, middleware or the reverse proxy itself has to do the job. IPFerret's request headers tool shows exactly which headers a given request carried, which is invaluable when debugging a flaky setup.
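The following sketch shows what a trusted-proxy setting does with X-Forwarded-For. It assumes each trusted proxy appends the address it received the request from. The function client_ip and its parameters are hypothetical, so prefer your framework's own setting where one exists. The sketch ignores entries carrying ports and does not parse the Forwarded header.

```python
import ipaddress
from typing import Iterable, Optional


def client_ip(peer_ip: str, forwarded_for: Optional[str], trusted_proxies: Iterable[str]) -> str:
    """Pick the client IP for rate limiting, believing X-Forwarded-For only via trusted proxies."""
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_proxies]

    def is_trusted(addr: str) -> bool:
        ip = ipaddress.ip_address(addr)  # raises ValueError on malformed input
        return any(ip in net for net in networks)

    # The request did not come through one of our proxies, so the header is client-controlled.
    if not forwarded_for or not is_trusted(peer_ip):
        return peer_ip

    hops = [hop.strip() for hop in forwarded_for.split(",") if hop.strip()]
    # Walk right to left, skipping our own proxies; the first other address is the client.
    for hop in reversed(hops):
        try:
            if not is_trusted(hop):
                return hop
        except ValueError:
            # Malformed entry: stop trusting the header rather than keying limits on junk.
            return peer_ip
    # Every entry was one of our proxies; the leftmost is the best we have.
    return hops[0] if hops else peer_ip
```

Reading from the right is what defeats spoofing. A client that pre-fills "6.6.6.6" ends up left of the address your proxy appended, so it is never reached.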

The shared-IP problem

Per-IP rate limiting assumes IP-to-user is roughly one-to-one. In reality:

CGNAT pools share one public IP across many subscribers, often hundreds on residential networks. Mobile carriers rely on CGNAT especially heavily.
Corporate networks egress hundreds or thousands of employees through a small number of NAT'd public IPs.
Universities, libraries, hotels, airports all run shared egress. The same IP can represent dozens of concurrent legitimate users.
Mobile carriers shift customers between gateways periodically — the same user can switch IPs mid-session.

A naive "60 requests per minute per IP" limit, applied to an IP shared by 50 users, gives each user about 1 request per minute. That is low enough to break the application for everyone. The defensive moves:

Set the per-IP limit at a level that tolerates moderate sharing. Multiply a single-user limit by 3–5× to leave room for shared egress. If a real user makes 30 requests per minute at peak, 90–150 per IP is sane.
Add a tighter per-session/per-account limit on top. The IP limit stops broad abuse; the session limit catches the specific abusive user without hurting their neighbors on the same IP.
Use connection-type metadata. If your geo-IP provider tags the IP as "datacenter" or "mobile," tune the limits per category. Datacenter IPs sending consumer-style traffic deserve stricter limits; mobile-carrier IPs deserve more generous ones because they're heavily shared.
Distinguish well-known shared infrastructure. Some IPs deserve limits that differ from your defaults: known Tor exits, known VPN ranges, large corporate NAT IPs. IP-intelligence providers such as MaxMind (GeoIP2) and IPinfo sell this kind of classification.
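The sizing rule above reduces to a little arithmetic. In this sketch the category names and their multipliers are hypothetical placeholders, not values from any provider, so derive real ones from your own traffic. per_ip_limit is likewise an illustrative name.

```python
from typing import Dict

# Hypothetical per-category adjustments; tune these from observed traffic.
CATEGORY_FACTOR: Dict[str, float] = {
    "residential": 1.0,
    "mobile": 2.0,       # heavily shared carrier NAT: be more generous
    "datacenter": 0.25,  # consumer-style traffic from servers: be stricter
}


def per_ip_limit(single_user_peak: int, sharing_multiplier: float, category: str) -> int:
    """Requests per minute allowed for one IP, sized to tolerate shared egress."""
    factor = CATEGORY_FACTOR.get(category, 1.0)  # unknown category: no adjustment
    return int(single_user_peak * sharing_multiplier * factor)


# A user peaking at 30 requests/minute, with the 3-5x sharing allowance:
low = per_ip_limit(30, 3, "residential")   # 90
high = per_ip_limit(30, 5, "residential")  # 150
```

The per-session or per-account limit layered on top would stay near the single-user figure (30 per minute in this example). That limit is what catches one abusive user without touching their neighbors.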

IPv6 changes the calculus

In IPv6 a single subnet is a /64, and every customer gets at least one. Mobile devices commonly receive a /64, while home and business connections often receive a /56 or /48. Every device on the customer's network can pick its own /128 address inside that allocation. If you rate limit by /128, an attacker can rotate addresses trivially within their block. If you rate limit by /48 or /32, you risk treating a whole organization, or an entire ISP, as one client.

The conventional answer is to rate-limit IPv6 by /64, one bucket per subnet. That approximates one bucket per customer network and gives the same logical granularity as IPv4 per-customer rate limiting. Some operators also limit by /48 to catch abuse spread across whole allocation blocks, with the /64 limit on top for individual customers.
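The following sketch derives the bucket key with Python's standard ipaddress module. IPv4 addresses key on the full address, and IPv6 addresses key on their /64, or on a coarser prefix for the optional second layer. IPv4-mapped IPv6 addresses (::ffff:a.b.c.d, which dual-stack sockets report for IPv4 clients) are unwrapped first. rate_limit_key is a hypothetical helper name.

```python
import ipaddress


def rate_limit_key(addr: str, v6_prefix: int = 64) -> str:
    """Bucket key: the full address for IPv4, the enclosing prefix for IPv6."""
    ip = ipaddress.ip_address(addr)
    # Dual-stack sockets report IPv4 clients as ::ffff:a.b.c.d; treat them as IPv4.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    if ip.version == 4:
        return str(ip)
    # strict=False masks off the host bits, e.g. 2001:db8:abcd:12:1::5 -> 2001:db8:abcd:12::/64
    return str(ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False))
```

For the layered setup, call it twice per request: once with the default /64 and once with v6_prefix=48. Use each result as the key for its own limiter.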

What to do when the limit fires

Returning HTTP 429 Too Many Requests is the obvious answer. Make it informative:

Include Retry-After: 30 (or whatever the recovery time is) so well-behaved clients back off appropriately.
Include rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) on every response, not just the 429s, so clients can self-throttle before they trip the limit.
Don't return a 200 with an error body. It is an easy default to fall into, but it confuses CDNs and clients that look at status codes to decide whether to retry.
Log every 429 with enough metadata to investigate later — which endpoint, which limit, which IP, what user-agent. A good rate-limit logging story is the difference between "someone is abusing us" and "we have data on exactly what."
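The following sketch derives these headers from a token bucket's state. It assumes the limiter exposes its current token count, capacity, and refill rate. The helper names are hypothetical. X-RateLimit-Reset is reported here as seconds until the bucket is full, although some APIs send a Unix timestamp instead.

```python
import math
from typing import Dict, Tuple


def rate_limit_headers(tokens: float, capacity: int, rate: float) -> Dict[str, str]:
    """Headers to attach to every response, derived from a token bucket's state."""
    return {
        "X-RateLimit-Limit": str(capacity),
        "X-RateLimit-Remaining": str(max(0, math.floor(tokens))),
        # Seconds until the bucket is full again (some APIs send a Unix timestamp instead).
        "X-RateLimit-Reset": str(math.ceil((capacity - tokens) / rate)),
    }


def too_many_requests(tokens: float, capacity: int, rate: float) -> Tuple[int, Dict[str, str]]:
    """Status code and headers for a rejected request."""
    headers = rate_limit_headers(tokens, capacity, rate)
    # Retry-After is in whole seconds: wait until at least one token has refilled.
    headers["Retry-After"] = str(max(1, math.ceil((1.0 - tokens) / rate)))
    return 429, headers
```

Rounding Retry-After up, and never below one second, ensures that a client honoring it finds a token waiting when it returns.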

Combine signals — never rely on IP alone

The strongest rate-limiting systems use IP as one signal among several:

Account ID for authenticated requests — the most accurate identifier when present.
Session cookie for anonymous-but-stable users.
Browser fingerprint (header combinations, TLS JA3 fingerprint) for recognizing the same client across rotating IPs that otherwise look unrelated.
Behavioral patterns — characteristic request sequences, timing, user-agent consistency.

A token bucket per IP catches volumetric abuse. Per-account limits catch account abuse, and per-fingerprint limits catch account-rotation abuse. Behavioral analysis catches the sophisticated attacks that successfully rotate everything else. The layers compound.
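The following sketch shows the layering. Each signal gets its own limiter keyed by that signal, and a request passes only if every applicable layer allows it. Checking all layers before consuming from any of them means a request blocked by one layer does not drain the others. Limiter, QuotaCounter, and check_request are hypothetical names. QuotaCounter is a deliberately trivial stand-in (a fixed allowance with no refill), where a real deployment would use a token bucket or sliding-window limiter per layer. The check-then-consume step is also not atomic across concurrent requests.

```python
from typing import Dict, Optional, Protocol


class Limiter(Protocol):
    def would_allow(self, key: str) -> bool: ...
    def consume(self, key: str) -> None: ...


class QuotaCounter:
    """Trivial stand-in limiter: a fixed allowance per key, never refilled."""

    def __init__(self, quota: int):
        self.quota = quota
        self.used: Dict[str, int] = {}

    def would_allow(self, key: str) -> bool:
        return self.used.get(key, 0) < self.quota

    def consume(self, key: str) -> None:
        self.used[key] = self.used.get(key, 0) + 1


def check_request(layers: Dict[str, Limiter], identity: Dict[str, Optional[str]]) -> Optional[str]:
    """Return None if every applicable layer allows the request, else the blocking layer's name."""
    # A layer applies only when the request carries that signal (no account when anonymous).
    applicable = {name: key for name, key in identity.items() if key is not None and name in layers}
    for name, key in applicable.items():
        if not layers[name].would_allow(key):
            return name  # nothing consumed yet, so other layers' budgets are untouched
    for name, key in applicable.items():
        layers[name].consume(key)
    return None
```

Returning the name of the blocking layer also gives the 429 log line its "which limit" field.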

Where to put the limiter

Three common architectural placements, each with trade-offs:

Edge / CDN layer. Cloudflare Rate Limiting, AWS WAF, Fastly's edge rate limiter. Fastest response, lowest cost per blocked request, runs before the request reaches your origin. The right place for volumetric and anti-DDoS limits.
API gateway / load balancer. Kong, Envoy, NGINX, HAProxy. Per-route configuration, integrates with your monitoring. The right place for tenant-aware or per-API-key limits.
Application layer. In your handler code, via Redis or a specialized library. Most flexible, but slowest and most expensive per blocked request, because the request has already reached your origin. The right place for business-logic limits that depend on application state.

Most production systems use a layered combination: edge limits for the obvious abuse, gateway limits per tier, application limits for the per-feature business rules.
