Rate Limiting Your SaaS API in 2026: The AI Scraper Problem, Token Buckets, and the Layered Defense That Actually Works

The first time abusive traffic cost me real money I was asleep. An endpoint on a side project of mine, a small “ask my docs” widget powered by GPT, got scraped twenty-three thousand times between 1am and 4am by a single client. The client identified itself as ChatGPT-User. I had no rate limit on the endpoint. I had no auth, because the widget was meant to be public. I had a $40 OpenAI budget alert that fired at 5am and a $312 invoice waiting for me at 7am. The traffic spike did not break the app. It just quietly drained the budget, hammered the API key, and made a billing screenshot I now keep pinned in a Slack channel called lessons-learned.

That was the morning I stopped treating rate limiting as a “later” problem. The internet I grew up on assumed clients were polite by default. The internet of 2026 has a billion AI agents acting on behalf of users, and politeness is no longer the default. The default is “as fast as my retry policy allows.”

This is the post I wish I had read before that overnight scrape. It covers what rate limiting actually buys you, the three algorithms worth knowing, where to put the limiter in your stack, the AI scraper wave that broke a lot of assumptions, and the layered defense I run today.


What Rate Limiting Actually Buys You

A rate limiter is a piece of code that says “no” to too many requests. That is the whole definition. As with feature flags, everything else is engineering on top of one simple mechanism.

The reason to add one is rarely “stop a malicious attacker.” Most malicious attackers will route around your rate limit through rotating proxies and you will need a different tool. The reason to add a rate limiter, the honest one, is one of three.

The first is cost protection. You have an endpoint that costs you money to serve. Every call to your AI feature is an OpenAI or Anthropic invoice. Every call to your image generation endpoint is a Replicate bill. Every minute of compute is a Vercel function-time charge. A rate limit caps the worst case. Without one, a single misbehaving client can turn your weekend into a billing investigation.

The second is fairness. You have ten users. One of them, intentionally or not, is consuming 90% of your capacity. The other nine are getting slow responses. A per-user rate limit forces a fairer distribution and stops one heavy client from making the product feel broken for everyone else.

The third is system protection. Your database has a connection pool. Your background queue has a throughput ceiling. Your downstream APIs have their own limits. Without a rate limiter at the edge, a traffic spike of any shape can blow through one of those internal limits and cascade into a real outage.

For indie SaaS the first reason almost always dominates. Cost protection. Everything else is a bonus. If you are running AI features, image generation, video transcoding, or anything where the per-request cost is meaningful, rate limiting is not optional. It is part of basic financial hygiene.


The Four Things People Confuse With Rate Limiting

Most of the time when someone says “I have rate limiting,” they mean one of four things, and only one of them is actually rate limiting. The distinction matters because the wrong choice leaves you exposed in a way you will not notice until something goes wrong.

Authentication is not rate limiting. Requiring an API key means you know who the caller is. It does not stop the caller from hammering you. A bad actor with a valid API key can still drain your budget.

Authorization is not rate limiting. Checking that the caller is allowed to read this resource is correctness, not throttling. A user authorised to call an expensive endpoint can call it ten thousand times in a row.

Pricing tiers are not rate limiting. “Free plan gets 100 calls a month” is a pricing decision. The enforcement of that decision happens to use rate limiting under the hood, but the limit itself is a billing limit, not a defense. It assumes the user wants to stay on the right side of their plan.

A WAF is not rate limiting either. Cloudflare’s WAF, Vercel’s edge firewall, AWS WAF, they all filter known-bad patterns and bots. They are a complementary layer, not a replacement. A WAF will not stop a single client from making your endpoint cost a fortune. It will only catch the obvious cases.

Real rate limiting is “this caller, in this window, gets at most this many requests, and then I refuse them.” The caller can be authenticated or not. The endpoint can be public or private. The limit applies regardless.


The Three Algorithms Worth Knowing

There are a half dozen rate limiting algorithms in the textbooks. Three of them matter for real SaaS. The other three are interesting but not the place to start.

Fixed Window

The simplest one. Every minute, every caller gets N requests. When the minute ends, the counter resets. Easy to implement, easy to understand, easy to abuse.
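
A minimal sketch in TypeScript, assuming an Upstash Redis client and illustrative numbers:

import { Redis } from '@upstash/redis';

const redis = Redis.fromEnv();

// Fixed window: one counter per caller per clock-aligned window.
async function fixedWindowAllow(callerId: string, limit = 60, windowSeconds = 60) {
  const window = Math.floor(Date.now() / 1000 / windowSeconds);
  const key = `rl:fixed:${callerId}:${window}`;

  const count = await redis.incr(key);
  if (count === 1) {
    // First hit in this window: add a TTL so stale counters clean themselves up.
    await redis.expire(key, windowSeconds);
  }
  return count <= limit;
}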

The problem with fixed windows is the boundary. If your limit is 60 per minute and a caller sends 60 requests at 12:00:59 then another 60 at 12:01:01, they have sent 120 requests in two seconds and stayed inside the limit. The fix is the next algorithm.

Sliding Window

The smarter cousin. Instead of resetting at the top of each minute, the window slides continuously. “How many requests has this caller made in the last 60 seconds, right now?”

There are two ways to implement sliding window. One is to store every request timestamp in a sorted set and count entries in the window. Accurate but expensive in memory at scale. The other is the “counter” approximation that weights the previous window proportionally. Less accurate but cheap, and good enough for almost every indie use case.
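
The approximation is one line of arithmetic. A sketch, with the two window counts assumed to come from counters like the fixed-window one above:

// Sliding window counter approximation: weight the previous fixed window
// by how much of it still overlaps the sliding window, then add the current one.
function approxSlidingCount(
  previousCount: number,
  currentCount: number,
  elapsedFraction: number, // 0..1: how far we are into the current window
): number {
  return previousCount * (1 - elapsedFraction) + currentCount;
}

// 30 seconds into the minute, 50 requests last minute, 20 so far this minute:
// 50 * 0.5 + 20 = 45 requests counted against the limit.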

The Upstash Ratelimit library uses sliding window with the counter approximation as its default. That is also my default. The accuracy hit is invisible. The cost saving is real.

Token Bucket

The algorithm that gives you bursts. Every caller has a bucket that holds N tokens. Each request consumes one token. The bucket refills at a constant rate. If the bucket is empty, the request is denied.

The interesting property is that a caller can burst up to N requests in an instant (drain the bucket), but their sustained rate is bounded by the refill speed. This matches how real users behave. Humans do not click steadily at one request per second. They click in bursts.
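
A sketch of the refill math, in memory for clarity; a real serverless deployment keeps the bucket state in Redis so it survives across invocations:

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private capacity: number,        // max burst size (N)
    private refillPerSecond: number, // sustained rate
  ) {
    this.tokens = capacity;
  }

  tryConsume(): boolean {
    // Refill lazily based on elapsed time; no background timer needed.
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;

    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}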

I use token bucket for any endpoint where the natural usage is bursty. Search-as-you-type, autocomplete, batch operations the user initiates and then waits for. I use sliding window for anything where I want to bound sustained behaviour without caring about bursts. AI calls. Webhook endpoints. Anything expensive.

In 2026 the choice is essentially “Upstash Ratelimit defaults are fine.” Pick sliding window for cost protection, pick token bucket for user-facing throttles. Move on.


Where Do You Limit? At The Edge Or In The App

A request travels through layers before it hits your business logic. The order, roughly, is: DNS, CDN/edge network, load balancer, your function or server, then your application code. Rate limiting can happen at any of those layers, and the layer you pick changes the trade-offs in a way that nobody mentions in the tutorials.

At the edge is the cheapest place to drop a request. Cloudflare, Vercel’s edge, AWS CloudFront. The request never reaches your function. You do not pay for the compute. You do not stress your database. The rejection is essentially free.

The catch at the edge is that you only know what the network knows. The IP address, the headers, the URL, the user agent. You do not know which user this is unless they have authenticated. You can do coarse IP-based limits at the edge, and that is what most CDNs ship by default.

In the application is more expensive per rejected request, but you know everything. The user ID. The plan tier. The feature being called. The cost of the work. You can apply nuanced limits that depend on business logic. The trade is that the request has already reached your function and you are paying for the compute to reject it.

In the queue or worker is the third spot, often ignored. If a request enqueues a background job, the rate limit can apply when the worker picks up the job, not when the request comes in. This is the right shape when the work is cheap to enqueue and expensive to perform. The user sees their request accepted instantly. The job sits in the queue if you are over the rate limit. This pattern is genuinely the right answer for AI-heavy workloads, and it ties straight into the background job patterns I default to.
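
A sketch of the worker side with Upstash Ratelimit; the queue helpers (requeueWithDelay, callExpensiveApi) are hypothetical stand-ins for whatever queue you actually run:

import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

// One shared bucket for everything that touches the downstream API.
const downstreamLimiter = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.tokenBucket(10, '1 s', 10), // match the provider's ceiling
  prefix: 'rl:worker:openai',
});

async function processJob(job: { id: string; payload: unknown }) {
  const { success, reset } = await downstreamLimiter.limit('downstream:openai');
  if (!success) {
    // Over the downstream budget: put the job back with a delay, do not drop it.
    await requeueWithDelay(job, reset - Date.now()); // hypothetical queue helper
    return;
  }
  await callExpensiveApi(job.payload); // hypothetical downstream call
}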

My layered setup combines all three. Edge for the cheap IP-based blanket. Application for per-user, per-endpoint business-aware limits. Worker for downstream-API-aware throttling. None of them on their own is enough. Together they are.


My Default Setup In 2026

For a new project I default to three things, in this order.

Cloudflare in front of everything. The free plan covers the basics. Bot Fight Mode catches the obvious bots. Cloudflare’s basic rate limiting handles “1000 requests a minute from a single IP” without writing any code. This is the layer that catches casual scrapers and the long tail of low-effort abuse. It is also the layer that absorbs DDoS attempts so they never reach your origin.

Upstash Ratelimit in the application. I write per-user, per-endpoint limits in the request handler using the Upstash Redis Ratelimit library. It is serverless-friendly, edge-compatible, and the API is short enough to memorise.

import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const redis = Redis.fromEnv();

export const aiLimiter = new Ratelimit({
  redis,
  limiter: Ratelimit.slidingWindow(20, '1 m'),
  analytics: true,
  prefix: 'rl:ai',
});

export async function POST(req: Request) {
  // getUserId is app-specific: resolve the caller from the session or API key.
  const userId = await getUserId(req);
  const { success, limit, remaining, reset } = await aiLimiter.limit(userId);

  if (!success) {
    // reset is a unix timestamp in milliseconds; convert to seconds-from-now.
    const retryAfter = Math.ceil((reset - Date.now()) / 1000);
    return new Response(
      JSON.stringify({
        error: 'rate_limited',
        message: 'You have hit the per-minute limit for this endpoint',
        retry_after: retryAfter,
      }),
      {
        status: 429,
        headers: {
          'Content-Type': 'application/json',
          'Retry-After': String(retryAfter),
          'X-RateLimit-Limit': String(limit),
          'X-RateLimit-Remaining': String(remaining),
          'X-RateLimit-Reset': String(reset),
        },
      },
    );
  }

  return handleAiRequest(req);
}

The limit numbers should be set based on real usage. I usually start by looking at what my heaviest legitimate user does in a typical hour, doubling it, and setting that as the limit. Then I monitor for a week and adjust. Pulling a number out of thin air is the source of most “why is my legitimate user being throttled” tickets.

A spending alert at the provider. The third layer is not rate limiting at all. It is a budget alert on OpenAI, Anthropic, Replicate, whatever you are paying for. Set the budget to about double your expected monthly cost. Set the alert at 50% and another at 100%. If both your rate limiter and your alerts fail, this is the backstop that wakes you up before the invoice does. This is exactly the same instinct as the spending alerts I covered in the Claude pricing survival guide.

That is the whole stack. Cloudflare for the blanket. Upstash for the per-user math. Provider budget for the backstop. Three layers. Maybe two hours of setup. It would have saved me $312 on the night ChatGPT-User showed up.


The AI Scraper Problem

The biggest change in rate limiting between 2022 and 2026 is the existence of polite-looking AI agents that scrape at industrial scale. They identify themselves. They respect robots.txt to varying degrees. They are also responsible for a meaningful fraction of all bandwidth and compute on the open web, and most of them do not match anyone’s mental model of “a bot.”

The user agents to know:

  • GPTBot is OpenAI’s training crawler. Respects robots.txt.
  • ChatGPT-User is OpenAI’s “user is asking ChatGPT to fetch this URL” agent. Usually does not respect robots.txt because the user is the one initiating the fetch.
  • OAI-SearchBot is OpenAI’s search index crawler.
  • ClaudeBot is Anthropic’s crawler.
  • Claude-User and Claude-SearchBot are Anthropic’s user-initiated and search bots.
  • PerplexityBot is Perplexity’s crawler.
  • Perplexity-User is their on-demand fetch agent.
  • Google-Extended is the opt-out signal for Gemini model training.
  • CCBot is Common Crawl, which feeds half the training datasets in existence.

There are dozens more. The list keeps growing. By the time you read this there will be a new one.

The honest assessment of what robots.txt buys you:

It blocks the training crawlers reliably. GPTBot, ClaudeBot, CCBot, Google-Extended, all of them will read your robots.txt and obey a Disallow. If you do not want your content used for AI training, this is the cheapest, most boring, most effective control you have.

It does not block the user-initiated fetches. ChatGPT-User and Perplexity-User and Claude-User are operating “on behalf of a user,” and the prevailing interpretation in 2026 is that user-initiated fetches do not need to respect robots.txt the way crawler-initiated fetches do. Your robots.txt is not going to save your endpoint from a thousand “look up the answer on this URL” requests per minute.

It does not block the bots that lie. There is a non-trivial chunk of automated traffic in 2026 that just uses Mozilla/5.0 and pretends to be a browser. You cannot detect them by user agent alone. You need behavioural signals.

The right shape of defense against AI scraping in 2026 has three pieces.

robots.txt is for crawlers. Set it correctly. Block GPTBot, ClaudeBot, CCBot, PerplexityBot, anything you do not want training on your content. Allow the search bots if you want to be indexed.
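
A starting point, matching the list above; adjust to taste:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else, including the search bots, stays allowed
User-agent: *
Allow: /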

Rate limiting is for everyone else. Especially the user-initiated agents. A single Perplexity query that fans out to fetch five URLs from your site is fine. A user looping a script that asks Perplexity the same question a thousand times is not. The rate limiter is what catches the second case.
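
One way to draw that line in code, reusing the Upstash setup from earlier; the agent list and the numbers are illustrative, not a canon:

import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const redis = Redis.fromEnv();

// Self-identified AI agents get a tighter public-endpoint budget than browsers.
const AI_AGENT_RE = /ChatGPT-User|Claude-User|Perplexity-User/i;

const humanLimiter = new Ratelimit({
  redis,
  limiter: Ratelimit.slidingWindow(60, '1 m'),
  prefix: 'rl:public:human',
});

const agentLimiter = new Ratelimit({
  redis,
  limiter: Ratelimit.slidingWindow(10, '1 m'),
  prefix: 'rl:public:agent',
});

function pickLimiter(req: Request): Ratelimit {
  const ua = req.headers.get('user-agent') ?? '';
  return AI_AGENT_RE.test(ua) ? agentLimiter : humanLimiter;
}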

Bot detection is for the liars. This is where Vercel BotID, Cloudflare Turnstile, and similar services start to matter. They use behavioural signals (mouse movement, timing patterns, IP reputation, TLS fingerprints) to distinguish real users from automation regardless of user agent. They are not perfect. They are better than the alternative, which is nothing.


Layered Defense In Practice

The single biggest mistake I see indie devs make is picking one of the three layers and stopping. Robots.txt only. A WAF only. An application rate limiter only. Each one alone has gaps the others fill.

Here is how the layers compose in a real production setup.

Layer 1: Edge / WAF. Cloudflare or Vercel firewall. Drop the easy bots, absorb the basic DDoS, apply a “1000 per minute per IP” blanket. This is the cheap layer that catches 90% of the noise without you writing any code.

Layer 2: Bot detection. BotID or Turnstile on the endpoints that cost money. The bot check is fast and cheap. It does not introduce a CAPTCHA in the UI unless the request looks suspicious. A real user almost never sees a challenge. A scripted agent almost always does.

Layer 3: Application rate limiting. Per-user, per-endpoint, business-aware. This is where you draw the line between “free tier can call this endpoint 100 times a month” and “pro tier can call it 10000.” The rate limiter is identity-aware, so the rules are richer than anything the edge can do. A sketch follows the layer list.

Layer 4: Worker-level throttling. For background work that hits expensive APIs, throttle at the worker, not just at the request. This protects you from your own retry loops, not just from external traffic. If a job fans out into ten API calls and one of them fails, the retry should respect the same rate limit as the original.

Layer 5: Provider budget alerts. Set them at 50% and 100% of expected monthly cost. This is the backstop that catches everything the first four layers missed.
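
Layer 3 in code, reusing the Upstash client from earlier; the tier names and quotas are illustrative:

// One limiter per plan tier; pick by the caller's plan at request time.
const tierLimiters = {
  free: new Ratelimit({
    redis,
    limiter: Ratelimit.slidingWindow(100, '30 d'),
    prefix: 'rl:tier:free',
  }),
  pro: new Ratelimit({
    redis,
    limiter: Ratelimit.slidingWindow(10000, '30 d'),
    prefix: 'rl:tier:pro',
  }),
};

async function limitByPlan(userId: string, plan: 'free' | 'pro') {
  return tierLimiters[plan].limit(userId);
}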

This sounds like a lot. In practice it is two services (Cloudflare, Upstash), a library (Upstash Ratelimit), a bot check (BotID or Turnstile), and a checkbox on the API provider’s billing page. Total setup is one afternoon. Maintenance is essentially nothing once it is wired up.

The version of this setup that does not work, and that I have shipped more than once, is “Cloudflare is enough, I do not need an application limiter.” Cloudflare is excellent. Cloudflare does not know that this request is a $0.40 OpenAI call and that one is a $0.0002 database read. Application-level limits are the only ones that understand cost. Without them you are protecting your bandwidth, not your wallet.


What To Return When You Reject

A 429 response is not the end of the conversation. It is a message to the caller that says “you can try again, here is when.” How you write that message matters.

Always set Retry-After. Either in seconds or as an HTTP date. Polite clients (including most AI agents) will respect it. Impolite ones will not, but at least you tried.

Always set rate limit headers. X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. Or the standardised RateLimit headers from the IETF draft. Either is fine. Pick one and be consistent. Your API consumers will write client code against them, and good clients will back off gracefully when remaining is low.

Return a useful body. A JSON error with a code, a message, and a hint. {"error": "rate_limited", "message": "You have hit the per-minute limit for this endpoint", "retry_after": 42}. Better than “Too Many Requests” on its own. Way better than a 500.

Do not 200 silently. I have seen rate limiters that “soft-fail” by returning a 200 with an empty body so the user does not see an error. This is worse than a 429. The client now has no idea anything is wrong. They keep hammering. Their backoff never kicks in. You have made everything strictly worse to avoid an error in the UI.

The pattern in the code block earlier (status 429, Retry-After, headers, JSON body) is the boring correct shape. Use it.


Observability For Rate Limits

A rate limiter you cannot see is a rate limiter you cannot tune. Three signals to wire up on day one.

A counter of rate-limited requests, by endpoint and by user. Whatever your metrics stack is, Prometheus, Datadog, Vercel Analytics, log this. The shape that matters is “how many 429s per minute, broken down by which endpoint and which user.” If a single user is generating 80% of your 429s, that is either an abuse signal or a legitimate user who needs a higher tier. Both are actionable.
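
A sketch with prom-client, with one deliberate design choice: the per-endpoint count goes into the metric, the per-user breakdown goes into logs, because user IDs as metric labels blow up cardinality:

import { Counter } from 'prom-client';

const rateLimited = new Counter({
  name: 'rate_limited_requests_total',
  help: 'Requests rejected with a 429',
  labelNames: ['endpoint'],
});

// In the 429 branch of the handler:
rateLimited.inc({ endpoint: '/api/ai' });
// Per-user detail belongs in structured logs, where cardinality is cheap.
console.log(JSON.stringify({ event: 'rate_limited', endpoint: '/api/ai', userId }));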

An alert on a spike. A sudden 10x increase in 429s usually means either an abuse event or a misbehaving deploy on the client side. Both are worth being woken up for in a way that “individual user hit their limit” is not.

A dashboard you actually look at. Same instinct as the observability stack I run as a solo dev. The dashboard does not have to be fancy. It has to exist and be visible.

The most surprising thing the dashboard tells you is which endpoints are limit-bound and which are not. Half the time the endpoint you spent two days tuning the limit on barely fires it. The other half, an endpoint you never thought about is generating 90% of the throttling. That is also the endpoint you should write your next blog post about, because something is happening there.


What I Got Wrong

Most of what I had to learn about rate limiting the hard way has nothing to do with algorithms. I knew about token buckets. I had read about sliding windows. The bugs all came from somewhere else.

I rate-limited by IP when I should have rate-limited by user. A user on mobile data behind a carrier-grade NAT shares an IP with hundreds of other people. The first time one of them hit the limit, the rest got throttled too. I had no idea why my error rate spiked. The fix was to use the user ID for authenticated requests and only fall back to IP when there was no user.
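
The fix in code; the header parsing is illustrative, since the right client-IP source depends on your platform:

// Prefer the authenticated user ID; fall back to IP only for anonymous traffic.
function rateLimitKey(req: Request, userId: string | null): string {
  if (userId) return `user:${userId}`;
  const ip = req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() ?? 'unknown';
  return `ip:${ip}`;
}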

I forgot to limit my own internal calls. A background job that retried an API call inside a loop was not subject to the same limiter as the external user. It happily made 60 calls a second. The provider’s rate limiter started 429ing me. I blamed the provider. The bug was mine.

I made the limit too tight and triggered legitimate users into retry storms. The retry storm hit the same limit. The limit was holding. The product was broken. The lesson was that a rate limit set without a feedback loop from real users is a guess, and a wrong guess can be worse than no limit at all.

I trusted user agents. I let through anything claiming to be Googlebot without a verification check, and a large share of traffic claiming to be Googlebot is not Google. The fix is a reverse DNS lookup, confirmed with a forward lookup, to prove the IP actually belongs to Google. The free version of this check is in every CDN’s bot management. Use it.
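
A sketch of the check in Node, using Google's documented verification domains:

import { promises as dns } from 'node:dns';

// Verify a claimed Googlebot: reverse-resolve the IP, check the domain,
// then forward-resolve the hostname to confirm it maps back to the same IP.
async function isRealGooglebot(ip: string): Promise<boolean> {
  try {
    const [hostname] = await dns.reverse(ip);
    if (!hostname) return false;
    if (!hostname.endsWith('.googlebot.com') && !hostname.endsWith('.google.com')) {
      return false;
    }
    const { address } = await dns.lookup(hostname);
    return address === ip;
  } catch {
    return false;
  }
}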

I assumed rate limiting was enough. The day a botnet of three thousand residential proxies started hitting an endpoint at one request per IP per hour, my per-IP limit did nothing. The total volume was fine for each IP and disastrous in aggregate. The fix was a bot detection layer, not a tighter rate limit. The right tool was not the one I had.

None of these are exotic. All of them are the kind of thing you only learn by shipping the version without them and watching the failure mode.


The Real Lesson

Rate limiting is not a feature you bolt on once and forget. It is part of the financial nervous system of any SaaS that costs money to serve. The day you ship an expensive endpoint without one is the day you discover what your worst-case bill looks like.

The setup that earns its place in 2026 is layered, boring, and cheap. Cloudflare or Vercel firewall at the edge. Upstash Ratelimit in the application, per-user, per-endpoint. Bot detection on the endpoints that cost real money. Worker-level throttling on background work that hits paid APIs. Provider budget alerts as the final backstop. None of it is hard. All of it is the difference between a clean weekend and a $312 OpenAI invoice you did not approve.

The internet of 2026 is full of AI agents acting on behalf of users at speeds that humans never matched. Most of them are well-intentioned. Some of them are not. All of them assume your endpoint can handle whatever load they decide to send. Your job is to make sure that assumption is wrong in the cases where it matters.

If you only do one thing this week, set a budget alert on every paid API your app talks to. Everything else can be wired up over a weekend. The alert is the one that wakes you up before the bill arrives.

You only have to learn this once. Hopefully you do not have to learn it the way I did.