Prompt Caching in 2026: Anthropic vs OpenAI vs Gemini for Production Apps

I opened the billing dashboard for one of my AI features a few months ago and felt my stomach drop. The feature was working beautifully. Users loved it. Traffic was climbing. And the monthly spend had quietly crossed a line that made me open a second tab to check the math twice. I had been telling myself caching was on the “optimize later” list for about three months. That morning it moved to the top.

What I learned over the next two weeks is that prompt caching is not an optimization. It is the difference between a production AI feature that pencils out and one that eats your margin alive. Get it right and a 200,000 token system prompt goes from budget-breaking to a rounding error. Get it wrong and your cache hit rate sits at 4 percent while you wonder why the bill keeps growing.

Every major provider ships caching now. Anthropic, OpenAI, and Gemini all have their own take on it, and the differences matter more than the docs make obvious. The pricing models diverge. The TTLs diverge. The rules about what invalidates a cache entry are different in ways that will bite you if you assume they work the same. I have shipped cached prompts on all three and broken something on all three. Here is the field guide I wish I had before I started.


Why Caching Became The Whole Ball Game

For most of 2023 and 2024, prompt caching was an optional efficiency trick. You could skip it and still build working AI features. The context windows were small enough and the prompts were short enough that the raw input token bill never got scary.

That changed in two steps. First, context windows grew. A 1 million token window on Claude Opus 4.7 and a 2 million token window on Gemini 2.5 Pro made long context architectures realistic for use cases that used to require RAG. Second, providers noticed that charging full input price for the same 180,000 token system prompt on every request was going to push developers back to retrieval out of pure sticker shock. Caching was the escape valve.

The economics now look like this for a typical long context feature:

  • Without caching, a 200,000 token system prompt at Claude input rates runs about 60 cents per request
  • With caching on a warm cache, the same prompt runs around 6 to 8 cents per request
  • At 10,000 requests per day, that is the difference between $6,000 and $600
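The arithmetic above can be sketched in a few lines. The rates here are assumptions for illustration only: roughly $3 per million input tokens and cache reads at about 10 percent of that, matching the 60-cent and 6-cent figures in the list.

```python
# Back-of-envelope cost model for a long cached system prompt.
# Prices are illustrative assumptions, not any provider's rate card.
def daily_prompt_cost(prompt_tokens, requests_per_day,
                      input_price_per_m=3.00,     # assumed $/M input tokens
                      cached_read_fraction=0.10,  # assumed read discount
                      cached=False):
    per_request = prompt_tokens / 1_000_000 * input_price_per_m
    if cached:
        per_request *= cached_read_fraction
    return per_request * requests_per_day

cold = daily_prompt_cost(200_000, 10_000)               # $6,000/day uncached
warm = daily_prompt_cost(200_000, 10_000, cached=True)  # $600/day warm cache
```

This ignores output tokens and cache-write premiums, which narrow the gap slightly in practice, but the order-of-magnitude shape holds.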

An order of magnitude. On a single feature. This is not an “optimization” in any normal sense. It is the price difference between “this business works” and “this business does not.”

The catch is that the 90 percent discount only shows up if you do everything right. A single whitespace change in the cached portion can reset the cache. A TTL that expires in the middle of your daily traffic window wipes out the savings. A multi-tenant design that seemed obvious on paper can turn caching into an accounting nightmare. Context engineering is the umbrella skill here, and caching is the single highest-leverage piece of it.


How Each Provider Actually Implements It

The three providers all solved the same problem, but they made very different choices about ergonomics, pricing, and constraints. If you treat them as interchangeable you will miss the places where each one has a quiet advantage.

Anthropic (Claude)

Anthropic introduced prompt caching in August 2024 and it has become the most developer-controllable of the three. You place cache_control breakpoints explicitly in your prompt, up to four of them, and everything before each breakpoint is cached as a prefix.

The defaults in 2026:

  • Five-minute TTL on the default “ephemeral” cache
  • One-hour TTL available with a slightly higher write cost
  • Cache writes cost 1.25x the normal input price on the five-minute tier, 2x on the one-hour tier
  • Cache reads cost about 10 percent of the normal input price
  • Minimum cacheable block of 1,024 tokens for Opus 4.7 and Sonnet 4.6

The explicit breakpoint model is the thing I like most. You can cache your system prompt, your tool definitions, and the first chunk of conversation separately. You can decide exactly what is stable and what is not. And you can see in the response metadata which cache blocks hit and which did not, which makes debugging a cold cache actually possible.

The gotcha is that the breakpoint position matters. Everything before a breakpoint must be byte-for-byte identical across requests. A single extra newline, a trailing space, a changed date in the system prompt, and the prefix misses. I once spent an afternoon tracking down a 0 percent hit rate that turned out to be a timestamp in the system instruction. Remove the timestamp, cache works.
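To make the breakpoint model concrete, here is a sketch of a request body with a single `cache_control` breakpoint on the system prompt, based on the shape of the Messages API. Treat the field names and the model id as an approximation to verify against current docs, not gospel. The point is that the system block is built once and reused byte-for-byte, while only the user turn varies.

```python
# Stable system prompt built once, outside the request path, so every
# request sends the exact same bytes before the cache breakpoint.
SYSTEM_PROMPT = "...large stable instructions and docs..."  # placeholder

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-6",  # hypothetical model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # everything up to and including this block is the cached prefix
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # dynamic, uncached portion: changes on every request
        "messages": [{"role": "user", "content": user_message}],
    }
```

Because `SYSTEM_PROMPT` is a module-level constant rather than an f-string rebuilt per request, there is no opportunity for a timestamp or stray interpolation to sneak into the cached prefix.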

OpenAI

OpenAI rolled out automatic prompt caching in late 2024 and has kept the interface intentionally minimal. There are no breakpoints to set. The API inspects every request, looks for a cached prefix of at least 1,024 tokens, and charges the cached rate on the portion that matches.

The defaults in 2026:

  • Automatic caching with no opt-in required
  • Cache TTL of 5 to 10 minutes depending on load
  • Cache writes are free (no price premium on the first use)
  • Cache reads are 50 percent of the normal input price
  • Minimum prefix match of 1,024 tokens

The simplicity is genuinely pleasant when it works. You structure your prompt with the stable portion first, you keep the dynamic portion at the end, and the system figures it out. For a lot of use cases this is all you need.

The downsides show up when you need precision. You do not control where the cache breakpoint lands. You cannot cache multiple disjoint blocks the way you can with Anthropic. The read discount is 50 percent rather than 90 percent, which sounds small but compounds fast at volume. And the TTL is variable and cannot be extended the way Anthropic's one-hour tier or Gemini's explicit TTLs can, which makes it harder to plan around.
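Since there are no knobs to turn, the one lever you really have with OpenAI is measurement. The usage payload reports cached tokens under `prompt_tokens_details.cached_tokens` in the Chat Completions response shape; verify the field name against the current docs for your SDK version. A small helper turns that into a per-request hit fraction:

```python
# Fraction of prompt tokens billed at the cached rate on one request.
# Field names follow the Chat Completions usage payload as I know it.
def cached_fraction(usage: dict) -> float:
    prompt = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# Example usage payload from a warm-cache request:
usage = {"prompt_tokens": 12_000,
         "prompt_tokens_details": {"cached_tokens": 10_240}}
fraction = cached_fraction(usage)  # ~0.85 here
```

Log this per request and you will notice an invalidation within minutes of a deploy rather than at the end of the billing cycle.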

For my money, OpenAI caching is the right default for simple cases and a frustrating ceiling for complex ones.

Google (Gemini)

Gemini takes a third approach that I would call “explicit and durable.” You create a cached content object with its own identifier, set an explicit TTL, and then reference that identifier in subsequent requests.

The defaults in 2026:

  • TTLs from 1 minute to 24 hours, you pick
  • Cache storage is billed per hour the content sits in the cache
  • Cache reads are roughly 25 percent of normal input price
  • Minimum cacheable content of 4,096 tokens on Gemini 2.5 Pro
  • Cache objects are regional and scoped to your API key

The long TTL option is the killer feature. On a stable doc set, you can create a cache entry once in the morning and have it serve requests all day. No cold starts, no mid-day TTL refreshes, no worrying about whether your traffic pattern keeps the cache warm. For batch jobs, evals, or low-traffic features that only see requests every hour or two, this is a huge win because the five-minute TTLs on Anthropic and OpenAI would simply expire between uses.

The downside is the storage cost. If you cache content and it sits unused, you still pay for the hours. On a 200,000 token cached object held for a day, the storage bill is not zero. You have to match the TTL to your actual traffic or you bleed money on idle storage.
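A quick breakeven check makes the tradeoff explicit. All prices below are placeholder assumptions, not Gemini's actual rates; plug in the current ones from the pricing page. Caching pays off only when the read savings per hour exceed the hourly storage fee:

```python
# Requests per hour needed before explicit caching beats paying full
# input price. Prices here are illustrative assumptions only.
def breakeven_requests_per_hour(tokens,
                                input_price_per_m=1.25,       # assumed $/M input
                                read_fraction=0.25,           # reads at ~25% of input
                                storage_per_m_per_hour=1.00): # assumed $/M/hour
    saved_per_request = tokens / 1e6 * input_price_per_m * (1 - read_fraction)
    storage_per_hour = tokens / 1e6 * storage_per_m_per_hour
    return storage_per_hour / saved_per_request
```

Notice that the token count cancels out: the breakeven is a pure request rate determined by the ratio of storage price to per-token savings. If your feature clears that rate during the hours the cache exists, the long TTL is a win; if not, you are renting storage for nothing.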

I have ended up using Gemini caching for long-running async features where the storage cost is predictable, and staying with Anthropic or OpenAI for interactive features where the five-minute TTL matches real user behavior.


The Hit Rate Trap

The single most common caching bug I see, including in my own code, is a cache hit rate that looks fine on paper but is actually catastrophic.

Here is the trap. You deploy a long cached system prompt. You run a load test. You see cache hits on request 2, request 3, request 4. You declare victory. You ship. Then production traffic arrives and your cache hit rate drops to 30 percent for reasons you did not anticipate.

The three things that usually cause this:

Multi-tenancy you did not account for. If your cached prompt includes anything user-specific, like the user’s name, a workspace ID, or a tenant configuration, the cache is keyed per user. Each user sees cold cache on their first request, and users who do not return within the TTL window never get a warm cache at all. The fix is to separate tenant-specific context from stable system context and only cache the stable part.
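The fix looks like this in practice. Names and field shapes here are illustrative, not from any SDK: the cached system portion is a constant shared by every tenant, and anything tenant-specific rides in the dynamic portion after the cache boundary.

```python
# Shared, cacheable system context: byte-identical for every tenant.
STABLE_SYSTEM = "You are a support assistant. <shared docs go here>"

def build_messages(tenant: dict, user_message: str) -> dict:
    # Tenant-specific context lives AFTER the cache boundary, in the
    # user turn, so it never fragments the cache key.
    tenant_context = f"Workspace: {tenant['workspace_id']}\nPlan: {tenant['plan']}"
    return {
        "system": STABLE_SYSTEM,
        "messages": [
            {"role": "user",
             "content": f"{tenant_context}\n\n{user_message}"},
        ],
    }
```

With this shape, one warm cache serves every tenant instead of each tenant paying for their own cold start.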

TTL shorter than your traffic interval. A five-minute TTL works great when you get a request every 30 seconds. It is useless when you get a request every six minutes. If your traffic is bursty or low volume, you are paying cache write prices on almost every request and cache read prices on almost none. Either switch to a provider with longer TTLs (Gemini’s one-hour or 24-hour options) or accept that caching is not going to help your use case.
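You do not have to guess at this. Replay your real request timestamps against a candidate TTL and count how many requests would have landed on a warm cache; this simple model assumes each request refreshes the TTL, which matches the sliding-window behavior described above.

```python
# Fraction of requests that would hit a warm cache, given request
# timestamps (seconds) and a TTL, assuming each request resets the TTL.
def warm_hit_rate(timestamps, ttl_seconds):
    hits = 0
    last = None
    for t in sorted(timestamps):
        if last is not None and t - last <= ttl_seconds:
            hits += 1
        last = t
    return hits / len(timestamps) if timestamps else 0.0

# A request every 30 seconds keeps a 5-minute cache warm...
steady = warm_hit_rate(list(range(0, 3600, 30)), ttl_seconds=300)
# ...a request every six minutes never does.
sparse = warm_hit_rate(list(range(0, 3600, 360)), ttl_seconds=300)  # 0.0
```

Run this over a day of production logs before choosing a provider and the TTL decision mostly makes itself.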

Silent invalidation from prompt changes. Every code deploy that touches your prompt template invalidates your cache. Every A/B test that changes the system instruction invalidates your cache. Every minor wording tweak that seems harmless invalidates your cache. If you are deploying often, you may be paying cache write costs after every release and never keeping a warm cache long enough to get the read discount.

I now instrument hit rate as a first-class metric. Every AI request logs whether it hit the cache, how many tokens hit the cache, and how many were billed at full price. If the hit rate drops below 80 percent on a feature that should be at 95 percent, I get paged. This sounds paranoid and it has paid for itself twice already.
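The metric itself is cheap to compute. On Anthropic the `usage` block reports cached, written, and uncached input tokens separately (as `cache_read_input_tokens`, `cache_creation_input_tokens`, and `input_tokens`; confirm the field names for your SDK version). One division gives you the number to alert on:

```python
# Token-weighted cache hit rate for a single request, from the usage
# accounting fields in an Anthropic-style response.
def token_hit_rate(usage: dict) -> float:
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + written + uncached
    return read / total if total else 0.0

warm_request = {"cache_read_input_tokens": 180_000,
                "cache_creation_input_tokens": 0,
                "input_tokens": 500}
rate = token_hit_rate(warm_request)  # very close to 1.0
```

Aggregate this per feature, not globally, and wire the aggregate into whatever pages you.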


The Structural Rules That Actually Work

After making every mistake twice, here is the structural pattern I use for any cached prompt.

Stable first, dynamic last. The static portion of the prompt goes at the top. System instructions, tool definitions, shared context, doc sets. The dynamic portion, which includes the user message, per-request state, and anything else that changes, goes at the bottom. The cache boundary lives between them.

No timestamps in the cached portion. If your system prompt includes the current date, the current user, the current anything, it is not cacheable. Move it out. If you genuinely need the date in context, put it in the dynamic portion after the cache boundary.

Strip whitespace carefully. I have lost more hit rate to stray newlines than to any other single cause. When building the cached portion, I now run it through a normalizer that strips trailing whitespace on every line and ensures consistent line endings. The byte-for-byte match requirement is unforgiving.
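A minimal version of that normalizer, applied to the cached portion before every request: normalize line endings to `\n`, strip trailing whitespace per line, and drop any trailing newline, so two builds of the same prompt come out byte-identical.

```python
# Normalize the cached prefix so incidental whitespace differences
# can never break the byte-for-byte prefix match.
def normalize_cached_prefix(text: str) -> str:
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).rstrip("\n")
```

Run it at the single choke point where the prompt is assembled, not ad hoc at call sites, or the normalization itself becomes another source of drift.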

One cache boundary, sometimes two, rarely more. Anthropic lets you set up to four breakpoints, but in practice I almost never use more than two. One for the system prompt and tool definitions, one for a shared document set. More breakpoints means more places where the prefix can miss, and the cognitive overhead of reasoning about which block is warm is not worth it.

Monitor hit rate per feature. Cache hit rate is not a global metric, it is per-feature. Different features have different cache patterns, different TTL needs, and different failure modes. Track them separately.

Pin the cached portion in source control. Treat the cached portion of your prompt like an API contract. Changes to it cost real money in lost cache warming. Require review. Roll out prompt changes with the same care as database migrations.


A Real Example: A Support Bot At 50k Requests Per Day

Let me get concrete with numbers from a support triage bot I have been running since January. The shape of the feature is the same one I described in the RAG vs long context piece. A 180,000 token system prompt with all the support docs, a short per-ticket message, and Claude Opus 4.7 doing the drafting.

Without caching, the cost math is:

  • 180,000 input tokens at $15 per million = $2.70 per request on the system prompt alone
  • 50,000 requests per day = $135,000 per day
  • This number is obviously not real. We would never have shipped this feature at this price.

With five-minute ephemeral caching on Anthropic:

  • First request of a five-minute window pays a cache write at 1.25x = $3.38
  • Subsequent requests in the window pay cache read at 10 percent of input = $0.27
  • Steady traffic maintains a warm cache most of the time
  • Realistic daily spend ends up around $18,000, of which about $2,000 is cache writes and $16,000 is cache reads plus output tokens
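The per-request figures in that list fall out of three multiplications, using the $15 per million input rate assumed throughout this piece:

```python
# Per-request cost of the 180k-token system prompt at $15/M input,
# with a 1.25x cache-write premium and a 10 percent read rate.
PROMPT_TOKENS = 180_000
INPUT_PER_M = 15.00

uncached    = PROMPT_TOKENS / 1e6 * INPUT_PER_M  # $2.70 per request
cache_write = uncached * 1.25                    # $3.375, ~$3.38
cache_read  = uncached * 0.10                    # $0.27 per warm request
```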

That reduction, roughly 87 percent, is what makes the feature viable. The absolute number is still significant, but it is a business cost rather than a business catastrophe.

The last piece worth mentioning is that the hit rate itself is a feature of your traffic pattern. This bot gets requests pretty evenly throughout business hours, which keeps the cache warm. A lower-volume feature with the same prompt would have a much worse ratio of cache writes to cache reads, and the economics would look different. Some low-volume features in the same company are better served by Gemini’s long-TTL caching for exactly this reason.


When Caching Will Not Help You

There are categories where caching is just not the right tool, and pretending otherwise leads to disappointment.

Per-user data that cannot be separated from the system prompt. If your application logic genuinely requires user-specific context at the top of the prompt, caching across users is impossible. You can still cache per user, but only if each user generates enough traffic in a five-minute window to hit the cache meaningfully. Most SaaS apps do not.

Highly dynamic doc sets. If your knowledge base changes multiple times per hour, the cache invalidates faster than it accumulates hits. RAG becomes the better pattern because you can re-index incrementally without invalidating the entire retrieval path.

Short prompts. There is a minimum prompt size below which caching is not worth the overhead. If your total prompt is 2,000 tokens, the savings on a cache hit are measured in fractions of a cent per request, and the engineering complexity of maintaining a cached prefix is not free. Save caching for prompts over 10,000 tokens where the math starts to matter.

Agentic workflows with unpredictable tool calls. Agents that call tools, get results back, and call more tools have highly variable prompt structure. The portion of the prompt that changes depends on which tool was called and what it returned. You can still cache the initial system prompt and tool definitions, but the dynamic middle portion is not cacheable, and you should not plan your cost structure around caching discounts that only apply to the first turn. Observability for AI agents is a more impactful investment for these features.


The Multi-Provider Strategy That Works

One pattern I have landed on after running features across all three providers is to choose caching strategy per feature rather than per company.

For a high-volume user-facing feature with stable context and short response-time requirements, I reach for Anthropic. The five-minute TTL matches interactive traffic patterns, the 90 percent read discount is the best on the market, and the explicit breakpoint model makes it obvious what is being cached.

For a simple feature where the team does not want to think about caching strategy, OpenAI is the right default. It works well enough out of the box, the automatic caching behavior is predictable, and there is nothing to configure.

For batch jobs, evals, or low-volume async features, Gemini’s long-TTL caching is the only one of the three that makes sense. The five-minute TTLs on the other two would expire between requests and the cache would never warm up.

Routing through Vercel AI Gateway or a similar provider abstraction makes this kind of per-feature strategy practical. You keep the same application code and change which provider gets called based on the feature’s caching profile. The alternative is either accepting suboptimal caching on some features or scattering provider-specific code throughout your codebase, and neither one ages well.
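In practice the per-feature strategy reduces to a small routing table living next to the gateway call. Everything here is illustrative, including the feature names; the actual dispatch is whatever your abstraction layer exposes.

```python
# Per-feature caching profiles. The gateway reads the provider from
# here; application code never mentions a provider by name.
CACHING_PROFILES = {
    "support-triage": {"provider": "anthropic",
                       "reason": "steady interactive traffic, 5-min TTL fits"},
    "quick-summary":  {"provider": "openai",
                       "reason": "simple prompt, automatic caching is enough"},
    "nightly-evals":  {"provider": "gemini",
                       "reason": "sparse traffic, needs a long explicit TTL"},
}

def provider_for(feature: str) -> str:
    return CACHING_PROFILES[feature]["provider"]
```

The `reason` field earns its keep six months later, when someone asks why the eval job talks to a different provider than everything else.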


What To Do Monday Morning

If you are shipping AI features and you have not audited your caching, that is the highest-leverage afternoon of work available to you right now. Pull your billing dashboard. Find the top three features by token spend. For each one, check the actual cache hit rate. Not the “we turned caching on” status, the real hit rate.

If the hit rate is above 80 percent, you are in good shape and can stop reading. If it is below 50 percent, you have a bug, not an optimization opportunity. Something in your prompt is invalidating more than it should. Walk through the structural rules above. Almost always it is a timestamp, a user-specific field, or a whitespace mismatch that moved into the cached portion when nobody was paying attention.

For new features, caching should be part of the prompt design from day one, not retrofitted. Decide where the cache boundary lives before you write the first line of the system prompt. Keep the stable portion stable. Move the dynamic portion to the end. Instrument hit rate before you ship. The habits are small individually and they compound into an enormous cost difference at scale.

The providers are all racing to make caching easier, and the 2026 versions are already much better than the 2024 versions. But the structural work of designing a prompt that actually caches well is still on you. A feature with a well-designed cached prompt costs 10 percent of what the same feature costs without one. That gap is not closing. If anything, it is widening as context windows keep growing and more of your bill lives in the input side of the ledger.

Caching is not the sexy part of building AI features. Nobody is going to tweet about your hit rate. But it is the difference between features that earn and features that bleed, and in 2026 that distinction is the whole game.