The "Redundancy Tax": How Prompt Caching & The Rule of 3 Fix AI Margins · Field notes

Imagine you hire a brilliant, world-class consultant to solve a complex engineering problem. They charge you by the word for everything they read and write.

But there’s a catch: this consultant has total short-term memory loss.

Every time you ask a follow-up question—even if it’s just 10 seconds later—they have forgotten who you are, what your company does, and the 50-page technical manual you just gave them. To answer you, they force you to re-upload the entire manual, and they charge you to re-read every single word of it. Again. And again. And again.

This sounds insane. No CTO would sign that contract.

Yet, this is exactly how we have been building LLM applications for the last two years.

This is the "Redundancy Tax." In context-heavy applications (RAG pipelines, document QA, coding assistants), 70–90% of your input tokens can be repetitive, static data that is rebilled at full price on every request.

But the architecture of AI is shifting. With the introduction of Prompt Caching (specifically Anthropic’s implementation), we are moving from a world of "stateless" waste to "stateful" efficiency.

Here is the technical and economic breakdown of why this matters—and the specific math you need to check before you implement it.

The Economics: Token Arbitrage and the "Rule of 3"

Prompt Caching isn't just a discount; it's a fundamental shift in unit economics. It lets the provider keep the already-computed "state" of a prompt prefix on its side, so later requests that start with the same prefix reuse that state instead of paying to recompute it.

The pricing model creates an arbitrage opportunity, but you have to understand the spread (using Claude 3.5 Sonnet pricing):

Standard Input: ~$3.00 / MTok
Cache Write: ~$3.75 / MTok (25% Premium)
Cache Read: ~$0.30 / MTok (90% Discount)

The Real Break-Even Point: Plan for 3 Requests

You might see hype claiming you save money immediately. The math says otherwise: because of the 25% "Write Premium," the first request puts you in the red.

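You need to follow the Rule of 3 (all figures are per MTok of cached prefix):

Request 1 (Write): You pay the premium, $3.75 instead of $3.00. You are losing money compared to a standard call.
Request 2 (Read): You get the discount. On list prices your cumulative spend ($3.75 + $0.30 = $4.05) is already below the uncached $6.00, but only if this request actually hits the cache.
Request 3 (Read): This is the break-even point to plan for. At $4.35 versus $9.00 you are clearly ahead, and you stay ahead even if one of the three requests misses and pays a second write ($7.80).
Request 100: Savings on the cached prefix approach the 90% read discount (about 89% at this point). Across the whole request the figure is lower, for example 83%, depending on your ratio of cached to uncached tokens.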
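The cumulative arithmetic behind this list is easy to check yourself. Here is a minimal sketch that uses only the three list prices quoted above; the function names are illustrative, and it covers the cached prefix only (no output tokens, no uncached suffix).

```python
# Prices in USD per million tokens (MTok), taken from the list above.
BASE_INPUT = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30


def uncached_cost(n_requests: int, prefix_mtok: float = 1.0) -> float:
    """Cumulative prefix cost when every request pays the full input price."""
    return n_requests * BASE_INPUT * prefix_mtok


def cached_cost(n_requests: int, prefix_mtok: float = 1.0, writes: int = 1) -> float:
    """Cumulative prefix cost with `writes` cache writes and the rest cache reads."""
    reads = n_requests - writes
    return (writes * CACHE_WRITE + reads * CACHE_READ) * prefix_mtok


def savings(n_requests: int, writes: int = 1) -> float:
    """Fraction saved on the cached prefix relative to no caching."""
    return 1 - cached_cost(n_requests, writes=writes) / uncached_cost(n_requests)


if __name__ == "__main__":
    for n in (1, 2, 3, 100):
        print(f"{n:>3} requests: cached ${cached_cost(n):7.2f}  "
              f"uncached ${uncached_cost(n):7.2f}  savings {savings(n):6.1%}")
```

Passing `writes=2` models one cache miss along the way. With three requests that costs $7.80 against $9.00 uncached, which is why three is a safer planning number than two.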

The Critical Caveat: This isn't just about total volume; it's about density. The three requests in the Rule of 3 only count if each one arrives within 5 minutes of the previous one, because the cache expires after 5 minutes without a hit. If your traffic is sparse (e.g., 1 request every 10 minutes), you will never hit the cache, and you will pay the write premium every time.
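To see the density effect on your own traffic, you can replay request timestamps against the 5-minute window. The sketch below is a simplified model, not Anthropic's actual cache: it assumes one shared prefix, a timer that restarts on every write or hit, and the list prices above. `simulate` is a hypothetical helper name.

```python
BASE_INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30  # USD per MTok
TTL_SECONDS = 5 * 60  # default cache lifetime; refreshed on every hit


def simulate(arrival_times: list[float], prefix_mtok: float = 1.0) -> dict:
    """Replay request timestamps (in seconds) against a single cached prefix."""
    writes = reads = 0
    expires_at = None  # no cache entry exists before the first request
    for t in sorted(arrival_times):
        if expires_at is not None and t < expires_at:
            reads += 1   # cache hit
        else:
            writes += 1  # miss: the prefix is written again at the premium
        expires_at = t + TTL_SECONDS  # both a write and a hit restart the timer
    cached = (writes * CACHE_WRITE + reads * CACHE_READ) * prefix_mtok
    uncached = (writes + reads) * BASE_INPUT * prefix_mtok
    return {"writes": writes, "reads": reads, "cached": cached, "uncached": uncached}


if __name__ == "__main__":
    print("dense :", simulate([0, 60, 120]))     # one request per minute
    print("sparse:", simulate([0, 600, 1200]))   # one request every 10 minutes
```

In the sparse case every request is a write, so the cached total ends up above the uncached total.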

The Architecture Shift: System 1 vs. System 2

Until now, AI engineering has been a battle for context. You had a limited budget. If you filled the context window with 50 pages of documentation, you couldn't afford to let the model "think" or generate long answers.

Prompt Caching enables a System 1 vs. System 2 hybrid architecture:

System 1 (Cached Context): This is your static knowledge base—your codebase structure, API definitions, or compliance constraints. Because caching makes this layer 90% cheaper (on read), you can afford to load significantly more context.
System 2 (Dynamic Reasoning): This is the user's specific query and the model's fresh answer. This remains expensive, but because you saved so much on System 1, you can afford deeper reasoning loops.
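In request terms, the split looks roughly like this: the static layer goes into a system block marked with `cache_control`, and the per-user query goes into the messages after it. This is a sketch of the payload shape only. `build_request` is a hypothetical helper, and the model ID and `max_tokens` value are placeholder assumptions you should replace with your own.

```python
def build_request(static_context: str, user_query: str,
                  model: str = "claude-3-5-sonnet-20241022") -> dict:
    """Return keyword arguments for an Anthropic Messages API call.

    The static context (System 1) is marked as cacheable; the user query
    (System 2) comes after the cache breakpoint and is never cached.
    """
    return {
        "model": model,            # assumption: swap in the model you actually use
        "max_tokens": 1024,        # assumption: illustrative output budget
        "system": [
            {
                "type": "text",
                "text": static_context,
                # Everything up to and including this block becomes the cached prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_query}],
    }


# Usage with the official SDK would look like this (not executed here):
#   import anthropic
#   client = anthropic.Anthropic()
#   response = client.messages.create(**build_request(docs_text, "How do I paginate?"))
```

Keeping the static text first and the changing text last is the design choice that matters: anything that varies per request must sit after the cached block.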

The Evolution of RAG (Not the Death of It)

You may hear that "RAG is dead." That is an oversimplification.

Micro-RAG is dead: For session-specific contexts (e.g., a user uploads a 50-page PDF for a Q&A session), chopping that document into vector chunks is now inefficient. Just cache the whole document.
Enterprise RAG stays: You cannot cache a 1TB knowledge base. You still need RAG to retrieve the top 20 relevant documents. However, once retrieved, you can cache those results for the duration of the session, making the conversation fluid and cheap.

This also unlocks "Many-Shot Prompting." Instead of the standard 3-5 examples (few-shot), you can provide 20–50 robust examples in the cached block. The extra examples give the model a much tighter pattern to follow for your specific tasks, which can reduce hallucinations, and on cache reads they are billed at the discounted rate rather than at full price.

The Performance Bonus: 50-85% Lower Latency

Cost is what gets the CFO's attention; latency is what gets the engineers excited.

In a standard request, the model performs heavy matrix multiplication to process your input, and the attention step scales as $O(N^2)$ in the input length $N$. It has to "understand" your system prompt from scratch every time. With caching, the model skips the "pre-fill" phase for the cached prefix and retrieves its Key-Value (KV) states directly from memory; only the new, uncached tokens still need to be processed.

For Small Prefixes (<2k tokens): Expect modest gains (10–30%).
For Large Prefixes (5k+ tokens): The gains are massive. Time To First Token (TTFT) can drop by 50–85%. At the top of that range, a coding assistant loaded with a 10k-token library reference can start generating code in roughly 450 ms instead of 3 seconds.
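These ranges are worth measuring on your own prompts rather than taking on faith. A minimal TTFT timer might look like the sketch below. It works on any iterable of text chunks, and `fake_stream` is a stand-in for a real streaming response. With a real client, make sure the request is sent inside the timed section.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(chunks: Iterable[str]) -> Tuple[float, str]:
    """Consume a stream of text chunks; return (seconds until first chunk, full text)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk has arrived
        parts.append(chunk)
    if ttft is None:
        raise ValueError("stream produced no chunks")
    return ttft, "".join(parts)


def fake_stream(prefill_delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response: waits, then yields tokens."""
    time.sleep(prefill_delay)  # simulates the pre-fill phase
    yield "def "
    yield "main():"


if __name__ == "__main__":
    cold, _ = measure_ttft(fake_stream(0.30))
    warm, _ = measure_ttft(fake_stream(0.05))
    print(f"cold TTFT {cold * 1000:.0f} ms, warm TTFT {warm * 1000:.0f} ms")
```

Run it against the same prompt with and without a warm cache, several times each, and compare medians rather than single samples.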

The 4 Constraints: Where Caching Fails

This is not a "turn it on and forget it" feature. If you ignore these constraints, you will degrade performance or increase costs.

1. The Floor (1,024 Tokens)

Anthropic enforces a minimum prefix size.

Claude Sonnet: Minimum 1,024 tokens.
Claude Haiku: Minimum 2,048 tokens.
Behavior: If your system prompt is below this threshold, caching is automatically bypassed. You won't pay the premium, but you also won't get any benefits—it will be processed as a standard request.
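A cheap guard before you enable caching on an endpoint is to compare the measured prefix size against these minimums. The sketch below hard-codes the two thresholds quoted above, so verify them for the model you actually use. The family labels and `is_cacheable` are illustrative names, and `prefix_tokens` should come from a real token count, not a character estimate.

```python
# Minimum cacheable prefix sizes as quoted in this article; verify for your model.
MIN_CACHEABLE_TOKENS = {"sonnet": 1024, "haiku": 2048}


def is_cacheable(prefix_tokens: int, family: str) -> bool:
    """True if a prefix of this size meets the minimum for the model family."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[family.lower()]


def shortfall(prefix_tokens: int, family: str) -> int:
    """How many more tokens the prefix needs before caching applies (0 if none)."""
    return max(0, MIN_CACHEABLE_TOKENS[family.lower()] - prefix_tokens)


if __name__ == "__main__":
    print(is_cacheable(1500, "sonnet"), shortfall(1500, "sonnet"))
    print(is_cacheable(1500, "haiku"), shortfall(1500, "haiku"))
```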

2. The Window (TTL is 5 Minutes)

Anthropic's default Time-To-Live (TTL) is 5 minutes. The timer resets every time you hit the cache.

The Danger Zone: If your traffic is sporadic, with gaps longer than 5 minutes between requests (below about 0.2 RPM when evenly spaced), the cache expires before the next request arrives. You never reach the "Rule of 3" and you pay the Write Premium repeatedly.
The Fix: For sparse traffic or batch jobs, consider Anthropic's 1-hour extended TTL (writes cost roughly 2x the base input price, versus 1.25x for the 5-minute cache) or disable caching entirely for those endpoints.
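The choice between the default window, the extended window, and no caching comes down to the typical gap between requests. Here is a rough steady-state comparison. It assumes evenly spaced traffic, ignores the one-off warm-up write, and takes the 1-hour write price as 2x base input. All names are illustrative.

```python
BASE_INPUT = 3.00               # USD per MTok, from the pricing list earlier
CACHE_READ = 0.30
WRITE_5M = 1.25 * BASE_INPUT    # 5-minute cache write
WRITE_1H = 2.00 * BASE_INPUT    # 1-hour cache write (assumed 2x base input)


def steady_state_cost(gap_minutes: float, ttl_minutes: float, write_price: float) -> float:
    """Per-request prefix cost (USD/MTok) for evenly spaced traffic, after warm-up."""
    # If the next request lands before the TTL runs out it is a read; otherwise
    # every request finds an expired cache and pays for a fresh write.
    return CACHE_READ if gap_minutes < ttl_minutes else write_price


def cheapest_option(gap_minutes: float) -> str:
    """Pick the cheapest of: no caching, 5-minute TTL, 1-hour TTL."""
    options = {
        "no_cache": BASE_INPUT,
        "ttl_5m": steady_state_cost(gap_minutes, 5, WRITE_5M),
        "ttl_1h": steady_state_cost(gap_minutes, 60, WRITE_1H),
    }
    return min(options, key=options.get)


if __name__ == "__main__":
    for gap in (1, 10, 90):
        print(f"gap {gap:>2} min -> {cheapest_option(gap)}")
```

Real traffic is bursty, so treat the result as a starting hypothesis and confirm it against measured hit rates.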

3. The "Space Bar Killer" (Exact Matching)

Caching is cryptographic, not semantic. It relies on a hash of your prompt prefix.

Prompt A: {"role": "system", "content": "You are helpful."}
Prompt B: {"role": "system", "content": "You are helpful. "} (Note the space).
To a human, these are identical. To the cache, they are different keys. If your system prompt contains a timestamp that changes on every request, or JSON whose key order is not stable, you will miss the cache 100% of the time.
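One defensive pattern is to build the cached prefix through a single function that normalises it, and to log a fingerprint of the result so drift shows up in your logs. The sketch below is one way to do that under simple assumptions. `canonical_prefix` and `fingerprint` are hypothetical helpers, and the fingerprint is a local diagnostic, not the key the provider uses.

```python
import hashlib
import json


def canonical_prefix(instructions: str, tools: dict) -> str:
    """Build the static prefix deterministically so identical inputs give identical bytes."""
    # Normalise leading/trailing whitespace that tends to drift between deployments.
    text = instructions.strip()
    # sort_keys and fixed separators make the JSON independent of dict ordering.
    tools_json = json.dumps(tools, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return f"{text}\n\n{tools_json}"


def fingerprint(prefix: str) -> str:
    """Short SHA-256 digest for logging; if it changes, the prefix bytes changed."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    a = canonical_prefix("You are helpful.", {"b": 1, "a": 2})
    b = canonical_prefix("You are helpful. ", {"a": 2, "b": 1})
    print(fingerprint(a), fingerprint(b), a == b)
```

Anything that legitimately changes per request, such as the current time, belongs in the user message after the cached block, not in the prefix.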

4. Invalidation Lag

If you deploy a code change that updates the system prompt, the old cache doesn't magically disappear—it just expires naturally.

Risk: During a rolling deployment in a high-concurrency environment, you may experience a brief "split-brain" period where instances still running the old version keep hitting the old cached prompt while updated instances write and read the new one.
The Fix: Ensure your deployment strategy handles versioning gracefully if you require immediate consistency across all users.

Why Observability Is Non-Negotiable Here

This is where "flying blind" gets dangerous.

If you implement caching but don't track your Cache Hit Rate, you are likely losing money. You need to verify:

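Hit Rate: Is it >66%? That is the hit rate of the Rule of 3 pattern (one write, two reads) and a sensible target. On the list prices above, the strict break-even for the cached prefix is a hit rate of about 22%; at or below that you are definitely underwater.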
Traffic Patterns: Are requests spaced too far apart (violating the 5-minute TTL)?
Drift: Did a stray character break your cache key?
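The first of these checks can be computed from the token counts returned with each response. The sketch below assumes you have logged the three input-side fields from the `usage` object of the Anthropic Messages API (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`). The `Usage` dataclass and `cache_report` are local stand-ins, and prices are the list prices quoted earlier.

```python
from dataclasses import dataclass

BASE_INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30  # USD per MTok


@dataclass
class Usage:
    input_tokens: int                  # uncached input tokens
    cache_creation_input_tokens: int   # tokens written to the cache
    cache_read_input_tokens: int       # tokens served from the cache


def cache_report(usages: list[Usage]) -> dict:
    """Token-level hit rate and input-side cost versus a no-caching baseline."""
    written = sum(u.cache_creation_input_tokens for u in usages)
    read = sum(u.cache_read_input_tokens for u in usages)
    plain = sum(u.input_tokens for u in usages)
    cacheable = written + read
    hit_rate = read / cacheable if cacheable else 0.0
    actual = (plain * BASE_INPUT + written * CACHE_WRITE + read * CACHE_READ) / 1e6
    baseline = (plain + written + read) * BASE_INPUT / 1e6
    savings = 1 - actual / baseline if baseline else 0.0
    return {"hit_rate": hit_rate, "actual_usd": actual,
            "baseline_usd": baseline, "savings": savings}


if __name__ == "__main__":
    log = [Usage(200, 10_000, 0), Usage(200, 0, 10_000), Usage(200, 0, 10_000)]
    print(cache_report(log))
```

A negative `savings` value means the write premium is costing more than the reads are saving, which is the signal to revisit traffic spacing or prefix stability.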

At PromptMetrics, we track the exact economics of your prompt engineering. We visualize your cache efficiency so you can see if you're actually saving the 83% promised, or just paying the premium.

The Redundancy Tax is optional. But only if you do the math.
