The CTO’s Guide to Token Budgets: How to Set Per-Feature Limits & Prevent Shock Bills · Field notes

It's the email every AI-focused CTO dreads.

It lands in your inbox on the 2nd of the month. Subject line: Invoice Available. You open it, expecting the usual €12,000 for your LLM usage.

Instead, you see €45,000.

Your stomach drops. You check Slack. No major outages. No massive user spikes. You text your Lead Engineer. They don't know what happened either.

Somewhere in your infrastructure, an agent entered a loop. Or maybe a prompt change in the "experimental" feature pushed context windows to the max for 10,000 users.

You do not know which feature caused it. You have no idea who to blame. And now you have to explain to your CFO why you just burned a junior engineer's salary in tokens over a single weekend.

This is the "Black Box" problem. And if you are shipping agents without per-feature token budgets, you aren't managing infrastructure. You're gambling.

The Trap of "Aggregate" Thinking

Most engineering teams treat LLM spend like AWS bills: a significant, aggregate monthly operational expense.

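But LLMs are different. In traditional cloud ops, costs scale roughly linearly with traffic. In the world of agents and reasoning loops, costs scale with complexity: every extra tool call or reasoning step resends a growing context, so spend can balloon far faster than traffic does.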

A single "Chain of Thought" agent getting stuck in a tool-calling loop can burn through $50 in minutes. If that agent is deployed to 1,000 users, you have a financial catastrophe on your hands.

To fix this, stop looking at your "Total Monthly Spend." You need to start thinking in Unit Economics per Feature.

Here is the framework we use to help CTOs move from "shock bills" to predictable, engineered costs.

Step 1: Back into Your "Safe Zone"

Before you write a single line of enforcement code, you need a ceiling.

Don't guess. Look at your ARR (Annual Recurring Revenue) or your project budget.

For early-stage, AI-first SaaS companies, a healthy benchmark for LLM infrastructure is 2–5% of revenue. If you are spending 15%, your gross margins are dead on arrival.

Do the math:

If your product generates €500,000/month, your total AI budget cap at the 5% upper bound is €25,000/month.

That is your "Global Hard Limit." Now, you have to allocate it.
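A minimal sketch of that arithmetic, assuming the cap is taken at the 5% upper end of the benchmark. The function name and default share are illustrative, not part of any product:

```python
def global_hard_limit(monthly_revenue_eur: float, llm_share: float = 0.05) -> float:
    """Return the monthly LLM budget cap as a share of monthly revenue."""
    if not 0 < llm_share <= 1:
        raise ValueError("llm_share must be a fraction between 0 and 1")
    return monthly_revenue_eur * llm_share


# Revenue figure from the example above; 2% and 5% bracket the benchmark range.
cap_high = global_hard_limit(500_000)        # 5% of revenue
cap_low = global_hard_limit(500_000, 0.02)   # 2% of revenue
print(f"Cap at 5%: €{cap_high:,.0f}; cap at 2%: €{cap_low:,.0f}")
```

Running both ends of the range shows how much room the benchmark leaves: the same revenue supports anywhere from €10,000 to €25,000 a month.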

Step 2: Segment Budgets by "Product Surface"

This is where most teams fail. They set a budget for "OpenAI" or "Anthropic."

But your users don't interact with "Anthropic." They interact with:

The Code Assistant (High value, long context).
The Support Chatbot (Medium value, RAG-heavy).
The Internal Data Analyst (Internal use, high risk of loops).

You must assign a specific financial envelope to each surface based on the value it delivers.

The "High-Value" Surface:

Your Code Assistant drives retention. You tolerate higher spending here.

Model: Claude 3.5 Sonnet / GPT-4o.
Budget: €15,000/month.
Limit strategy: Soft limits. We want this to work, even if it gets expensive.

The "Low-Value" Surface:

Your generic SEO blog generator or FAQ bot.

Model: GPT-4o-mini / Claude Haiku.
Budget: €2,000/month.
Limit strategy: Hard kill switch. If this spikes, shut it down. It's not worth the overage.
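One way to keep these envelopes honest is to store them as data and check them against the global cap. The sketch below is an assumption about how you might model it: the class, the surface names, and the €25,000 cap (carried over from Step 1) are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceBudget:
    name: str
    monthly_budget_eur: float
    hard_stop: bool  # True = kill switch on overage, False = soft limits only


GLOBAL_HARD_LIMIT_EUR = 25_000  # assumed: the cap derived in Step 1

BUDGETS = [
    SurfaceBudget("code_assistant", 15_000, hard_stop=False),  # high-value surface
    SurfaceBudget("faq_bot", 2_000, hard_stop=True),           # low-value surface
]


def unallocated(budgets: list[SurfaceBudget], global_limit: float) -> float:
    """Return the headroom left under the global cap, or fail if over-allocated."""
    allocated = sum(b.monthly_budget_eur for b in budgets)
    if allocated > global_limit:
        raise ValueError(f"Allocated €{allocated:,.0f} exceeds cap €{global_limit:,.0f}")
    return global_limit - allocated


remaining = unallocated(BUDGETS, GLOBAL_HARD_LIMIT_EUR)
print(f"Unallocated headroom: €{remaining:,.0f}")
```

Failing loudly when allocations exceed the cap means a new surface cannot be budgeted without taking money from an existing one.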

Step 3: Enforce Limits at the Gateway Layer

Policy without enforcement is just a suggestion. You need technical guardrails to stop the bleeding before the invoice is generated.

You cannot rely on the model providers for this. Provider-side limits, such as OpenAI's, generally apply at the account or project level, not per product feature; they won't stop your "Search Feature" from eating your "Support Feature's" lunch.

You need an observability layer—a gateway—between your code and the LLM.
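At its simplest, the gateway's accounting job is to tag every call with a feature name and convert token counts into money. The sketch below is hypothetical: the model names and per-token prices are placeholders, not real provider pricing, and `FeatureLedger` is not a PromptMetrics API.

```python
from collections import defaultdict

# Placeholder prices in EUR per 1M tokens as (input, output). NOT real pricing.
PRICES_EUR_PER_MTOK = {
    "large-model": (3.0, 15.0),
    "small-model": (0.2, 0.8),
}


class FeatureLedger:
    """Accumulates LLM spend per feature tag."""

    def __init__(self) -> None:
        self.spend_eur: dict[str, float] = defaultdict(float)

    def record(self, feature: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        """Add one call's cost to the feature's running total and return it."""
        in_price, out_price = PRICES_EUR_PER_MTOK[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.spend_eur[feature] += cost
        return cost


ledger = FeatureLedger()
ledger.record("support_bot", "small-model", input_tokens=2_000, output_tokens=500)
ledger.record("search", "large-model", input_tokens=10_000, output_tokens=1_000)
for feature, spend in ledger.spend_eur.items():
    print(f"{feature}: €{spend:.4f}")
```

Once spend is keyed by feature rather than by provider account, every later control in this guide has something to measure against.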

The "Traffic Light" System

At the gateway level (where PromptMetrics sits), you should configure three tiers of defense for every feature:

The Soft Alert (70% of Budget):
When the "Support Bot" hits €1,400 of its €2,000 budget, the Tech Lead and PM get a Slack notification. No action is taken yet, but the feature is now being watched.
The Rate Throttle (90% of Budget):
Traffic is throttled. Consider also switching the model routing dynamically from GPT-4o to GPT-4o-mini to stretch the remaining budget. You degrade quality slightly to preserve uptime.
The Hard Stop (110% of Budget):
The circuit breaker trips. The gateway returns a cached response or a static "We are experiencing high load" message instead of calling the model. This prevents the €45k surprise.
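The three tiers reduce to a threshold check on spend divided by budget. This is a minimal sketch using the 70%, 90%, and 110% thresholds above. The enum and function names are illustrative, and a real gateway would attach actions (notify, throttle, block) to each tier.

```python
from enum import Enum


class Tier(Enum):
    GREEN = "ok"
    SOFT_ALERT = "soft_alert"   # notify Tech Lead and PM
    THROTTLE = "throttle"       # slow traffic, optionally route to a cheaper model
    HARD_STOP = "hard_stop"     # trip the circuit breaker


def classify(spend_eur: float, budget_eur: float) -> Tier:
    """Map a feature's month-to-date spend onto a traffic-light tier."""
    if budget_eur <= 0:
        raise ValueError("budget_eur must be positive")
    ratio = spend_eur / budget_eur
    if ratio >= 1.10:
        return Tier.HARD_STOP
    if ratio >= 0.90:
        return Tier.THROTTLE
    if ratio >= 0.70:
        return Tier.SOFT_ALERT
    return Tier.GREEN


# Support Bot example: €2,000 monthly budget.
for spend in (1_000, 1_400, 1_800, 2_200):
    print(f"€{spend:,} -> {classify(spend, 2_000).value}")
```

Note that with a 110% hard stop, a feature can overshoot its budget by up to 10% before it is cut off, so the throttle tier stays active between 90% and 110%.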

Step 4: The "Staging" Budget (Where You Save Millions)

The most dangerous code is the code you haven't shipped yet.

We see this constantly: A developer runs a test script on Friday afternoon. They iterate over a dataset of 5,000 prompts using a new, unoptimized system prompt.

They go home. The script keeps running.

By Monday, they've burned €5,000 in a development environment.

The Rule: Your Staging and Dev environments need the strictest budgets of all.

Staging Cap: €50/day per engineer.
Enforcement: Hard stop.

If an engineer needs to run a massive eval, they must request a temporary increase in the limit. This adds a layer of friction, forcing them to double-check their math before hitting "Run."

Governance: The CTO's New Role

As a CTO, your job is shifting. You are no longer just managing uptime; you are managing Token Liquidity.

You need to implement a culture of "Showback."

At the end of every week, your dashboard should show exactly who spent what:

Team Alpha (Search): €400 (Under Budget).
Team Beta (Agents): €1,200 (Over Budget due to loop error).

With this visibility, you aren't the "bad guy" policing spending. You are giving your teams the data they need to own their own P&L.
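The weekly showback report can start as a few lines of code over your per-team spend. In this sketch the spend figures come from the example above, while the weekly budgets (€500 and €1,000) are assumed for illustration:

```python
def showback(rows: list[tuple[str, str, float, float]]) -> list[str]:
    """Format (team, feature, spend_eur, budget_eur) rows as showback lines."""
    lines = []
    for team, feature, spend, budget in rows:
        status = "Under Budget" if spend <= budget else "Over Budget"
        lines.append(f"{team} ({feature}): €{spend:,.0f} ({status})")
    return lines


weekly = [
    ("Team Alpha", "Search", 400, 500),     # budget is an assumed value
    ("Team Beta", "Agents", 1_200, 1_000),  # budget is an assumed value
]
report = showback(weekly)
print("\n".join(report))
```

The cause of an overage, such as a loop error, still has to come from the team. The report only tells you where to ask.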

This is how you achieve the 45:1 ROI we see with top-performing engineering teams. They don't just cut costs; they reallocate waste into innovation.

You wouldn't let an engineer spin up 1,000 EC2 instances without approval. Don't let them spin up infinite token loops without a budget.

The difference between a profitable AI company and a bankrupt one is often just a matter of visibility.

Get a handle on your spend today. Calculate Your AI Waste & Set Your First Budget with PromptMetrics

Expected payback: <14 days. Critical path: Audit current usage → Define per-feature caps → Implement Gateway enforcement.