LLM Vendor Lock-in: Why Switching Costs 10x More Than You Think · Field notes

Why switching LLM providers costs 10x more than your spreadsheet says, and how to break free before it breaks your runway.

You've spent six months optimizing your prompts. They encode your domain knowledge, your edge-case handling, and the kind of behavioral fine-tuning that only comes from production traffic. Right now, they're locked inside a single vendor.

Then your finance team does the math.

Claude 3.5 Sonnet costs $3 per million input tokens and $15 per million output tokens. GPT-4o runs $5/$15. Gemini 1.5 Pro is significantly lower, priced at $1.25–$2.50 for inputs and $10–$15 for outputs. The spreadsheet shows a potential 50% cost reduction. They want to know why you aren't making the change.

So you try. And that's when you discover the trap.

Your prompts don't work on the new model. Not in an obvious way, no errors, no crashes. Worse: they're subtly wrong. Summaries drift. Tone shifts. Safety guardrails get porous.

This isn't a hypothetical. A lead engineer recently shared their nightmare on r/ExperiencedDevs: they spent two weeks building a "semantic conversion layer" to translate prompts between providers. They achieved 85% fidelity, which sounds great until you realize that a 15% gap in production quality is a showstopper.

The "lock-in" isn't about the API. It's about the architecture of modern LLMs working exactly as designed.

The Lock-in Nobody Talks About

When CTOs evaluate LLM vendor risk, they typically think about API compatibility. Swap the endpoint and adjust the request format; done. Tools like LiteLLM and Portkey solved this years ago.

But API lock-in is the shallowest layer. The real cost sits deeper.

There are three types of LLM lock-in:

API Lock-in (Low Risk): Each provider has its own request/response schema. OpenAI uses messages, Anthropic uses content blocks, and Google uses parts. Abstraction libraries handle this. Migration effort: days, not weeks.
Prompt Lock-in (High Risk): A prompt optimized for Claude performs differently on GPT-4, and vice versa. Research from Accenture and UC Santa Cruz quantified this: a prompt optimized for GPT-4 that scored 99.4% on HumanEval dropped to 68.7% when transferred directly to Llama 3 70B. That's a 30-point performance collapse from the same instructions.
Evaluation Lock-in (Critical Risk): The most dangerous and least discussed. Your evaluation suites are tuned to expected outputs from a specific model. Switch providers, and your entire quality framework breaks. Not because the new model is worse, but because your tests assumed the old model's output patterns.

Is This Your Problem? (The Lock-in Checklist)

This might not be your problem. Some teams can switch providers with minimal friction. However, prompt lock-in hits hardest when:

✅ You have 50+ production prompts built over 6–12 months.

✅ Your prompts encode domain-specific logic and edge-case handling.

✅ Output quality is customer-facing and measurable.

✅ You operate under EU data residency requirements (GDPR, AI Act, Schrems II), meaning you can't simply default to a US-only provider if they lose compliance certification.

✅ You're spending $10K+/month on LLM costs (enough to justify switching).

If three or more apply, you're already locked in. The question is whether you manage it or let it manage you.

Why Prompts Don't Port

The API problem was solved. The problem is that your engineering hours are being eaten up, and it's not because your engineers are doing something wrong.

The fundamental issue is architectural. Prompts aren't programming code; they are conditioning mechanisms that guide a model through a high-dimensional probability space.

Every model interprets prompts differently based on its training data, architecture, and fine-tuning. Here's why:

Syntax matters more than you think. OpenAI popularized JSON-based function calling. Anthropic emphasizes XML-tagged structural prompting. Google's Gemini requires distinct safety configurations. These aren't cosmetic differences; they influence the model's attention mechanism. A prompt using Markdown headers for structure might be given lower priority by Claude, leading to instruction drift.
RLHF creates invisible dependencies. Each lab uses a different workforce and set of guidelines to align its models. OpenAI models tend toward conciseness. Anthropic's Claude defaults to thoroughness and explicit reasoning. When your prompt relies on GPT-4's tendency to be brief, moving that prompt to a verbose-by-default model requires rethinking your entire instruction strategy.
"Vibe coding" creates semantic debt. Developers iteratively tweak prompts until the output "feels right" on a specific model. This creates undocumented dependencies on idiosyncrasies. Unlike technical debt, semantic debt does not generate compiler warnings. The failure mode is silent: the system continues to run, but quality degrades.

The Real Cost of Switching

Enterprise AI migration projects average $315,000. For LLM-specific switches, here is what the spreadsheet misses:

The Hidden Migration Ledger

Cost Category	Impact	Visibility
API and compute migration	Token pricing delta + infrastructure	High
Prompt rewriting	Weeks of engineering time per prompt library	Low
Evaluation suite reconstruction	Full regression and validation cycles	Hidden
Quality degradation risk	5-30% output quality drop during transition	Hidden
Delayed feature roadmap	Every migration hour displaces product work	Hidden

Most teams underestimate total migration effort by 2-3x. That 50% cost savings evaporates when you factor in two months of your senior engineer's time, a regression in output quality, and the features you didn't ship.

Breaking Free: A Practical Framework

You don't need a "conversion layer." Translation assumes one canonical prompt. Reality is different: each model benefits from a different prompt architecture.

The winning approach is to embrace that, manage it systematically, and let data drive decisions.

Step 1: Separate intent from implementation.

Define what your prompt needs to accomplish separately from how it's formatted for a specific model. Your intent layer is portable. Your implementation layer is model-specific. Keep both versions.

Step 2: Establish evaluation baselines per provider.

Before considering any migration, run your evaluation suite against 2–3 providers. Not synthetic benchmarks. Your production prompts, edge cases, and scoring against your quality criteria.

Step 3: Maintain parallel prompt versions.

For your highest-value use cases, maintain optimized prompts for at least two providers. The overhead is manageable with proper version control. The insurance value is enormous.

Step 4: Route by capability, not by default.

Once you have parallel prompt versions (Step 3), you can route strategically. Anthropic leads in coding tools (capturing 54% market share, driven significantly by GitHub Copilot's Claude integration). Gemini excels at multimodal and long-context. GPT-4o remains strong for general reasoning. But this only works if you've already invested in optimized prompts for each provider, not if you're trying to apply the same prompt across all providers.

Step 5: Invest in prompt version control.

Multi-model strategies are already mainstream: 37% of enterprises run five or more models in production. But without version control, teams manage prompts in code repositories, spreadsheets, or individual engineers' heads.

There's no way to compare how v3 of a summarization prompt performs on Claude versus GPT-4o versus Gemini. No way to roll back when a model update breaks output quality. No audit trail for compliance teams.

With proper prompt management, the switching-cost problem transforms from a multi-week engineering project into a data query: "Which provider delivers the best quality-per-dollar for this use case, given our constraints?"

Your Prompts Are Your Most Valuable AI Asset

The companies that treat prompt management as infrastructure will have an asymmetric advantage.

When Finance asks about cost optimization, they'll have actual performance-per-dollar data across providers. When Legal raises vendor-concentration risk, they'll already have production-ready alternatives baselined. When Compliance flags EU data residency, it will display prompts already optimized for compliant providers.

You've invested months building prompts that encode real IP domain knowledge, edge-case handling, and behavioral calibration that only comes from production traffic. That IP shouldn't be held hostage by a single vendor's pricing or compliance decisions.

Manage it accordingly.

Building an LLM-powered product and facing pressure to diversify providers? PromptMetrics helps engineering teams version prompts, evaluate performance across providers, and make data-driven decisions about model routing without the conversion layer. Track what matters: prompt performance, cost, and quality across every model you run.