
Context Engineering for AI Agents: Beyond IVR & Flow Builders

Izzy

Learn why most AI agents fail by forcing complex requests into rigid paths and how context engineering offers a better approach.


Every company building "AI agents" with decision trees is building expensive IVR systems with better marketing. Only 8.9% of chatbot interactions actually resolve the customer's stated goal (Parloa, 2026). 67% of customers have hung up on an IVR out of frustration (WifiTalents, 2026). The problem isn't that users hate AI. It's that most "AI" isn't actually reasoning. It's routing.

This piece covers the three eras of customer interaction. You'll see why progressive disclosure beats hardcoded flows. And you'll learn how to build agents that actually reason.

TL;DR:

  • Only 8.9% of chatbot interactions resolve the customer's goal (Parloa, 2026).

  • Flow-based AI systems fail because they force complex requests into rigid if/then paths.

  • Context engineering feeds the model only what it needs, exactly when it needs it.

  • The result: fewer hallucinations, lower token costs, and agents that improve automatically as LLMs get smarter.

The Conventional View: Flow-Based AI Agents (Era 2)

58% of chatbot project failures trace back to wrong-path decisions made in the first 30 days (McKinsey/Forrester via Neontri, 2025). Most teams don't fail because they picked the wrong LLM. They fail because they picked the wrong architecture.

Most current AI agents use predefined decision trees or flowcharts. They understand natural language but route every request through strict if/then logic. It's predictable. Product managers can see every path. Engineers can debug branches. Compliance teams love the audit trail.

This architecture descends directly from IVR phone trees ("Press 1 for billing"). We swapped touch-tone menus for NLP intent classification. The underlying structure never changed. A customer says something. The system maps it to an intent. Then it follows the branch. No intent match? Fallback to a human.

So we traded menus for natural language. But did we actually build something smarter, or just something prettier?

Traditional chatbot platforms, legacy CX vendors, and any team that values "control" over capability push this model hard. And it's not entirely wrong. For simple, single-intent tasks like password resets, order tracking, and status checks, a flow works fine. The problem is that real customer service isn't simple.

If real customer service were simple, would 64% of customers still prefer you didn't use AI?

64% of customers would prefer that companies not use AI for customer service. 53% would consider switching to a competitor because of it (Gartner via California Management Review, 2026). That's not a model quality problem. That's an architecture problem.

CITATION CAPSULE: According to a 2025 McKinsey/Forrester analysis, 58% of chatbot project failures trace back to wrong-path decisions made in the first 30 days of design (Neontri, 2025). This means the architecture choice, not the model choice, is the primary failure mode.

Why Flow-Based Agents Are Wrong

At 128K tokens, hallucination rates nearly triple to 3.19% (arXiv, March 2026). By 200K tokens, no tested model stays below 10% fabrication. Context length is the strongest driver of increased hallucination. Flow-based agents make this worse by design.

You wouldn't load your entire hard drive into RAM. So why are you doing exactly that to your LLM?

Real human requests don't fit into branching trees. They zigzag. A customer might ask about a refund, mention a product defect, and request expedited shipping. All in one sentence. Flow-based systems explode combinatorially. Every multi-intent query becomes an edge case. And edge cases in flow-based systems don't get handled. They get escalated.

Problem 1: Context overload. When you hardcode every rule into the system prompt, the model's context window fills with irrelevant data. Even with perfect retrieval, LLM reasoning degrades 13.9% to 85% as input length increases (EMNLP 2025). Models typically degrade 30–40% before their advertised context limit, with 30%+ accuracy drops for information placed in the middle of long contexts (BenchLM/Zylos Research, 2026). The sheer length of the input itself hurts performance. It doesn't matter how good your retrieval is if you're drowning the model in noise.

Problem 2: Maintenance nightmare. Every new product, policy, or edge case requires new branches. A flow that handles 50 intents needs over 1,200 branches to cover every pair of intents. Triple-intent queries? You're near 20,000. Most teams stop maintaining their flows after launch. The bot slowly rots. 82% of senior leaders say their teams invested in AI for customer service in the last 12 months. Yet only 10% have reached mature deployment (Intercom, 2026). The investment isn't the problem. The architecture is.
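The combinatorics are easy to check (a quick sketch; the 50-intent flow is the illustrative case above):

```python
from math import comb

intents = 50
print(comb(intents, 2))  # 1,225 branches to cover every pair of intents
print(comb(intents, 3))  # 19,600 branches to cover every triple
```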

How many branches can your team maintain before the bot starts rotting?

Problem 3: Reasoning ceiling. Flow-based agents don't reason. They route. They can't handle novel situations because every path must be pre-imagined by a human. When a customer says something the designer didn't predict, the system breaks. Gemini-2.5-Pro achieved only 41.1% accuracy in identifying which step caused a hallucination in multi-step agent trajectories (AgentHallu, January 2026). Even the best models struggle to debug rigid paths. That's not AI. That's a script with delusions of grandeur.

If the designer didn't predict it, how can the flow handle it?

The dirty secret: Most "AI agent" platforms are just visual flow builders with an LLM slapped on top for natural language understanding. The LLM classifies intent. Then the flow takes over. The model never gets to reason about the actual problem. It's a $7,000/mo IVR system.

What the Data Actually Shows

72% of enterprises are already using or testing AI agents (Zapier, 2026). Yet 88% of those agents never reach production (Digital Applied, 2026). The gap between pilot and production isn't a model problem. It's a context problem.

The best-performing AI systems aren't the ones with the most rules. They're the ones with the best context management. Sierra built agents that don't use decision trees at all. They start with minimal instructions. Then they dynamically surface relevant policies, product data, and user history only when the conversation triggers them. Bret Taylor calls this "defense in depth." Multiple supervisor models monitor the agent in real-time, each operating within a tightly scoped context (WSJ).

If the best systems don't use decision trees, why are you still drawing flowcharts?

The mechanism is progressive disclosure. Here's how it works (a minimal code sketch follows the list):

  1. Start with minimal base instructions. Identity, goals, constraints. Not every policy in the company.

  2. Detect conditions. The user mentions product X. They log into their account. They express frustration.

  3. Inject only the relevant context at that moment. Return policies for product X. Account history. Escalation thresholds.

  4. Let the LLM reason freely within the current context boundary. No predefined paths. Just the right inputs at the right time.
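A minimal Python sketch of that loop, assuming simple keyword triggers. The block contents, trigger predicates, and the llm_complete client are hypothetical stand-ins for your own state detection and model API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ContextBlock:
    name: str
    matches: Callable[[str], bool]  # trigger: does this turn need the block?
    content: str                    # the policy or data text to inject

# Base prompt: identity, goals, constraints only (illustrative wording)
BASE_PROMPT = "You are Acme's support agent. Be concise, honest, and cite policy when you use it."

BLOCKS = [
    ContextBlock("refund_policy",
                 lambda msg: "refund" in msg.lower(),
                 "Refund policy: full refund within 30 days with receipt; store credit after."),
    ContextBlock("escalation",
                 lambda msg: any(w in msg.lower() for w in ("frustrated", "angry", "ridiculous")),
                 "Escalation: apologize once, then offer a human handoff."),
]

def build_prompt(user_msg: str) -> str:
    """Assemble this turn's prompt: base instructions plus only the triggered blocks."""
    active = [b.content for b in BLOCKS if b.matches(user_msg)]
    return "\n\n".join([BASE_PROMPT, *active, f"User: {user_msg}"])

# The model then reasons freely inside this boundary; no predefined paths.
# reply = llm_complete(build_prompt("I'm frustrated, I want a refund"))  # llm_complete is your model client
```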

Why dump the entire policy manual into the prompt when the user only asked about one product?

This isn't theoretical. Progressive disclosure of tool schemas reduces token usage by 85–100× compared with static loading (Matthew Kruczek/EY, January 2026). Claude Opus 4's tool-selection accuracy jumped from 49% to 74% with lazy progressive loading (Anthropic via Kruczek, 2026). Even with perfect retrieval, models struggle when evidence is diluted across long contexts (arXiv:2601.02023, January 2026). Progressive disclosure keeps the evidence concentrated. The model gets exactly what it needs. Nothing more.

CITATION CAPSULE: According to a March 2026 arXiv study evaluating 35 open-weight models, hallucination rates at 128K tokens nearly triple those of 32K-token baselines, with no model staying below 10% fabrication at 200K tokens (arXiv:2603.08274v1). This means context length is the single strongest driver of LLM failure, and flow-based architectures force you to maximize it.

Chart: Flow-Based vs. Context-Engineered Agents, relative performance index (lower is better for cost/risk; higher is better for accuracy). Source: synthesized from arXiv:2603.08274v1, EMNLP 2025, and Digital Applied 2026 data.

Watch on YouTube: Sierra co-founder Clay Bavor on Making Customer-Facing AI Agents Delightful

The Better Approach: Context Engineering

AI agents resolve customer issues at $0.62 per conversation, compared with $7.40 per conversation for human agents (McKinsey/Digital Applied, 2026). But those savings collapse when the agent hallucinates, routes incorrectly, or escalates to a human after wasting the user's time. Context engineering is how you keep the savings and lose the failure modes.

Context engineering means treating context as a dynamically managed resource. Not a static dump. The system starts with a minimal base prompt. Then it conditionally injects relevant information based on the conversation state. The model reasons. It doesn't route.

Core principles:

  • Minimal base prompt. Identity, goals, constraints. Not every return policy in the company.

  • Conditional context injection. Surface data based on triggers, not predefined paths.

  • Let the LLM reason. Give the model the right inputs. Then trust it to handle novel situations.

  • Future-proofing. As underlying LLMs improve, context-engineered agents automatically get smarter. Flow-based agents stay exactly as dumb as the day they shipped.

Your LLM provider just shipped a smarter model. How many branches do you need to rewrite to take advantage of it?

Our finding: When we shifted from monolithic system prompts to conditional context blocks, our multi-intent query accuracy jumped significantly. More importantly, our maintenance load dropped. We weren't rebuilding branches every time the product team added a feature. We were adding context blocks.

Watch on YouTube: Bret Taylor on the Future of Company-Branded AI Agents

The tooling is getting better, too. Sierra provides an AI assistant called Ghostwriter, a visual UI, and a developer SDK to structure this context automatically. But you don't need their platform to apply the principles. You need three things: a way to detect conversation state, a way to retrieve relevant context, and a way to inject it cleanly into the prompt.
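To make "inject it cleanly" concrete, one common convention is to wrap each retrieved block in labeled delimiters so the model can tell standing instructions apart from injected data. A sketch; the tag format is an assumption, not any vendor's requirement:

```python
def inject(base_prompt: str, blocks: dict[str, str]) -> str:
    """Wrap each retrieved block in labeled delimiters so injected data
    stays clearly separated from the agent's standing instructions."""
    sections = [f'<context name="{name}">\n{text}\n</context>'
                for name, text in blocks.items()]
    return "\n\n".join([base_prompt, *sections])

print(inject("You are Acme's support agent.",
             {"refund_policy": "Full refund within 30 days with receipt."}))
```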

CITATION CAPSULE: A 2026 McKinsey analysis found that AI agents resolve customer issues at $0.62 per conversation, compared with $7.40 for human agents, but savings collapse when agents hallucinate or escalate (Digital Applied, 2026). Context engineering is the mechanism that preserves those savings while eliminating the failure modes.

How to Apply Context Engineering

Even with perfect retrieval, LLM reasoning degrades 13.9% to 85% as input length increases (EMNLP 2025). The fix isn't a bigger model. It's a smaller, smarter context.

Immediate action: Audit your current agent's system prompt. If it's over 2,000 tokens, you're probably doing it wrong. Most of that bloat is irrelevant for any single conversation.
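To run that audit, count tokens with a real tokenizer instead of eyeballing characters. A sketch using the tiktoken library; the cl100k_base encoding and the system_prompt.txt path are assumptions, so substitute your model's encoding and wherever your prompt actually lives:

```python
import tiktoken  # pip install tiktoken

def audit_prompt(system_prompt: str, budget: int = 2000) -> int:
    enc = tiktoken.get_encoding("cl100k_base")  # swap in your model's encoding
    n = len(enc.encode(system_prompt))
    verdict = "OK" if n <= budget else "bloated: move policies into conditional blocks"
    print(f"{n} tokens -> {verdict}")
    return n

audit_prompt(open("system_prompt.txt").read())  # wherever your prompt lives
```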

When was the last time a user actually followed your script exactly?

Step 1 (5 minutes): List the 5 most common conversation triggers for your agent. Examples include "user mentions a specific product," "user asks for a refund," or "user provides an order number." Separate them from the base prompt.

Step 2 (30 minutes): Build conditional context blocks. These are text chunks that load only when a trigger is detected. A refund block. A product-spec block. An escalation block. Keep each block under 500 tokens.
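A sketch of the registration step, enforcing the 500-token budget up front so oversized blocks fail fast (the block texts are illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
blocks: dict[str, str] = {}

def register_block(name: str, text: str, budget: int = 500) -> None:
    """Register a conditional context block, rejecting any over the token budget."""
    n = len(enc.encode(text))
    if n > budget:
        raise ValueError(f"block '{name}' is {n} tokens; trim or split it (budget {budget})")
    blocks[name] = text

register_block("refund", "Refund policy: full refund within 30 days with receipt; store credit after.")
register_block("escalation", "If the user asks for a human twice, hand off immediately.")
```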

Step 3 (ongoing): Measure two core metrics: token count per request and accuracy on multi-intent queries. Both should improve. Also track hallucination rate, cost per conversation, and escalation rate to human agents.
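A minimal shape for those per-turn measurements, logged as JSON lines you can aggregate later. The field names and file path are assumptions:

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class TurnMetrics:
    prompt_tokens: int      # context size for this turn
    intents_detected: int   # >1 means a multi-intent query
    resolved: bool          # did the agent satisfy the stated goal?
    escalated: bool         # handed off to a human?
    ts: float = field(default_factory=time.time)

def log_turn(metrics: TurnMetrics, path: str = "agent_metrics.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(asdict(metrics)) + "\n")

log_turn(TurnMetrics(prompt_tokens=812, intents_detected=2, resolved=True, escalated=False))
```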

CITATION CAPSULE: Enterprises using AI agents for customer support achieve a 41.2% median deflection rate and a 71% reduction in cost-per-resolution compared to all-human baselines, but 88% of agent projects still fail to reach production due to architectural and context management issues (Digital Applied, 2026; Zapier, 2026).

Chart: Cost per Conversation, AI vs. Human Agents. Source: McKinsey/Digital Applied 2026 analysis.

The Honest Caveats

Only 10% of teams have reached mature deployment where AI is fully integrated into support operations at scale (Intercom, 2026). Context engineering isn't a magic wand. It requires better tooling than most teams currently have. You need a system that evaluates conversation state, retrieves relevant context, and injects it cleanly. That's not a feature of most no-code chatbot builders.

If context engineering is so much better, why isn't everyone doing it already?

Where do flows still work? Simple, single-intent interactions. Password resets. Order tracking. Status checks. If the user's request never deviates from one predictable path, a flow is fine. Don't over-engineer it.

And yes, dynamic context retrieval adds latency. You're making extra calls to decide what context to load. The savings come from accuracy and reduced maintenance. Not always from raw compute. If your LLM provider's API is already slow, this might make it even slower. Measure it.

CITATION CAPSULE: Despite 82% of senior leaders investing in AI for customer service, only 10% have reached mature deployment where AI is fully integrated at scale (Intercom, 2026). This maturity gap exists because most teams lack the tooling to evaluate the conversation state and dynamically inject context.

Frequently Asked Questions

But doesn't flow-based design give me more control?

It gives you the illusion of control. 58% of chatbot failures trace back to wrong-path decisions made in the first 30 days (McKinsey/Forrester via Neontri, 2025). Every branch you add is a branch you'll maintain. Context engineering gives you control through constraints and guardrails. Not pre-mapped paths.

What if I've already invested heavily in flow-based agents?

You don't have to throw anything away. 88% of agent projects never reach production because teams try to rebuild from scratch rather than iterate (Digital Applied, 2026). Start by externalizing your decision-tree logic into context blocks. The same content becomes reusable instead of locked into branches. A refund flow becomes a "refund context block" that loads when a refund intent is detected. The transition is incremental.
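A sketch of that migration. The flow pseudocode and policy text are illustrative; the point is that branch logic becomes declarative content the runtime loads on a trigger:

```python
# Before: the policy is locked inside a flow branch (typical builder pseudocode)
#   IF intent == "refund": say(refund_script) -> ask(order_number) -> ...

# After: the same content, externalized as a block the agent reasons over.
REFUND_BLOCK = """Refund policy:
- Full refund within 30 days with receipt.
- Store credit from day 31 to 90.
- Always collect the order number before filing."""

def blocks_for(message: str) -> list[str]:
    # Load the block on trigger instead of forcing the user down a path.
    return [REFUND_BLOCK] if "refund" in message.lower() else []
```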

How do you respond to vendors who say their visual flow builder is "no-code AI"?

Visual flow builders are no-code IVR, not no-code AI. Only 8.9% of chatbot interactions actually resolve the customer's goal, and the problem isn't the model. It's the architecture (Parloa, 2026). Real AI reasons. If your tool doesn't trust the model to handle novel inputs, you're not building an AI agent. You're building a script. And scripts were already solved in the 1990s with touch-tone menus.

How much does dynamic context retrieval cost in latency?

Dynamic context retrieval adds one extra inference call to evaluate conversation state and load relevant blocks. That adds latency. But it also reduces token usage dramatically. Progressive disclosure of tool schemas reduces token usage by 85–100× compared with static loading (Matthew Kruczek/EY, January 2026). The net effect depends on your retrieval speed. If your LLM API is already slow, measure it.

Can I mix flow-based and context-engineered approaches?

Yes, and most production systems eventually do. Simple, single-intent interactions. Password resets, order tracking. Those don't need reasoning. A flow handles those just fine. Complex, multi-intent conversations benefit from context engineering. 82% of senior leaders invested in AI for customer service, yet only 10% reached a mature deployment because they tried to force a single architecture everywhere (Intercom, 2026). Use flows where they fit. Use context engineering where they don't.

Conclusion: Time for an Industry Shift

The industry is stuck building smarter IVR systems and calling them AI agents. Context engineering is the actual paradigm shift. Stop measuring agent quality by "path coverage." Start measuring it by "how well it handles the conversation I didn't predict."

As LLMs improve, context-engineered agents compound in value. The same minimal base prompt gets smarter as the underlying model improves. Flow-based agents compound in maintenance debt. Every new product launch breaks your branches.

Are you building an agent that reasons, or a script that routes?

The teams that figure this out first will be in the 12% whose agents actually make it to production. Everyone else will be stuck in pilot hell, debugging decision trees while their competitors scale.

Want to see how Sierra and other leaders approach this? Watch Clay Bavor's deep dive on industrial-grade customer-facing AI agents (Sequoia Capital) or Bret Taylor's discussion on the future of company-branded AI agents (WSJ).

