The failure pattern: one model call, infinite blast radius
You can spot the brittle AI product fast: every user request hits the “best” model with a long prompt and a prayer. It demoed well in 2024. In 2026 it creates the worst combination: unpredictable output, incidents nobody can replay, and a bill that grows faster than usage.
The AI products people trust—repo-aware coding help, support that quotes the right policy, back-office automation that doesn’t corrupt records—aren’t “one big model.” They’re compound systems: multiple model tiers, retrieval, tool execution, policy gates, traces, and fallback paths wired like any other production service.
Three constraints force this: cost, latency, and risk. Real workflows aren’t “generate text.” They’re sequences: detect intent, assemble context, decide what’s allowed, take actions, then prove what happened. Each step has different error tolerance and a different price/performance ceiling. Running every step through the priciest model makes interactions slow and expensive, and collapsing the whole sequence into one opaque call fails governance because you can’t explain sources, permissions, or why an action was taken.
The teams pulling ahead build a control plane: route each step to the cheapest safe option, ground output in approved data, lock tool access down like an API surface, and measure outcomes like any other business-critical system. That’s why the conversation moved past prompt tweaks to routing, retrieval quality, eval harnesses, and audit trails.
By 2026, the stack looks like a service mesh for “intelligence”
Production AI now looks less like a prompt and more like a mesh: components with clear contracts and measurable behavior. The pattern that keeps winning is separation of concerns: (1) routing, (2) grounding, (3) execution, and (4) verification. Each part is testable, observable, and swappable without rewriting your product.
Routing decides your unit economics
Routing answers one question: what is the cheapest thing that can do this step safely? Sometimes that’s a frontier model. Often it’s not. Mature systems mix providers, multiple model classes, and non-LLM components. Easy wins include deterministic templates for repeatable replies, rules/classifiers for boilerplate, SQL for reporting, and small instruction-tuned models for extraction and formatting. Routing isn’t only technical: SLA, user tier, and blast radius should change the path through the system.
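A minimal sketch of that routing question, under stated assumptions: the tier names, the `Step` record, and the "bump one tier" rule are all hypothetical and chosen only to show how technical and business inputs can share one decision.

```python
from dataclasses import dataclass

# Hypothetical tiers, ordered cheapest first. Names are illustrative only.
TIERS = ["template", "small_model", "mid_model", "frontier_model"]

# Assumed baseline: the cheapest tier believed safe for each kind of step.
BASELINE = {
    "canned_reply": "template",
    "extract": "small_model",
    "draft": "mid_model",
    "plan": "frontier_model",
}


@dataclass(frozen=True)
class Step:
    kind: str          # e.g. "canned_reply", "extract", "draft", "plan"
    blast_radius: str  # "none", "low", or "high"
    user_tier: str     # "free" or "enterprise"


def route(step: Step) -> str:
    """Return the cheapest tier considered safe for this step."""
    # Unknown step kinds fall back to the most capable tier.
    tier = BASELINE.get(step.kind, "frontier_model")
    idx = TIERS.index(tier)
    # Business inputs change the path: a high blast radius or a premium
    # user tier bumps the step up one tier, capped at the top tier.
    if step.blast_radius == "high" or step.user_tier == "enterprise":
        idx = min(idx + 1, len(TIERS) - 1)
    return TIERS[idx]
```

In a real system the baseline table would come from eval results rather than a hand-written dictionary, but the shape of the decision stays the same.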
Retrieval is the contract with reality
RAG stopped being a trick and turned into an interface: what sources are allowed, how freshness is enforced, how permissions are applied, and what gets recorded for traceability. Vector databases are common; the differentiators are hybrid search (keyword + vector), permission-aware indexing, reranking, and structured retrieval from warehouses for customer state and operational metrics. Treat retrieval like data engineering: lineage, access control, and explicit service expectations.
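To make "permission-aware hybrid search" concrete, here is a toy sketch. It assumes each indexed document carries its ACL and a precomputed `vector_score` (a stand-in for whatever similarity a real vector index returns); the `Doc` record, the blending weight `alpha`, and the keyword scorer are illustrative, not any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset  # ACL stored alongside the indexed document
    vector_score: float        # stand-in for a similarity score from a vector index


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the document text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def retrieve(query, docs, user_groups, k=2, alpha=0.5):
    """Filter by ACL first, then rank by a blended keyword + vector score."""
    # Documents the user may not see are dropped before any ranking happens.
    permitted = [d for d in docs if d.allowed_groups & user_groups]
    scored = [
        (alpha * keyword_score(query, d.text) + (1 - alpha) * d.vector_score, d)
        for d in permitted
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Return ids and scores so the trace can record what the model was shown.
    return [(d.doc_id, round(score, 3)) for score, d in scored[:k]]
```

The ordering matters: filtering before ranking means a forbidden document can never reach the prompt, no matter how well it scores.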
Verification isn’t a “nice to have” anymore. Teams shipping into real workflows run post-generation checks: schema validation, policy filters, citation checks, and judge passes where appropriate. It’s not glamorous. It’s how tool-using systems avoid turning a single bad run into a support escalation, a data incident, or a broken audit trail.
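A post-generation check can be small. The sketch below assumes the model is asked to return JSON with an `answer` string and a `citations` list of source ids; that output contract and the `verify` function name are assumptions made for illustration.

```python
import json

# Assumed output contract: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list}


def verify(raw_output: str, retrieved_ids: set) -> list:
    """Return a list of failure reasons; an empty list means the output passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    # Deterministic schema check before anything else.
    failures = []
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            failures.append(f"field '{name}' missing or not {expected.__name__}")
    if failures:
        return failures

    if not data["citations"]:
        failures.append("no citations provided")
    # Every cited source must be one the retriever actually returned.
    unknown = [
        c for c in data["citations"]
        if not isinstance(c, str) or c not in retrieved_ids
    ]
    if unknown:
        failures.append(f"cites sources that were not retrieved: {unknown}")
    return failures
```

This only proves that citations point at retrieved sources, not that the sources support the claim; that second question is where a judge pass may fit.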
Table 1: A practical comparison of compound AI deployment patterns (guidance only; outcomes depend on providers, context size, caching, retrieval quality, and tool latency).
| Approach | Typical p95 latency | Typical cost per 1k tasks | Best for |
|---|---|---|---|
| Single frontier model for all steps | High | High | Demos, early prototypes, unclear workflows |
| Router + 2–3 model tiers | Medium | Medium | Scaled SaaS flows with repeatable steps and clear SLAs |
| RAG + mid-tier model + verifier | Medium | Low-to-medium | Policy-bound knowledge work (support, IT, HR) |
| Agent with tools + sandbox + audits | Variable | Variable | High-value operations with real side effects and approvals |
| Cache + deterministic fallbacks + selective LLM | Low | Low | High-throughput experiences (search, routing, summarization) |
Agents ship in production—only the constrained ones survive
“Agent” used to mean a flashy loop that keeps calling tools until it stops. In production, the pattern that lasts is boring on purpose: a bounded worker in a narrow domain with explicit tools, explicit permissions, and a runbook. The versions that hold up look like: a refund worker restricted to certain cases, an on-call helper that drafts remediation steps but can’t deploy, a sales-ops assistant that prepares quotes under approved pricing rules and routes for sign-off.
Unbounded autonomy isn’t ambition. It’s a machine for generating incidents. The moment a system can email customers, mutate CRM fields, or touch infrastructure, you need the same discipline you apply to CI/CD: timeouts, retries, idempotency, state, and approvals where the blast radius is real. Strong implementations resemble a workflow engine (Temporal is a common pick) paired with a planner and a policy gate that can block or require confirmation on specific steps.
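Two of those disciplines, idempotency and approvals, fit in a few lines. This is a sketch of a hypothetical `RefundWorker`, with an in-memory dictionary standing in for the durable store a workflow engine would provide; the limit and field names are invented for the example.

```python
import hashlib
import json


class RefundWorker:
    """Hypothetical bounded worker: one action, a hard limit, idempotent execution."""

    def __init__(self, auto_approve_limit: float):
        self.auto_approve_limit = auto_approve_limit
        # Idempotency key -> stored result (stand-in for a durable store).
        self._completed = {}

    @staticmethod
    def idempotency_key(order_id: str, amount: float) -> str:
        payload = json.dumps({"order_id": order_id, "amount": amount}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def refund(self, order_id: str, amount: float, approved_by=None) -> dict:
        key = self.idempotency_key(order_id, amount)
        # A retry of the same request returns the stored result
        # instead of issuing a second refund.
        if key in self._completed:
            return self._completed[key]
        # Policy gate: amounts above the limit need a named human approver.
        if amount > self.auto_approve_limit and approved_by is None:
            return {"status": "needs_approval", "order_id": order_id}
        result = {
            "status": "refunded",
            "order_id": order_id,
            "amount": amount,
            "approved_by": approved_by,
        }
        self._completed[key] = result
        return result
```

Note that a "needs approval" outcome is deliberately not stored, so the same request can succeed later once a human signs off.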
“You can’t manage what you can’t measure.” (a maxim widely attributed to Peter Drucker)
The hard part isn’t the LLM. The hard part is operations: pause a run, replay it, explain it, and recover cleanly. If you can’t do that, you don’t have an agent—you have an outage waiting for a busy day.
Evaluation is how you stop arguing and start controlling behavior
“It feels better” stopped being a release standard. Serious teams treat evaluation as the control surface: offline test sets, online monitoring, and explicit mapping to business outcomes. That’s why LLM observability and eval tooling matured quickly—Datadog has expanded into this area, and Arize and Weights & Biases are common choices for tracing and evaluation workflows.
Grade the chain, not the prose
Teams that ship safely don’t score only “did the answer read well.” They track: usefulness/correctness, grounding quality (did retrieval return relevant permitted sources and are citations accurate), tool safety (blocked actions, malformed calls, attempted violations), and business outcomes (resolution, handle time, acceptance, escalation). This forces real tradeoffs into daylight: if you made it faster but increased wrong actions, you didn’t improve the product—you redistributed damage.
Your offline eval set should resemble production traffic. That means sampling real interactions (with consent), redacting sensitive data, labeling failure modes, and refreshing regularly so it doesn’t fossilize. Model providers ship frequent updates; if you can’t rerun evals on demand, you can’t detect behavior changes before customers do.
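"Rerun evals on demand" can start as something this small. The sketch assumes exact-match scoring and a hypothetical `Case` record with a reviewer-assigned failure label; real harnesses usually need fuzzier scoring, so treat the comparison as a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str
    failure_mode: str  # label assigned during review, e.g. "wrong_policy"


def run_eval(system, cases):
    """Return the pass rate and a count of failures per labeled failure mode."""
    if not cases:
        raise ValueError("eval set is empty")
    failures = {}
    passed = 0
    for case in cases:
        # Exact match after normalization is an assumption of this sketch.
        if system(case.prompt).strip().lower() == case.expected.lower():
            passed += 1
        else:
            failures[case.failure_mode] = failures.get(case.failure_mode, 0) + 1
    return passed / len(cases), failures


def gate(candidate_rate: float, baseline_rate: float, tolerance: float = 0.02) -> bool:
    """Allow a release only if the candidate does not regress beyond the tolerance."""
    return candidate_rate >= baseline_rate - tolerance
```

Running the same `cases` against the current and candidate configurations, then calling `gate`, gives a release check that does not depend on anyone's impression of the output.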
Key Takeaway
Evaluation is the steering wheel for compound AI. It’s how you route across model tiers, swap providers, and widen agent permissions without turning production into a live gamble.
Cost control comes from boring mechanics: caching, context limits, and tiering
AI spend compounds because the work is a pipeline, not a single call: intent detection, rewriting, embeddings, retrieval, reranking, generation, verification, retries, and sometimes escalation. If you only debate token price, you miss the knobs that dominate the bill.
Three moves keep paying off:
- Caching: many products see the same intents and the same questions on repeat; a semantic cache plus deterministic fallbacks can remove whole categories of calls.
- Context discipline: dumping giant blobs into prompts is expensive and often makes answers worse by drowning the model in noise; retrieval should be selective, deduped, and reranked so the model sees what it needs.
- Model tiering: classification, extraction, routing, and formatting belong on cheap fast models; drafting can sit on a mid-tier; verification can be small and strict; escalation should be earned by low confidence, higher stakes, or explicit user tier.
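The caching move is the easiest to prototype. The sketch below is a simplified stand-in for a semantic cache: it keys on normalized text rather than embeddings, so it only catches trivial rewordings. The class name and the `compute` callback are hypothetical.

```python
import re


class NormalizedCache:
    """Stand-in for a semantic cache: keys are normalized text, not embeddings."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase and drop punctuation so trivial rewordings share a key.
        return " ".join(re.findall(r"[a-z0-9]+", question.lower()))

    def get_or_compute(self, question: str, compute):
        key = self._key(question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = compute(question)  # the expensive model call goes here
        self._store[key] = answer
        return answer
```

Tracking `hits` and `misses` matters as much as the cache itself: the hit rate tells you how much of the bill is repeat traffic.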
Tool access is the real security problem
Prompt injection is real, but authority is the bigger problem. If the system can call internal APIs, read customer records, or trigger payments, every user message and every retrieved document becomes a potential control input to something privileged. Treat tool access like IAM: least privilege, scoped credentials, narrow endpoints, and approvals for sensitive steps.
Defense in depth beats “one clever system prompt.” Put layers between text and side effects: allowlisted retrieval sources, permission-aware indexing, strict tool schemas, server-side validation, policy engines, and verifiers. Don’t hand an agent a generic update_customer_record and hope it behaves. Expose small, purpose-built endpoints with hard parameter validation and rate limits. Log tool calls with correlation IDs. Store inputs/outputs with redaction and retention rules that match contracts.
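As an illustration of a small, purpose-built endpoint, here is a sketch of a hypothetical `change_plan` tool. The scope string, the `cus_` id format, and the plan allowlist are assumptions for the example; the point is that every argument is treated as untrusted and every call gets a correlation ID.

```python
import logging
import uuid

logger = logging.getLogger("tool_calls")

ALLOWED_PLANS = {"basic", "pro"}  # assumed allowlist for this one endpoint


def change_plan(customer_id, new_plan, caller_scopes) -> dict:
    """Purpose-built tool: changes one field and validates everything server-side."""
    correlation_id = str(uuid.uuid4())

    def deny(reason: str) -> dict:
        logger.warning("change_plan denied id=%s reason=%s", correlation_id, reason)
        return {"ok": False, "error": reason, "correlation_id": correlation_id}

    # Least privilege: the caller's credential must carry this specific scope.
    if "billing:change_plan" not in caller_scopes:
        return deny("missing_scope")
    # Hard parameter validation: model-produced arguments are untrusted input.
    if not (
        isinstance(customer_id, str)
        and customer_id.startswith("cus_")
        and customer_id[4:].isalnum()
    ):
        return deny("invalid_customer_id")
    if not isinstance(new_plan, str) or new_plan not in ALLOWED_PLANS:
        return deny("invalid_plan")

    logger.info(
        "change_plan ok id=%s customer=%s plan=%s", correlation_id, customer_id, new_plan
    )
    return {"ok": True, "correlation_id": correlation_id}
```

Compared with a generic record-update tool, the attack surface here is two parameters and one allowlist, which is something a reviewer can actually reason about.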
Regulators and procurement teams are converging on the same requirement: prove what the system did, what it used, and who it was allowed to act for. That pressure shapes architecture: tenant-level routing (by region or provider), explicit retention windows, and deletion workflows that remove data from logs, vector stores, and evaluation sets.
- Treat tools like production APIs (narrow endpoints, scoped credentials, server-side validation).
- Make retrieval permission-aware with document ACLs and strict allowlists for sensitive workflows.
- Stack guardrails: policy engine + verifier + deterministic schema checks (prompts aren’t enforcement).
- Design for audits with correlation IDs, redaction, and retention that matches contracts.
- Run prompt-injection drills like incident exercises: scheduled, documented, and repeated.
Table 2: A decision framework for selecting a compound pattern (use it in architecture reviews).
| Use case | Recommended pattern | Primary KPI | Guardrail to require |
|---|---|---|---|
| Customer support deflection | Permissioned RAG + verification + escalation | Resolution quality | Citation checks + human handoff |
| Internal IT/HR assistant | Hybrid search + ACL-enforced retrieval | Time to correct answer | Access control + redaction |
| Sales ops (quotes, CRM updates) | Agent + tool sandbox + approvals | Cycle time | Step approvals + audit trail |
| Data analysis for operators | Text-to-SQL + constrained executor | Query correctness | Read-only access + row-level security |
| Developer productivity tools | Context builder + model tiering + continuous eval | Acceptance rate | Repo permissioning + secret scanning |
A baseline architecture that holds: plan → execute → verify
If you want a compound pattern that scales without turning into a science project, build a simple state machine: “plan → execute → verify.” Vendor choices are secondary. Many teams pair a workflow engine (Temporal), an agent framework (LangGraph is one option), a vector store (Pinecone, Weaviate), and observability (Datadog, Arize). The durable part is the separation of responsibilities: planning proposes actions, execution runs tools behind enforcement, verification approves or blocks outcomes, and only then do you commit side effects.
In practice: the planner outputs a structured plan with tool calls. The executor runs each step behind server-side controls (timeouts, parameter validation, permission checks). The verifier checks grounding, citations, and policy alignment. Side effects (sending email, writing to CRM) happen after verification, not during free-form generation. That one rule prevents a long list of ugly failure modes.
    # Pseudocode: plan → execute → verify loop (state-machine friendly).
    # LLM, tools, policy, Verifier, escalate, and commit_side_effects are placeholders.
    def handle(user_request):
        plan = LLM.plan(user_request, tools_schema, policy)
        results = []
        for step in plan.steps:
            # Enforce policy before every tool call, not once per plan.
            if not policy.allows(step.tool, step.args):
                return escalate("Policy blocked", step)
            # Calls here should be read-only or staged; real side effects wait for commit.
            out = tools.call(step.tool, step.args, timeout=5)
            results.append({"step": step, "out": out})
        final = LLM.compose(user_request, results, citations=True)
        verdict = Verifier.check(final, results, policy)
        if verdict.passed:
            commit_side_effects(results)
            return final
        return escalate(verdict.reason, final)

The sketch is easy. The work is everything around it: replayable traces, redaction, eval sets that reflect real failure modes, and dashboards that tell you whether the system is safer, or just busier.
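The "commit after verification" rule is the part most worth making concrete. A minimal sketch, assuming a hypothetical `StagedRun` object and a `sinks` mapping from action name to the function that performs the real side effect:

```python
from dataclasses import dataclass, field


@dataclass
class StagedRun:
    """Collects proposed side effects and applies them only after verification."""

    staged: list = field(default_factory=list)
    committed: list = field(default_factory=list)

    def stage(self, action: str, payload: dict) -> None:
        # Execution records what it wants to do; nothing leaves the system yet.
        self.staged.append((action, payload))

    def commit(self, verdict_passed: bool, sinks: dict) -> bool:
        if not verdict_passed:
            # A blocked run discards its staged actions without applying them.
            self.staged.clear()
            return False
        for action, payload in self.staged:
            sinks[action](payload)  # e.g. send the email, write the CRM field
            self.committed.append(action)
        self.staged.clear()
        return True
```

This sketch does not handle a sink failing halfway through a commit; in practice that is where a workflow engine's retries and compensation steps come in.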
The question to ask before shipping any “agent”
Stop asking whether the model is smart. Ask whether the system is controllable. Can you answer, for any run: what it read, what it tried to do, what it actually did, and why it was allowed?
Pick one workflow with a crisp “done” state and draw it as route → retrieve → generate → verify → commit. For each arrow, write down three items: what gets logged, what gets blocked, and how rollback works. If you can’t fill those in, don’t ship an agent. Ship a sandbox that produces traces until you can.
- Choose a single workflow with one business KPI and one safety KPI.
- Draw the graph and highlight every point where side effects can occur.
- Instrument each step with correlation IDs, redaction, and retention rules.
- Define escalation paths for low-confidence output and sensitive actions.
- Gate changes with evals so model and prompt updates don’t surprise you in production.
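One way to check whether a run is answerable is to give those four questions a data structure. The `RunTrace` record below is a hypothetical shape, not a standard; it omits redaction and retention, which would wrap `to_json` in a real system.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class RunTrace:
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    read: list = field(default_factory=list)       # source ids the run retrieved
    attempted: list = field(default_factory=list)  # tool calls the planner proposed
    executed: list = field(default_factory=list)   # tool calls that actually ran
    decisions: list = field(default_factory=list)  # why each call was allowed or blocked

    def record_call(self, tool: str, allowed: bool, reason: str) -> None:
        self.attempted.append(tool)
        if allowed:
            self.executed.append(tool)
        self.decisions.append(
            {"tool": tool, "allowed": allowed, "reason": reason, "ts": time.time()}
        )

    def to_json(self) -> str:
        # One serializable record per run keeps replay and audit simple.
        return json.dumps(asdict(self), sort_keys=True)
```

If a workflow cannot populate all four lists for every run, that gap is the first thing to fix before widening permissions.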