The agent outage isn’t a model bug — it’s your missing circuit breakers
The failure pattern that keeps showing up is boring and expensive: an agent gets into a loop and turns “helpful” into “unstoppable.” It reruns retrieval, repeats the same tool call with slightly different arguments, expands its own prompt, and retries until a timeout… then retries again. The customer sees a spinner. Your internal systems see a burst of first‑party traffic that looks like abuse, except it’s coming from your own product.
Classic cloud ops assumed code paths you could enumerate. Agents don’t cooperate. A single run might touch a ticketing system, an internal docs index, a billing endpoint, a repo, and a chat tool. Each hop carries its own IAM story, rate limits, data classification, and weird edge cases. A missing scope doesn’t just fail; it can provoke the agent into “trying something else” — broader queries, different tools, extra steps — which is exactly how you get boundary violations and spend spikes without a clean stack trace.
Finance has changed the conversation, too. Inference is no longer a curiosity line item; it’s an operating cost with variance driven by behavior. Two systems can ship the same feature and land in completely different places: one predictable, one chaotic. The teams that stop bleeding all end up building the same thing: a control plane between product code and model providers that makes agent behavior observable, budgeted, and auditable.
What “AI control plane” actually means: routing, enforcement, evals, and cost
“Control plane” is an overloaded term. Here’s the only definition that matters: it’s the layer that turns model usage into something you can run like production software. Not “an SDK call,” not “a prompt repo.” A set of services and contracts that decides how a request runs, what it’s allowed to touch, what it costs, and what evidence you keep afterward.
In real systems, that work collapses into four jobs: routing, policy enforcement, evaluation, and cost controls.
Routing: stop marrying one model
Hardwiring a workflow to a single frontier model is a strategic mistake and an operational risk. Model quality shifts, pricing shifts, regional availability shifts, and your customers will ask uncomfortable questions about data handling. Routing makes models swappable: pick by task and risk level, set explicit fallbacks, use small models for extraction and classification, reserve high-end models for the narrow cases that earn them.
People implement routing through cloud gateways (Amazon Bedrock, Google Vertex AI, Azure OpenAI), direct provider APIs (OpenAI, Anthropic), and orchestration layers (LangGraph, LlamaIndex, Semantic Kernel). The tooling is secondary. The non-negotiable is one interface for product teams, so provider choice and failover policy aren’t copy‑pasted into every code path.
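As a rough illustration of that single interface, here is a minimal sketch of a routing table with one explicit fallback per task. The task names, the model tier labels, and the `ProviderCall` signature are all assumptions made for the example, not any provider’s real API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical provider call signature: (model_name, prompt) -> completion text.
ProviderCall = Callable[[str, str], str]


@dataclass(frozen=True)
class Route:
    primary: str
    fallback: str


# Illustrative routing table; task names and model tiers are placeholders.
ROUTES = {
    "classify_intent": Route(primary="small-model", fallback="mid-model"),
    "draft_reply": Route(primary="mid-model", fallback="large-model"),
}


def complete(task: str, prompt: str, call: ProviderCall) -> tuple[str, str]:
    """Run `prompt` on the task's primary model, falling back once on failure.

    Returns (model_used, output). Unknown tasks raise KeyError, so a new
    workflow has to be registered instead of silently getting a default.
    """
    route = ROUTES[task]
    try:
        return route.primary, call(route.primary, prompt)
    except Exception:
        # One explicit fallback; a second failure propagates to the caller.
        return route.fallback, call(route.fallback, prompt)
```

Product code calls `complete("draft_reply", prompt, call)` and never names a model, which is what keeps provider choice and failover policy in one place.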
Policy and guardrails: enforcement has to live inside the run
Agent security isn’t “put a WAF in front of it.” It’s step-by-step control over what tools can be called, under which identity, against which datasets, and what the system is allowed to store or send onward. Deterministic services often get away with boundary-only enforcement. Agentic systems don’t. You need consistent checks across retrieval, tool invocation, and generation — otherwise the agent will route around your intentions.
Some teams embed Open Policy Agent (OPA) in middleware. Others take vendor guardrails (for example, Bedrock Guardrails or Azure content filtering) and wrap everything else with internal rules. Either path works only if the policy model is explicit: allowlists, least privilege, traceable identities, and a hard line between “draft” and “execute.”
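To make “explicit policy model” concrete, here is a minimal deny-by-default sketch that checks every retrieval and tool step against per-workflow allowlists. It is not OPA or any vendor guardrail; `WorkflowPolicy`, `check_step`, and the tool names are hypothetical, and a real deployment would also bind the check to an identity.

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised when a step falls outside the workflow's allowlists."""


@dataclass(frozen=True)
class WorkflowPolicy:
    allowed_tools: frozenset
    allowed_indexes: frozenset


def check_step(policy: WorkflowPolicy, kind: str, target: str) -> None:
    """Deny by default: anything not explicitly listed is a violation."""
    if kind == "retrieve":
        allowed = policy.allowed_indexes
    elif kind == "tool":
        allowed = policy.allowed_tools
    else:
        raise PolicyViolation(f"unknown step kind: {kind}")
    if target not in allowed:
        raise PolicyViolation(f"{kind} target not allowlisted: {target}")


# Illustrative policy for a support workflow; tool names are assumptions.
SUPPORT_TRIAGE = WorkflowPolicy(
    allowed_tools=frozenset({"lookup_ticket", "draft_reply"}),
    allowed_indexes=frozenset({"zendesk_kb", "internal_runbooks"}),
)
```

The point of calling `check_step` before every hop, rather than once at the boundary, is that an agent “trying something else” hits the same wall on step seven as on step one.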
Table 1: Control-plane patterns teams keep landing on (and the tradeoffs they can’t dodge)
| Approach | Best for | Typical latency overhead | Cost/lock-in profile |
|---|---|---|---|
| Cloud gateway (Bedrock / Vertex AI / Azure OpenAI) | Central IAM, audit hooks, procurement-friendly controls | Medium | Less ops work; tighter coupling to a cloud platform |
| API proxy + observability (self-hosted) | Custom routing, multi-provider portability, bespoke enforcement | Low to medium | More engineering; more control over vendors |
| App-level integration (direct SDK calls) | Prototypes, narrow workflows, single-team ownership | Low | Fast to ship; governance and forensics degrade with scale |
| Agent framework layer (LangGraph / Semantic Kernel) | Stateful tool flows, retries, multi-step orchestration | Variable | Quick iteration; coupling risk to framework choices |
| Full “AI platform” vendor (guardrails + evals + logging) | Organizations paying to standardize quickly | Medium to high | Higher subscription; faster path to shared controls |
Token economics: inference is a metered dependency, not a feature cost
Inference spend behaves like compute with a behavioral multiplier. Agents retry. Context grows. Retrieval becomes “just one more query.” Tool chains multiply. If you don’t enforce budgets and fail-closed limits, you’ve created an open meter inside production.
The metrics that matter connect usage to outcomes, not vibes: tokens per successful task, dollars per resolved ticket (or whatever your unit is), tool-call error rate, and guardrail-trigger rate (blocks, rewrites, escalations). Those numbers surface an uncomfortable truth fast: a system can look “high quality” and still be economically broken if it’s allowed to ramble and re-run.
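The arithmetic behind those metrics is simple, and the one design choice worth showing is the denominator: total spend is divided by successful outcomes, so failed and looping runs still count as cost. A minimal sketch, assuming a hypothetical per-run record with the fields below:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    tokens: int
    cost_usd: float
    succeeded: bool
    tool_calls: int
    tool_errors: int


def unit_economics(runs: list[Run]) -> dict[str, float]:
    """Spend per successful outcome; failed runs add cost but no successes."""
    successes = sum(r.succeeded for r in runs)
    if successes == 0:
        raise ValueError("no successful runs to divide by")
    total_calls = sum(r.tool_calls for r in runs)
    total_errors = sum(r.tool_errors for r in runs)
    return {
        "tokens_per_success": sum(r.tokens for r in runs) / successes,
        "usd_per_success": sum(r.cost_usd for r in runs) / successes,
        "tool_error_rate": total_errors / total_calls if total_calls else 0.0,
    }
```

With two clean runs and one failed run that burned three times the tokens, the per-success figures rise well above the per-run average, which is the gap the “high quality but economically broken” system hides.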
The cost wins are mostly unglamorous engineering: keep system prompts short, cache deterministic steps, avoid re-embedding unchanged content, cap retrieval, and force structured outputs so downstream steps don’t need a second pass. Model tiering is the other big lever: small models for intent and extraction, mid-tier for drafting, and top-tier only where the risk or ambiguity earns it.
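“Avoid re-embedding unchanged content” usually comes down to a content hash. The sketch below assumes an in-memory store and a hypothetical `Embedder` callable standing in for whatever embedding API you use; a production version would persist the hashes alongside the vectors.

```python
import hashlib
from typing import Callable

# Hypothetical embedding function: text -> vector.
Embedder = Callable[[str], list[float]]


class EmbeddingCache:
    """Re-embed a document only when its content hash changes."""

    def __init__(self, embed: Embedder) -> None:
        self._embed = embed
        # doc_id -> (sha256 of content, embedding vector)
        self._store: dict[str, tuple[str, list[float]]] = {}
        self.embed_calls = 0  # counts real (billable) embedding calls

    def get(self, doc_id: str, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = self._store.get(doc_id)
        if cached is not None and cached[0] == digest:
            return cached[1]
        self.embed_calls += 1
        vector = self._embed(text)
        self._store[doc_id] = (digest, vector)
        return vector
```

The `embed_calls` counter is there so the saving shows up as a number you can put next to the spend metrics.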
Key Takeaway
Cost control isn’t one setting. The repeatable gains come from control-plane discipline: routing, caching, retrieval caps, and budgets that degrade safely instead of detonating.
Evals aren’t research anymore — they’re release gates
Prompt tweaking falls apart under real churn: model updates, index updates, tool changes, policy changes. If you can’t catch regressions automatically, you’ll ship regressions automatically.
The mature pattern looks like release engineering: prompts, tool schemas, and policies are versioned artifacts; representative tasks are captured as a golden set (redacted); and CI blocks merges when success rates or policy compliance drop beyond an agreed threshold. This is most critical in workflows where a small failure is expensive: customer support, code changes, incident response, and anything that can trigger external actions.
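A merge gate of this kind can be very small. The sketch below assumes each golden-set task yields a hypothetical `EvalResult`, that a baseline is stored from the last accepted build, and that a two-point drop is the agreed threshold; all three are placeholders for whatever your team settles on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    task_id: str
    succeeded: bool
    policy_compliant: bool


def gate(results: list[EvalResult], baseline: dict[str, float],
         max_drop: float = 0.02) -> list[str]:
    """Return failure reasons; an empty list means the merge may proceed."""
    if not results:
        return ["golden set is empty"]
    n = len(results)
    current = {
        "success_rate": sum(r.succeeded for r in results) / n,
        "policy_compliance": sum(r.policy_compliant for r in results) / n,
    }
    return [
        f"{name} dropped from {baseline[name]:.2f} to {value:.2f}"
        for name, value in current.items()
        if value < baseline[name] - max_drop
    ]
```

In CI, the job would exit non-zero whenever `gate(...)` returns anything, and print the reasons so the author knows which number moved.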
Metrics worth tracking (and the ones that lie)
Track what maps to reality: task success, tool-call correctness, policy compliance, and time-to-resolution. Generic “response similarity” scores are easy to compute and often meaningless. Force structure whenever you can: JSON schemas, typed actions, function calls, and validations that fail loudly. If you use an LLM as a judge, treat it like a dependency: anchor it with references, do spot checks, and track disagreement so you notice drift.
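“Validations that fail loudly” can be sketched with nothing but the standard library. The action names and the two-field shape (`action`, `args`) are assumptions for the example; a real system would likely use a schema library, but the behavior to copy is the same: reject anything unexpected instead of guessing.

```python
import json

# Illustrative action names; a real list comes from your tool registry.
ALLOWED_ACTIONS = {"draft_email", "lookup_ticket"}


class InvalidAction(ValueError):
    """Raised when model output is not a well-formed, known action."""


def parse_action(raw: str) -> dict:
    """Parse model output into a typed action, or raise. Never repair it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidAction(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidAction("top-level value must be an object")
    extra = set(data) - {"action", "args"}
    if extra:
        raise InvalidAction(f"unexpected fields: {sorted(extra)}")
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise InvalidAction(f"unknown action: {action!r}")
    args = data.get("args")
    if not isinstance(args, dict):
        raise InvalidAction("args must be an object")
    return {"action": action, "args": args}
```

Each rejection is also a data point: the rate of `InvalidAction` per workflow is a cheap tool-call correctness signal.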
“You can’t improve what you don’t measure.” (a maxim commonly attributed to Peter Drucker)
Table 2: A control-plane checklist for shipping agents without surprises (build order matters)
| Control | Owner | Minimum bar | Signal to monitor |
|---|---|---|---|
| Model routing policy | Platform Eng | Multiple tiers/providers; explicit fallbacks | Provider error rate; cost per outcome |
| Prompt + tool versioning | App Eng | Prompts, schemas, policies in source control | Rollback frequency; change-linked regressions |
| Evals in CI | ML/AI Eng | Golden set + gating on merges | Pass rate trend; judge drift signals |
| Budget + rate limits | SRE/FinOps | Per-user/workflow caps; safe degradation paths | Spend anomalies; long-tail run time |
| Policy enforcement (DLP + tool auth) | Security | Least-privilege tool tokens; retrieval allowlists | Blocks/rewrites; boundary exceptions |
Compliance now lives in “agent permissions,” not a shared API key
Agents break an old comfort: humans had intent, services had constraints. Agents behave like software that invents its own next step. That forces a permission model that’s closer to workflow IAM than “this service account can call the CRM.” The workable design is granular permissions per step, explicit scopes, and full traces you can hand to audit without hand-waving.
Example: a sales ops agent can read opportunities and draft an email, but cannot send it. It can cite pricing docs, but cannot export a customer list. It can call a discount calculator, but cannot change contract terms. The rule is simple: split “generate” from “execute,” then require a human or an approval policy for execution in high-risk domains.
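The sales ops example maps onto a small permission table. This is a minimal sketch under stated assumptions: the action names are invented to mirror the example, and `approved_by` stands in for whatever approval policy or human sign-off you actually use.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    GENERATE = "generate"  # read, cite, compute, draft
    EXECUTE = "execute"    # changes something outside the run


# Illustrative grants for the sales ops agent; anything absent is denied.
PERMISSIONS = {
    "read_opportunities": Mode.GENERATE,
    "cite_pricing_docs": Mode.GENERATE,
    "call_discount_calculator": Mode.GENERATE,
    "draft_email": Mode.GENERATE,
    "send_email": Mode.EXECUTE,
}


def authorize(action: str, approved_by: Optional[str] = None) -> bool:
    """Allow generate-mode actions; require an approver for execute-mode ones."""
    mode = PERMISSIONS.get(action)
    if mode is None:
        return False  # never granted, e.g. exporting a customer list
    if mode is Mode.EXECUTE:
        return approved_by is not None
    return True
```

Note that an approval cannot unlock an action that was never granted: `authorize("export_customer_list", approved_by="someone")` is still false.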
Compliance follows the same shape. “In-region hosting” doesn’t solve retention, redaction, or audit requirements. Many enterprises now expect run-level forensics: what context was retrieved, which tools were called, what outputs were produced, tied to identity and timestamps. If you can’t produce that trace, procurement will treat your agent as a lab demo with a UI.
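One way to picture a run-level trace is as a single record that accumulates during the run and serializes to one JSON line. The field names below are assumptions chosen to match the list above (context, tools, outputs, identity, timestamps), not a standard schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class RunTrace:
    user_id: str
    org_id: str
    prompt_version: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: str = field(default_factory=_now)
    retrieved_doc_ids: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    output: str = ""
    tokens: int = 0

    def record_tool_call(self, name: str, args: dict, ok: bool) -> None:
        """Append one tool invocation with its own timestamp."""
        self.tool_calls.append({"tool": name, "args": args, "ok": ok, "at": _now()})

    def to_json(self) -> str:
        """Serialize the whole run as one JSON line for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)
```

If tool arguments can contain sensitive data, redaction would need to happen before `record_tool_call`, since the trace is exactly what gets handed to audit.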
A control plane you can ship this quarter (without a re-platform)
You don’t need a grand rebuild. Start by forcing all model calls through one door, then add the controls that stop the bleeding: traces, budgets, and policy checks on the workflows that can hurt you. Once those primitives exist, you can swap models, prompts, and tools without rewriting every product path.
A practical v1 for a small-to-mid-sized org is straightforward:
- One gateway for all model calls, even if it begins as a thin proxy to one provider.
- Standard traces: prompt and tool versions, retrieved doc IDs, tool calls, token counts, latency, and user/org identity.
- A retrieval contract: hard limits, required citations for high-stakes outputs, and explicit indexes per workflow.
- Budgets and circuit breakers: caps on retries, tool calls, tokens, and wall-clock time, plus defined degradation paths.
- An eval harness: start with a small golden set, then feed it from real failures.
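The budget and circuit-breaker item in that list is the one that stops the runaway loop, so it is worth sketching. This is a minimal per-run version under stated assumptions: the class and function names are hypothetical, the default limits are placeholders, and the “degradation path” is reduced to a fallback callable such as a handoff to a human.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a run crosses one of its hard limits."""


@dataclass
class RunBudget:
    max_tool_calls: int = 6
    max_tokens: int = 1200
    max_retries: int = 2
    max_seconds: float = 30.0
    tool_calls: int = 0
    tokens: int = 0
    retries: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, *, tool_calls: int = 0, tokens: int = 0, retries: int = 0) -> None:
        """Record usage, then trip the breaker if any limit is exceeded."""
        self.tool_calls += tool_calls
        self.tokens += tokens
        self.retries += retries
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool_calls")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("tokens")
        if self.retries > self.max_retries:
            raise BudgetExceeded("retries")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall_clock")


def run_step(budget: RunBudget, step, fallback):
    """Run one agent step; if the breaker trips, return the degraded result."""
    try:
        return step(budget)
    except BudgetExceeded as exc:
        return fallback(str(exc))
```

The agent loop calls `budget.charge(...)` before each model or tool call. Because the check lives outside the model, an agent that keeps “trying something else” runs out of budget instead of running out of your money.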
Many teams implement the first cut as a simple HTTP service that normalizes requests, applies routing rules, and enforces limits. The syntax is optional; the separation of concerns is not:
```yaml
# pseudo-config for an internal AI gateway (2026 pattern)
routes:
  - name: support_triage
    models:
      primary: gpt-4.1-mini
      fallback: claude-3.7-sonnet
    max_tokens: 1200
    max_tool_calls: 6
    retrieval:
      max_chunks: 6
      allow_indexes: ["zendesk_kb", "internal_runbooks"]
    policies:
      pii_redaction: true
      disallow_actions: ["send_email", "refund_customer"]
  - name: contract_review
    models:
      primary: gpt-4.1
      fallback: claude-3.7-opus
    max_tokens: 4000
    require_citations: true
    approvals:
      on_execute: "legal_ops"
```

The YAML isn’t the product. The product is the contract: application teams name intent (for example, contract_review) and the control plane decides how that intent runs safely, within budget, with evidence you can audit later.
Ownership: if it’s everyone’s job, it won’t exist
A control plane is an org choice pretending to be architecture. Put it only in Platform and it can drift into “no exceptions.” Put it only in ML and it can drift into “cool demos, weak ops.” The pattern that sticks is a small internal product team with clear SLAs and a mandate to make application teams faster while still enforcing non-negotiables.
The predictable failure mode is the “AI platform toll booth.” Centralize too hard, move too slowly, and teams will route around you by calling providers directly. That’s when budgets leak, logs fragment, and security loses traceability. The fix isn’t more rules. The fix is a paved road: a good SDK, defaults that make the right thing easy, and fast turnaround for exceptions.
Next action: pick one workflow that can burn money or break trust and put it behind a gateway with (1) a trace ID, (2) a budget, and (3) tool allowlists this sprint. If you still can’t answer “what did it do, what did it cost, and what data did it touch?” you’re not operating an agent. You’re running an uncontrolled production experiment.
Key Takeaway
If you can’t reconstruct an agent run end-to-end — inputs, retrieved context, tool calls, outputs, identity, and cost — you don’t have something you can govern. You have a liability that happens to speak in sentences.