The agent that breaks prod usually isn’t “wrong”—it’s unauthorized
The loudest failures in production agents don’t look like bad writing. They look like a tool call that should never have been possible: a refund issued from the wrong tenant, a deployment triggered outside change windows, a support macro sent with unredacted PII. That’s not “AI safety.” That’s basic identity and control-plane work you’d do for any service that can mutate state.
In 2023–2024, most teams shipped LLMs as interface sugar: draft text, summarize, autocomplete. By 2026, the work moved to execution: plan steps, call tools, commit changes, and loop until an outcome is reached. The practical difference is brutal. A chatbot can be flaky and still feel helpful. An agent that clicks buttons or moves money has to be boring: authenticated, authorized, observable, rate-limited, and budgeted.
The product ecosystem nudged developers in this direction. OpenAI’s Assistants/Responses APIs made tool calling the default path instead of prompt spaghetti. Anthropic’s tool use patterns and “computer use” demos pushed the idea that models can operate across real interfaces. Microsoft, Salesforce, and ServiceNow all market “agentic” work as a platform feature tied to tickets, cases, and approvals—not clever paragraphs. Once autonomy maps to business outcomes, it stops being a demo and starts being something Finance, Security, and Compliance can interrogate.
Architecture that survives contact with reality: orchestration, tools, and durable state
If your “agent” is a single LLM loop that keeps its memory in a prompt, you built a prototype. Production work has retries, partial failures, idempotency, timeouts, and humans who go offline. You need a system that can stop mid-flight, persist progress, and resume without improvising.
The stable pattern has three layers. First: orchestration you can replay. A workflow engine or state machine (Temporal, AWS Step Functions, Azure Durable Functions, Google Workflows) gives you checkpoints, retries, and explicit state transitions. Second: tool adapters with strict schemas. Every integration—GitHub, Jira, Stripe, Snowflake, SAP, Zendesk—needs typed inputs/outputs and real error semantics. Don’t let a model manufacture side effects with stringly-typed “commands.” Third: durable state. Keep “what happened” in an append-only log for audit and replay; treat retrieval (docs, embeddings, vector search) as a separate convenience, not the system of record.
One contrarian point: “multi-agent” is often a way to avoid designing boundaries. A single agent with a deterministic workflow and tight tool contracts is easier to secure and cheaper to run than a swarm debating in circles. Multi-agent setups earn their keep only when duties truly conflict—one proposes actions, another enforces policy, another handles communications. Even then, the orchestrator should be the decider and the policy engine should be the judge.
Teams that ship this consistently write an explicit agent contract: allowed tools, approval gates, maximum spend per run, timeouts, and required observability hooks. Autonomy without a contract is just undefined behavior with a friendly interface.
Table 1: Production orchestration options teams use for agents
| Approach | Strength | Typical agent use case | Operational trade-off |
|---|---|---|---|
| Temporal | Durable execution with replayable history | Long-running workflows that need retries, approvals, and resumability | More design discipline up front; engineers must respect workflow constraints |
| AWS Step Functions | Managed state machines with tight AWS integration | Agents coordinating AWS-native services and event-driven steps | State graphs can get verbose; non-AWS integrations need extra plumbing |
| Kubernetes + event bus (Argo/Knative) | Portable and flexible for platform teams | High-throughput routing, triage, enrichment, and batch processing | Higher ops overhead; durability semantics are easy to get wrong |
| In-app workflow engine (e.g., BullMQ/Celery) | Fast iteration with minimal new infrastructure | Early agent features embedded directly in product flows | Harder to guarantee replay, audit trails, and long-running correctness |
| SaaS automation (Zapier/Make/n8n) | Quick integrations across common SaaS systems | Prototyping and low-stakes back-office automation | Limited governance and testing; complex retries and audits can be painful |
Identity and permissions: “Which principal is acting?” is the whole problem
An agent that can take action must act as an identity. Treat that identity like a production service account—except with tighter auditing because the “business logic” is probabilistic and the inputs can be hostile. If the agent can change configs, touch customer data, or move money, you don’t grant it broad access and hope prompts keep it polite.
The clean pattern is one agent identity per workflow (and often per tenant), with least-privilege grants per tool. Example: a dispute-resolution agent can read charges and open a support case, but cannot perform write actions above a policy threshold without approval, cannot export bulk data, and cannot switch accounts. Implement this with your existing IAM (AWS IAM, Google Cloud IAM, Azure Entra ID) and put fine-grained decisions in a policy layer such as Open Policy Agent or Amazon Cedar.
Why scopes don’t model reality
OAuth scopes are blunt and static. Agents need authorization that depends on context: amount, geography, customer tier, incident severity, data class, time window, and whether a human signed off. Policy-as-code is the only approach that scales because it’s explicit, testable, and reviewable like any other change.
Audit trails aren’t paperwork—they’re the feature
Buyers that operate under regulation or serious security review ask for evidence before they ask for “accuracy.” They want to see what context the agent saw, which tools it called, what the policy engine decided, and which approvals were required. If you can’t replay a run end-to-end, you can’t defend it to an auditor or a customer.
“If you can’t explain it simply, you don’t understand it well enough.” — Albert Einstein
Guardrails that work under attack: sandboxing, validation, and engineered approvals
A “safety prompt” is not a control. Any agent that reads tickets, email, or chat will be prompt-injected. Any agent that scrapes the web will ingest hostile text. Any agent that calls tools will see weird outputs and brittle failure modes. Design like you’re building a payments system: distrustful boundaries and strict verification before side effects.
Start with sandbox-first execution. If an action can be previewed, planned, or dry-run, make that the default. Generate a machine-checkable action proposal (typed payloads, explicit diffs), then run automated checks: schema validation, policy checks, dependency checks, and budget checks. This is how you keep “the model decided” from becoming your root-cause analysis.
Human checkpoints still matter, but bolt-on approvals slow teams down and don’t prevent mistakes. Build approval tiers into the workflow so the agent can keep working: gather evidence, produce a concise diff, open the approval request in the right system, and wait. Low-risk actions can run unattended; high-risk actions require explicit sign-off; irreversible actions should never be one-click for an agent.
Key Takeaway
If you can’t write your guardrails as enforceable checks—schemas, policies, sandbox modes, and approval tiers—you don’t have guardrails. You have optimism.
Verification should be redundant. Use deterministic validators wherever possible (allowed commands, bounds checks, idempotency keys, known-safe templates). For sensitive steps, add an independent verifier—rules, a separate model, or both. Spending extra compute on verification is cheaper than cleaning up a bad deploy or a policy breach.
Economics: stop pricing tokens; start pricing outcomes
Finance will kill your agent program if you can’t answer one question: what did this automation buy us per unit of work? “Cost per million tokens” is vendor math, not operator math. Track cost per outcome: a ticket closed, an invoice matched, a lead enriched, a PR reviewed, a case escalated with the right evidence. That’s the only framing that survives budgeting.
Where costs actually come from: model choice, how much context you shove into prompts, how many verification calls you add, and how often you re-run work because the system isn’t durable. Mature stacks route routine steps to smaller models and reserve frontier models for the ambiguous parts. They also cache stable artifacts: embeddings, structured outputs, and retrieval results that don’t change minute-to-minute.
- Track cost per successful run, including retries, verification, and human rework—not just cost per request.
- Put budgets on workflows and make over-budget behavior explicit: degrade gracefully, escalate, or stop.
- Separate experimentation from production with different keys, limits, and retention rules.
- Measure deflection and resolution separately; deflection that drives churn is a hidden loss.
- Roll out autonomy in stages and watch error and escalation rates like you would for any release.
One more hard truth: token cost is usually not the real risk cost. The expensive failures are security incidents, compliance violations, and customer trust hits. That’s why high-risk domains (access control, payments, production deploys) should stay conservative until the audit trail and evals earn expanded permissions.
Table 2: Checklist for deciding how autonomous a workflow can be
| Workflow attribute | Low-risk signal | High-risk signal | Recommended autonomy |
|---|---|---|---|
| Financial impact per action | Low and capped by policy | High or uncapped exposure | Autonomous only under thresholds; approval above |
| Reversibility | Easy rollback and clear diffs | Irreversible or hard to unwind | Require human checkpoint for irreversible actions |
| Data sensitivity | Non-sensitive content | PII/PHI/PCI or regulated data | Constrain tools; tighten redaction, retention, and approvals |
| Error detectability | Failures caught by automated checks | Failures discovered late by customers | Staged rollout with higher verification and stricter gating |
| Tool maturity | Stable APIs and idempotent writes | Brittle UI automation or scraping | Prefer APIs; gate UI actions behind approvals |
Observability: treat every run like a distributed trace
Once agents touch real systems, you debug them like microservices—except you’re also dealing with nondeterminism and data privacy. Production-grade agent observability includes: prompt and tool-call traces, structured event logs, cost telemetry, and outcome scoring. Teams often stitch this together with OpenTelemetry plus vendor tools (Datadog, New Relic, Grafana) and model-focused platforms (LangSmith, Weights & Biases) depending on their stack.
The right mental model is simple: every agent run is a trace with spans for retrieval, inference, tool calls, retries, and approvals. Store enough to replay and answer “why did it do that?” but don’t dump raw prompts into logs without a plan. Prompts contain secrets, customer data, and internal context. Use redaction, tiered retention, and encrypted break-glass access for incident work.
Evals are a release gate, not a research project
Agents rot because their environment changes: APIs shift, docs drift, policies update, customer behavior evolves. Offline evals matter, but continuous evals are what keep you out of incident channels. Keep a scenario bank that reflects production inputs, score it regularly, and run an adversarial pack for prompt injection, malformed tool responses, and policy bypass attempts. Swap a model or edit a tool schema without evals and you’re shipping blind.
Incident response needs to be explicit: kill switches by workflow and tenant, rate limits, spend caps, and a clean way to answer four questions fast—what context was used, what policy allowed the step, what tool call executed, and what check failed to catch it. Autonomy increases the value of this discipline; it doesn’t reduce it.
# Example: minimal agent run record (JSONL) for audit + replay
{
"run_id": "run_2026_05_02_183012Z_9f31",
"agent": "billing-dispute-v3",
"tenant_id": "acme_co",
"model": "frontier-2026-02",
"budget_usd": 0.40,
"tool_calls": [
{"tool": "stripe.lookup_charge", "input_hash": "baf...", "status": "ok", "latency_ms": 184},
{"tool": "zendesk.create_ticket", "input_hash": "1ce...", "status": "ok", "latency_ms": 412}
],
"approvals": [{"type": "refund_threshold", "required": true, "approved": false}],
"outcome": {"status": "escalated", "reason": "amount_exceeds_threshold"},
"cost": {"prompt_tokens": 4120, "completion_tokens": 980, "usd": 0.27}
}
Rollout discipline: treat an agent like a new employee category
If you want agents in production, stop starting with the flashiest workflow. Start with boring, high-volume work that’s easy to audit and easy to undo: ticket triage, CRM enrichment, invoice matching, PR review, incident summarization. Give the system a narrow tool surface, measure one outcome, and earn broader permissions over time.
Rollout isn’t just engineering. Security signs off on identities and logging. Legal signs off on data use and retention. Finance sets spend caps and chargeback. Ops owns escalation paths. Frame the agent like a new role with a manager chain: what can it do on day one, what does it need to request, where does it wait for approvals, and how do you fire it instantly if it misbehaves?
- Choose one outcome metric that maps to throughput (cycle time, resolution rate, review latency).
- Build a typed tool surface with strict schemas, validation, and idempotent writes.
- Create a dedicated agent identity with least privilege and policy thresholds for sensitive actions.
- Instrument traces and cost from the first run; log for replay, redact by default.
- Ship autonomy in stages: draft → suggest with approval → autonomous under thresholds.
- Gate changes with evals every time you touch a model, prompt, retriever, tool adapter, or policy.
A prediction worth betting your roadmap on: the companies that win with agents won’t be the ones with the most impressive demos. They’ll be the ones that can answer, instantly and convincingly, “What did the agent do, under what permissions, under what policy, and can we replay it?” If you can’t answer that yet, pick one workflow and build the control plane first.