Most agent incidents aren’t “AI problems” — they’re permissions problems
The recurring pattern behind messy agent rollouts is boring: an agent got access it didn’t need, executed a tool call nobody expected, and the team had no trace that explains what happened. Then the postmortem turns into a prompt review instead of an access review.
By 2026, “agentic AI” isn’t shorthand for a chat UI with a couple of tools. It’s software that plans multi-step work, touches real systems, and tries again when it fails. That puts it in the same category as workflow automation, service accounts, and oncall ownership—not product copy.
Two things made this operationally feasible: tool calling got reliable enough to trust in controlled lanes, and orchestration patterns matured so teams can run agents as workflows with retries, timeouts, and audits. The open question isn’t whether to ship agents. It’s which workflows deserve autonomy, and which should stay in “propose mode” forever.
There’s also a money reality: once an agent runs continuously, cost stops looking like seats and starts looking like runtime. Tokens are only part of it; tool calls, retrieval, and eval pipelines all show up on the bill. If finance can’t understand the spend model, your agent program won’t survive its first incident.
The production agent stack is five layers — treat them like fault domains
“Agent” is a marketing word. In production it’s a stack with separate owners and separate failure modes: (1) model runtime, (2) orchestration, (3) tools, (4) memory and retrieval, and (5) controls (auth, policy, evals, observability). Blending these layers into one codebase is how teams end up with outages they can’t isolate.
Frameworks such as LangGraph are popular because they force you to write the flow as a state machine with explicit branches and human handoffs. Workflow engines such as Temporal optimize for durable execution, deterministic replay, and a clean audit trail. Pick based on your blast radius: if an agent can change customer state, move money, or touch production infrastructure, you’ll want replayable workflows and strict idempotency.
Tool access is the perimeter now
The riskiest part of an agent is not the text it generates. It’s the endpoints it can hit. A wrong answer in a chat is annoying; a wrong call to a billing, CRM, or admin API is an incident. Operators that stay sane do three things consistently: scope tools per role, enforce typed inputs (JSON Schema or OpenAPI), and gate high-risk actions behind deterministic checks or human approval. Free-form “stringly-typed” tools are how you get surprise SQL, surprise emails, and surprise refunds.
Memory is a policy choice, not a feature checkbox
Most memory failures are governance failures: saving the wrong data, for too long, in the wrong store. A practical split is: ephemeral scratchpad (not retained), user-approved long-term preferences (explicitly managed), and immutable operational logs of actions (retained for audit). Mixing those three guarantees either privacy headaches or unusable personalization.
Table 1: Common orchestration options for production agents (operator view)
| Approach | Best For | Strength | Trade-off |
|---|---|---|---|
| LangGraph (LangChain) | Iterating on stateful agent flows | Clear branching, retries, human review nodes | Audit/replay requires extra plumbing |
| Temporal | Durable workflows with strict correctness needs | Deterministic replay, retries, operational visibility | More engineering discipline; LLM calls must be made safe to retry |
| AWS Step Functions | AWS-native orchestration under IAM governance | Managed scaling, visual workflows, strong identity integration | Can get expensive and noisy at high state transition volume |
| Custom (event-driven + queues) | Tight constraints or legacy-first environments | Full control over runtime, storage, and policies | You own tracing, evals, and every operational sharp edge |
| Microsoft Copilot Studio + Power Platform | Microsoft 365-centric organizations | Fast rollout with governance hooks and connectors | Limited flexibility for bespoke systems and deeper controls |
ROI only shows up when you measure the workflow, not the model
If your success metric is “the agent answered,” you’re measuring theater. The only numbers that matter are operational: time-to-resolution, cost per case, conversion throughput, incident rate, and how often humans need to intervene. Agents don’t live in a sandbox; they live in permission boundaries, messy data, and exception paths.
That’s why the strongest deployments cluster around workflows that already have instrumentation: support intake and routing, sales ops data hygiene, internal oncall assistance, and back-office processing with clear definitions of “done.” A practical rule: if your workflow doesn’t have a clean baseline, you can’t claim improvement—so build the baseline first.
Cost discipline is where serious teams separate from demo teams. Runtime spending is the obvious part. The hidden part is what makes the program stable: evaluation suites, trace storage, review queues, and the engineering time required to keep tool contracts from drifting. If you don’t budget for that work, you end up paying in incidents and rollbacks.
“You don’t get to opt out of governance. You can only decide whether it’s designed or accidental.” — Meredith Whittaker
Governance: treat each agent as a privileged identity
Once an agent can open Jira tickets, read customer records, change billing, or run deploy steps, you’ve created a new actor in your environment. Handle it like a service account: least privilege, secret rotation, environment separation, and audit logs you can defend.
A baseline that holds up under scrutiny is simple and strict: each agent has a role with a permission manifest; each tool is typed and validated; high-risk actions are gated by deterministic checks or explicit approvals; every action is logged with enough context to reproduce the decision. If you can’t reconstruct an “explainable trace” (inputs, retrieved references, tool calls, policy decisions), you can’t debug—and you definitely can’t audit.
Policy engines and sandboxes are the control primitives that matter
Teams that run agents safely put a deterministic policy layer between the model and tools. Open Policy Agent (OPA) is a common choice in Kubernetes-heavy stacks; Cedar is used where teams want policy-as-code with tight authorization semantics. The pattern is consistent: the agent proposes, the policy decides. Anything that mutates state can be forced through approval thresholds, environment rules, or denial lists.
Sandboxes are the other half of the story. If an agent generates code, queries, or config, it should run in an isolated environment first and move through CI/CD like any other change. If the agent can’t be constrained to safe lanes, it doesn’t belong in production automation.
Key Takeaway
If an agent can change state, govern it like a service account: least privilege, deterministic policy checks, and complete audit trails. Prompts are not access control.
One detail that keeps biting teams: “human approval” fails if the review queue is designed like an email inbox. Keep batches small, show diffs, attach risk scores, and make it easy to say “no” quickly. The point of review is to catch edge cases, not to rubber-stamp automation.
Evals and observability aren’t optional — they’re how you operate stochastic systems
Shipping agents by eyeballing outputs is a great way to build a demo and a terrible way to run production. Agents are probabilistic decision-makers wired into deterministic systems. You must test both: whether they choose the right actions and whether those actions are safe.
Strong eval programs usually look like three layers: unit-style checks (schema validity, tool-call correctness, policy compliance), scenario suites (end-to-end workflow outcomes), and adversarial tests (prompt injection, data exfiltration attempts, tool misuse). Treat hostile inputs as the default, not the exception.
Log traces that help oncall without turning logs into a liability
“Log everything” is how teams create a privacy incident while trying to prevent an agent incident. Prefer structured traces with redaction and hashing for sensitive fields. Log tool names and outcomes, policy decisions, latency, token usage, and retrieval IDs—not raw document bodies or customer data. Many teams keep two streams: a short-retention operational trace for debugging and an immutable, minimal compliance ledger for audits.
# Example: minimal agent trace event (JSONL)
{
"ts": "2026-04-10T03:14:22Z",
"agent_id": "support-triage-v3",
"session_id": "a1f8...",
"model": "gpt-4.1-mini",
"retrieval": {"index": "kb-prod", "doc_ids": ["KB-1821", "KB-4470"]},
"tool_call": {"name": "crm.updateCase", "args": {"caseId": "C-88319", "priority": "P2"}},
"policy": {"decision": "allow", "rule": "case_priority_write"},
"result": {"status": "ok"}
}
Alerting should be tied to harm and risk, not vibes. Alert on spikes in policy denials, tool-call error rates, runaway retries, abnormal cost per run, or drift in outcome distributions. If an oncall engineer can’t answer “what happened and what do we do next?” from the dashboard, the system isn’t operable.
Cost and latency decide which agents survive contact with reality
Agent systems don’t scale like per-seat SaaS. Spend and latency climb with tokens, tool calls, retrieval, and retries. The winners are rarely the teams with the fanciest model; they’re the teams that design flows that avoid thrash.
Start by routing work: small models for classification and extraction, larger models only for the steps that actually need them. Replace open-ended reasoning with structured outputs and verification steps. Cache what can be cached (policy docs, account status) with clear invalidation rules. Put rate limits and backpressure in front of flaky dependencies so an external outage doesn’t turn into an expensive retry storm.
Table 2: Production readiness checklist for deploying an agent into a core workflow
| Area | Minimum Standard | Owner | Go/No-Go Signal |
|---|---|---|---|
| Permissions | Least-privilege role with scoped tool access | Security/Platform | No production mutation outside an explicit allowlist |
| Policies | Deterministic gates for high-risk actions | Security + Product | Refund/PII/infra actions require policy approval or a human step |
| Evals | Regression suite plus adversarial tests | ML/Eng | Stable behavior across model/tool/index changes |
| Observability | Traces, spend metrics, tool error rates | SRE | Oncall can diagnose a failed run quickly from dashboards |
| Fail-safes | Timeouts, circuit breakers, safe fallback | Platform | Dependency outages don’t trigger runaway retries or spend spikes |
Latency is UX. If the user is staring at a spinner, trust drops fast. For customer-facing experiences, design explicitly async flows: background runs, progress updates, and confirmations for state changes. For internal agents, longer runtimes can be fine—if the trace is good and the failure modes are obvious.
Rollout: don’t “deploy an agent,” introduce a new operator into the org
The cultural failure modes cut both ways: nobody trusts the agent, or everyone trusts it blindly. Treat adoption like introducing a new ops role. Define scope, escalation paths, and what happens when the agent hits ambiguity. Make reporting failures easy, and make investigation fast.
Start with a workflow that is high-volume, low-risk, and already measured: triage, routing, enrichment, backlog grooming. Graduate to controlled writes: draft changes, stage updates, propose refunds, open PRs. Autonomous production writes are the final step, and only behind policy gates and spend caps.
- Choose a workflow with clean inputs and an auditable definition of “done” (your KPI should already exist).
- Define tool contracts (typed schemas, strict allowlists, sandbox endpoints where possible).
- Launch in propose mode (agent drafts; a human or policy gate approves).
- Ship with evals and a rollback path (regression suite, adversarial tests, kill switch).
- Expand permissions deliberately (read-only → staging writes → narrow production writes).
- Review spend and SLOs on a fixed cadence (cost per run, tool error rate, outcome drift).
Operational ownership has to be explicit or the system becomes untouchable. Security owns policy. SRE owns uptime and spend anomalies. Product owns acceptable risk and user impact. Engineering owns tool contracts and failure handling. If those names aren’t written down, every change becomes a fight.
- Name a single DRI per agent with authority to ship fixes and pull the kill switch.
- Publish a permission manifest the same way you would for any privileged service identity.
- Set hard spend caps per run and per day, tied to alerts well before the cap hits.
- Make failure reportable: one click creates an issue with trace IDs attached.
- Run postmortems for agent incidents with concrete follow-ups, not prompt blame.
If you want a real test of readiness, ask this: could a new oncall engineer debug a bad agent run using only the trace, the policy decision log, and tool-call history? If the answer is no, that’s your next sprint.