Agent Reliability Launch Checklist (2026)

Use this checklist to ship an agent that executes tools safely, stays within budget, and can be audited after incidents.

1) Define the workflow contract
- Write a one-sentence job: “The agent will ____.”
- Define success metrics (example): <1% unsafe action attempts, >90% task completion, p95 latency <8s.
- Define acceptable failure outcomes (example): “Ask a human,” “Create a ticket,” “Return ‘unknown’.”

2) Map blast radius
- List every tool/action the agent can take.
- Tag each action as Read / Write / Irreversible.
- For each Write/Irreversible action, define required approvals and maximum allowed impact (e.g., a $50 refund cap).

3) Build typed tools (before prompts)
- Use strict schemas (JSON Schema or function signatures).
- Enforce enums for status codes and reason codes.
- Make tools idempotent (safe retries) and add rollback where possible.

4) Add policy enforcement
- Implement least-privilege credentials per tool.
- Add a policy gate (OPA or Cedar) that checks: user identity, data sensitivity tier, action type, limits.

5) Govern retrieval (RAG)
- Use permission-aware indexing (document ACLs).
- Add source allowlists and sensitivity tiers (Public / Internal / Confidential / Regulated).
- Store retrieval receipts: doc IDs, snippets, timestamps, identity used.

6) Create an evaluation set
- Collect 200–500 real traces (anonymized) representing your core workflow.
- Label outcomes: correct, incorrect, unsafe, needs-human, ambiguous.
- Add adversarial tests: prompt injection, conflicting instructions, partial tool failures.

7) Build the EvalOps pipeline
- Run regressions in CI on every prompt, tool, model, and retrieval change.
- Track: tool-call validity, action error rate, escalation rate, cost per task, p95 latency.
- Add drift monitoring (weekly): compare current metrics to baseline.

8) Budget controls
- Set max tool calls per task, max total tokens, and timeouts.
- Add a router rule: cheap model first; escalate to expensive model only when confidence is below threshold.
- Create alerts for spend anomalies (daily and monthly envelopes).

9) Observability + incident readiness
- Require trace IDs for every run.
- Log: prompts, structured outputs, tool parameters, policy decisions, retrieval receipts.
- Ship a kill switch: disable tools, downgrade to read-only mode, or route to humans.

10) Rollout plan
- Start with a shadow mode (agent proposes actions; humans execute).
- Move to gated mode (agent executes low-risk actions; humans approve high-risk).
- Only then enable full automation with SLOs and on-call ownership.

If you can’t produce evidence for steps 2, 6, 7, and 9, you don’t have a production agent; you have a demo.
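
Example for step 3 (typed tools). The following is a minimal sketch of a typed, idempotent tool, not a reference implementation. The refund domain, the names `RefundReason`, `RefundRequest`, `parse_refund_request`, and `RefundTool`, and the in-memory result store are all assumptions made for illustration; a real tool would persist idempotency keys in durable storage.

```python
from dataclasses import dataclass
from enum import Enum


class RefundReason(Enum):
    """Closed set of reason codes (hypothetical values)."""
    DAMAGED = "damaged"
    LATE = "late"
    DUPLICATE_CHARGE = "duplicate_charge"


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int
    reason: RefundReason
    idempotency_key: str

    def __post_init__(self):
        # Reject values the schema should never accept.
        if self.amount_cents <= 0:
            raise ValueError("amount_cents must be positive")


def parse_refund_request(raw: dict) -> RefundRequest:
    """Validate untrusted model output before it reaches the tool."""
    return RefundRequest(
        order_id=str(raw["order_id"]),
        amount_cents=int(raw["amount_cents"]),
        reason=RefundReason(raw["reason"]),  # ValueError on unknown codes
        idempotency_key=str(raw["idempotency_key"]),
    )


class RefundTool:
    """Illustrative tool: a retry with the same key never repeats the side effect."""

    def __init__(self):
        self._results = {}   # idempotency_key -> stored result (in-memory assumption)
        self.executions = 0  # counts real side effects, for demonstration

    def execute(self, req: RefundRequest) -> dict:
        if req.idempotency_key in self._results:
            return self._results[req.idempotency_key]
        self.executions += 1
        result = {
            "status": "refunded",
            "order_id": req.order_id,
            "amount_cents": req.amount_cents,
        }
        self._results[req.idempotency_key] = result
        return result
```

Parsing into an enum-backed dataclass means a malformed or invented reason code fails at the boundary, before any side effect runs.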
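
Example for steps 2 and 4 (blast radius and policy gate). This sketch shows one way the Read / Write / Irreversible tags and impact caps might feed a deny-by-default gate. It is a plain in-process stand-in, not OPA or Cedar; the tool names, the `REGISTRY` contents, and the `check_action` function are hypothetical, and only the $50 refund cap comes from the checklist.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    READ = "read"
    WRITE = "write"
    IRREVERSIBLE = "irreversible"


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class ToolSpec:
    name: str
    action_type: ActionType
    max_impact_usd: Optional[float] = None  # None: no monetary impact to cap
    requires_approval: bool = False


# Hypothetical registry; names and limits are illustrative only.
REGISTRY = {
    "lookup_order": ToolSpec("lookup_order", ActionType.READ),
    "issue_refund": ToolSpec("issue_refund", ActionType.WRITE, max_impact_usd=50.0),
    "delete_account": ToolSpec(
        "delete_account", ActionType.IRREVERSIBLE, requires_approval=True
    ),
}


def check_action(tool_name: str, amount_usd: float = 0.0, approved: bool = False) -> Decision:
    """Deny by default; allow only registered tools within their limits."""
    spec = REGISTRY.get(tool_name)
    if spec is None:
        return Decision.DENY  # unlisted tools are outside the mapped blast radius
    if spec.max_impact_usd is not None and amount_usd > spec.max_impact_usd:
        return Decision.DENY  # over the impact cap
    if spec.requires_approval and not approved:
        return Decision.NEEDS_APPROVAL
    return Decision.ALLOW
```

A production gate would also check user identity and data sensitivity tier, as step 4 lists; those inputs are omitted here to keep the sketch short.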
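
Example for steps 6 and 7 (evaluation set and regression gate). A rough sketch of how labeled traces could be summarized and checked in CI against the example thresholds from step 1. The names `TraceResult`, `summarize`, and `gate` are hypothetical, and counting only the "correct" label as task completion is an assumption; adjust it to your own labeling scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceResult:
    label: str        # correct, incorrect, unsafe, needs-human, or ambiguous
    latency_s: float  # end-to-end latency for the run


def p95(values):
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(results):
    """Aggregate a non-empty list of labeled traces into the tracked metrics."""
    n = len(results)
    return {
        "completion_rate": sum(r.label == "correct" for r in results) / n,
        "unsafe_rate": sum(r.label == "unsafe" for r in results) / n,
        "p95_latency_s": p95([r.latency_s for r in results]),
    }


def gate(metrics):
    """Return the names of metrics that miss the step 1 example thresholds."""
    failures = []
    if metrics["unsafe_rate"] >= 0.01:       # target: <1% unsafe
        failures.append("unsafe_rate")
    if metrics["completion_rate"] <= 0.90:   # target: >90% completion
        failures.append("completion_rate")
    if metrics["p95_latency_s"] >= 8.0:      # target: p95 < 8s
        failures.append("p95_latency_s")
    return failures
```

A CI job could fail the build whenever `gate(summarize(results))` is non-empty, and the same summary can be stored weekly as the baseline for drift comparison.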
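
Example for step 8 (budget controls). The per-task limits and the cheap-first router rule might look like the sketch below. The default limits, the 0.8 confidence threshold, and the names `TaskBudget`, `BudgetExceeded`, and `route` are assumptions, and the two models are represented by plain callables returning an `(answer, confidence)` pair rather than any real model API.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a task crosses one of its limits."""


@dataclass
class TaskBudget:
    max_tool_calls: int = 10     # illustrative defaults, not recommendations
    max_tokens: int = 20_000
    timeout_s: float = 30.0
    tool_calls: int = 0
    tokens: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tool_calls: int = 0, tokens: int = 0) -> None:
        """Record usage and stop the task as soon as any limit is crossed."""
        self.tool_calls += tool_calls
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool call limit")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token limit")
        if time.monotonic() - self.started_at > self.timeout_s:
            raise BudgetExceeded("timeout")


def route(task, cheap, expensive, threshold: float = 0.8):
    """Try the cheap model first; escalate only when its confidence is low."""
    answer, confidence = cheap(task)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = expensive(task)
    return answer, "expensive"
```

Calling `budget.charge(...)` before each tool call or model request keeps the limit check in one place, so a runaway loop fails fast instead of surfacing later as a spend anomaly.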