Agent Reliability Readiness Checklist (2026)

Use this checklist to move from "agent demo" to "production operator" with bounded autonomy, predictable cost, and auditable behavior. Minimal Python sketches for several of the steps follow the checklist.

1) Pick one workflow with a denominator
- Define the unit: per ticket, per invoice, per lead, per incident.
- Write a single-sentence definition of success and failure.

2) Define the action envelope (bounded autonomy)
- List allowed tools and disallowed tools.
- Decide read-only vs. write-enabled phases.
- Set a maximum number of tool calls per task (e.g., 10) and a maximum number of model calls (e.g., 20).

3) Establish identity + least privilege
- Use scoped OAuth/service accounts per environment.
- Implement allowlists (domains, projects, repositories, ticket queues).
- Require idempotency keys for any write action.

4) Set a cost budget per completed task
- Choose a p95 target (example: $0.25 per support resolution; $2.00 per internal ops task).
- Track retries and tool costs; don't measure tokens alone.
- Add hard caps: stop and escalate if the budget is exceeded.

5) Instrument end-to-end tracing before shipping
- Generate a trace_id per task.
- Record: prompts, retrieved-document metadata, tool calls (args + results), and final actions.
- Ensure every tool execution is linked to a trace_id (block writes without one).

6) Build an initial eval suite
- Curate 50–100 "gold" tasks from real historical cases.
- Add adversarial tests: prompt injection, data-exfiltration attempts, and refusal edge cases.
- Define metrics: task success, escalation correctness, undo rate, p95 latency, p95 cost.

7) Add schema validation for structured outputs
- Use JSON Schema/Pydantic validation.
- Reject invalid tool arguments and force a corrected retry.
- Prefer narrow tools (purpose-built endpoints) over general "execute anything" tools.

8) Add verification + gating before any write
- Policy-as-code checks (tenant, data class, destination, amount caps).
- Consistency checks (e.g., refund amount <= cap; email contains the required footer).
- Optional second-pass critique for high-stakes steps.

9) Implement human fallback paths
- Define who gets escalations and the SLA (e.g., <5 minutes internal).
- Provide "dry-run diffs" and approvals during early launch.
- Track undo actions and require root-cause notes for reversals.

10) Stage rollout with feature flags
- Start with internal users, then 1–5% of customers, then expand.
- Gate expansion on metrics: success rate, undo rate, and p95 cost.
- Keep kill switches for tool writes, external messaging, and spending actions.

11) Run postmortems like an SRE team
- For any high-severity error: reconstruct the trace, replay it, and document the root cause.
- Track incident rate per 1,000 tasks.
- Add the failure case to the eval suite to prevent regressions.

12) Plan for provider/model portability
- Separate orchestration and evaluation from model providers.
- Maintain a routing strategy (a cheap model for extraction; a strong model for synthesis).
- Re-run the eval suite before changing any model, prompt, tool, or retrieval index.

If you can't confidently pass steps 1–6, you're not ready for autonomous writes. Treat reliability as the product: budgets, policies, tracing, and evals are what make agents shippable in 2026.
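Step 2's limits are easiest to enforce as a single object the agent loop must consult before every call. Here is a minimal sketch, assuming an in-process loop; `ActionEnvelope` and `EnvelopeExceeded` are hypothetical names, not a library API:

```python
from dataclasses import dataclass

class EnvelopeExceeded(Exception):
    """Raised when a task hits a bounded-autonomy limit."""

@dataclass
class ActionEnvelope:
    allowed_tools: frozenset          # tool allowlist for this workflow
    write_enabled: bool = False       # read-only phase by default
    max_tool_calls: int = 10
    max_model_calls: int = 20
    tool_calls: int = 0
    model_calls: int = 0

    def before_tool_call(self, tool: str, is_write: bool) -> None:
        if tool not in self.allowed_tools:
            raise EnvelopeExceeded(f"tool not on allowlist: {tool}")
        if is_write and not self.write_enabled:
            raise EnvelopeExceeded(f"write tool called in read-only phase: {tool}")
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise EnvelopeExceeded("tool-call budget exhausted; escalate to a human")

    def before_model_call(self) -> None:
        self.model_calls += 1
        if self.model_calls > self.max_model_calls:
            raise EnvelopeExceeded("model-call budget exhausted; escalate to a human")
```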
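For step 3's idempotency keys, one common approach is to derive the key deterministically from the trace ID plus the canonical action payload, so a retried write carries the same key and downstream deduplication (e.g., an Idempotency-Key header, as Stripe-style APIs support) prevents double execution. The helper below is a hypothetical sketch of that derivation:

```python
import hashlib
import json

def idempotency_key(trace_id: str, action: str, payload: dict) -> str:
    """Deterministic key: same trace + action + arguments => same key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    raw = f"{trace_id}:{action}:{canonical}".encode()
    return hashlib.sha256(raw).hexdigest()

# A retry of the same refund produces the same key, so the API can dedupe it.
key = idempotency_key("trace-123", "issue_refund", {"ticket": "T-9", "usd": 12.5})
```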
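Step 4's hard cap only works if every model and tool call is charged against the task as it happens. A minimal sketch, assuming you can price each call as it completes; the names and prices are illustrative:

```python
class BudgetExceeded(Exception):
    """Stop the task and escalate; never silently keep spending."""

class CostMeter:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.line_items: list[tuple[str, float]] = []  # kept for the trace

    def charge(self, label: str, usd: float) -> None:
        self.line_items.append((label, usd))
        self.spent_usd += usd
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} against cap ${self.cap_usd:.2f}"
            )

meter = CostMeter(cap_usd=0.25)         # p95 target for a support resolution
meter.charge("model:extract", 0.004)    # every model call is charged...
meter.charge("tool:crm_lookup", 0.010)  # ...and every tool call, retries included
```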
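For step 5, the load-bearing rule is that a write without a trace_id never executes. A sketch using one JSON line per event, assuming structured logs are enough for a first version; the `TOOLS` registry and the tool in it are hypothetical:

```python
import json
import time
import uuid

TOOLS = {"crm_lookup": lambda ticket_id: {"status": "open", "ticket_id": ticket_id}}

def new_trace_id() -> str:
    return uuid.uuid4().hex

def log_event(trace_id: str, kind: str, **fields) -> None:
    # One JSON line per event: prompts, retrieval metadata, tool calls, final actions.
    print(json.dumps({"trace_id": trace_id, "ts": time.time(), "kind": kind, **fields}))

def execute_tool(trace_id: str | None, tool: str, args: dict, is_write: bool = False):
    if is_write and not trace_id:
        raise RuntimeError("write blocked: no trace_id")  # unauditable writes never run
    log_event(trace_id, "tool_call", tool=tool, args=args)
    result = TOOLS[tool](**args)
    log_event(trace_id, "tool_result", tool=tool, result=result)
    return result

execute_tool(new_trace_id(), "crm_lookup", {"ticket_id": "T-9"})
```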
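For step 6's metrics, even a small harness over the gold tasks can compute the gate numbers directly. A sketch assuming each eval run yields one record per task; the record fields are illustrative:

```python
from statistics import quantiles

def p95(values: list[float]) -> float:
    return quantiles(values, n=20)[-1]  # 95th percentile

runs = [
    {"success": True,  "escalated_correctly": True, "undone": False, "latency_s": 3.1, "cost_usd": 0.11},
    {"success": False, "escalated_correctly": True, "undone": False, "latency_s": 8.0, "cost_usd": 0.31},
    # ... one record per gold task
]

n = len(runs)
report = {
    "task_success": sum(r["success"] for r in runs) / n,
    "escalation_correctness": sum(r["escalated_correctly"] for r in runs) / n,
    "undo_rate": sum(r["undone"] for r in runs) / n,
    "p95_latency_s": p95([r["latency_s"] for r in runs]),
    "p95_cost_usd": p95([r["cost_usd"] for r in runs]),
}
```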
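Step 7 names Pydantic; a minimal sketch of the reject-and-retry loop follows. `call_model` is a hypothetical stand-in for your model client, and the refund cap is illustrative:

```python
from typing import Callable

from pydantic import BaseModel, Field, ValidationError

class RefundArgs(BaseModel):
    ticket_id: str
    amount_usd: float = Field(gt=0, le=50)  # illustrative cap; see step 8
    reason: str = Field(min_length=10)

def parse_with_retry(
    prompt: str, call_model: Callable[[str], str], max_attempts: int = 3
) -> RefundArgs:
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            return RefundArgs.model_validate_json(raw)
        except ValidationError as err:
            # Feed the machine-readable errors back and force a corrected retry.
            feedback = f"\n\nYour last output failed validation:\n{err.json()}"
    # Retries exhausted: escalate rather than guessing at arguments.
    raise RuntimeError("schema validation failed after retries; escalate to a human")
```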
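Step 8's policy-as-code checks can start as plain predicates evaluated before every write; a dedicated policy engine (e.g., OPA) can replace them later without changing the call site. A hypothetical sketch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WriteRequest:
    tenant: str
    data_class: str     # e.g. "public", "internal", "pii"
    destination: str    # e.g. "internal", "external"
    amount_usd: float

POLICIES = [
    ("tenant on allowlist", lambda r: r.tenant in {"acme", "internal"}),
    ("pii stays internal",  lambda r: not (r.data_class == "pii" and r.destination == "external")),
    ("amount under cap",    lambda r: r.amount_usd <= 50.0),
]

def violations(req: WriteRequest) -> list[str]:
    """Names of violated policies; the write proceeds only if this is empty."""
    return [name for name, check in POLICIES if not check(req)]

req = WriteRequest("acme", "pii", "external", 12.0)
assert violations(req) == ["pii stays internal"]  # blocked; escalate with the reason
```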
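Step 10's kill switches are only trustworthy if every risky action checks a flag that defaults to off and can be flipped without a deploy. A sketch using an in-memory dict as a stand-in for your flag service; all names are hypothetical:

```python
# Default-off flags; in production these live in a flag service so they can be
# flipped instantly, without a deploy.
FLAGS = {
    "tool_writes": False,
    "external_messaging": False,
    "spending_actions": False,
}

def require_flag(name: str) -> None:
    if not FLAGS.get(name, False):  # unknown flags are treated as off
        raise PermissionError(f"kill switch engaged: {name}")

def send_customer_email(body: str) -> None:
    require_flag("external_messaging")  # checked at the action site, not in the planner
    ...  # hand off to the mailer
```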
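Step 12's routing strategy can be a thin indirection layer so orchestration code never names a provider directly; swapping models then touches one table, after which the eval suite is re-run. The client interface and model names below are assumptions:

```python
from typing import Protocol

class ModelClient(Protocol):
    def complete(self, prompt: str) -> str: ...

# One table decides which model handles which kind of step.
ROUTES = {
    "extract": "cheap-model",      # structured extraction: cheap and fast
    "synthesize": "strong-model",  # customer-facing synthesis: strongest model
}

def route(task_kind: str, clients: dict[str, ModelClient], prompt: str) -> str:
    return clients[ROUTES[task_kind]].complete(prompt)
```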