Agentic Reliability Stack Checklist (2026)

Use this checklist before you grant an AI agent tool access in production. The goal is simple: predictable outcomes, controlled blast radius, and audit-ready operations.

1) Define the job and success metric
- Write a one-sentence job statement (e.g., “triage inbound support tickets”).
- Define a task success metric (e.g., % correctly routed) and an error budget (e.g., <1% critical policy violations).

2) Map autonomy boundaries
- List every tool/action the agent can take.
- Categorize each action as Auto / Gated / Human-required based on financial impact, reversibility, and customer visibility.

3) Design constrained tools
- Prefer narrow, typed tools (purpose-built endpoints) over generic “run SQL” or “send email” tools.
- Version tool schemas; treat breaking changes like API changes.

4) Establish agent identity
- Choose agent-as-service-account vs. agent-on-behalf-of-user.
- Enforce least privilege and short-lived credentials (minutes, not days).

5) Add policy checks outside the model
- Implement pre-execution policy gates (role, tenant, amount caps, time windows, sanctions).
- Fail closed: if any check fails, do not execute.

6) Build an eval harness
- Create a golden dataset of real tasks (50–500 to start).
- Score end-to-end success, policy compliance, and tool-call correctness.

7) Add red-team scenarios
- Include prompt injection, data exfiltration attempts, and cross-tenant access tests.
- Track exploit rate over time; require improvements before expanding permissions.

8) Instrument tracing and replay
- Log prompts, retrieved sources, tool calls, inputs/outputs, approvals, and final actions.
- Ensure you can replay any run for debugging within 24 hours.

9) Set production SLOs
- Define latency SLOs (median and p95), cost per run, cost per success, and success rate.
- Alert on regressions after model, prompt, or tool changes.

10) Control cost proactively
- Use model routing (a small model for extraction/routing; a large model only when needed).
- Cap loops, retries, and maximum tool calls per run.

11) Roll out safely
- Run in shadow mode (agent recommends; humans execute) for 2–4 weeks.
- Expand behind a feature flag: 1% → 10% → 50% → 100% as evals and SLOs hold.

12) Prepare incident response
- Assign on-call ownership.
- Implement a kill switch that disables tool execution instantly.
- Document rollback procedures for every write action.

If you can’t check off at least 10 of the 12 items, your agent is still an experiment—treat it accordingly.
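Several items on the list compose naturally in code: the fail-closed policy gate (item 5), the per-run tool-call cap (item 10), and the kill switch (item 12). Below is a minimal sketch in Python; the names `Gate`, `Policy`, `RunState`, and `check` are illustrative, not from any particular framework, and the thresholds are placeholder values.

```python
from dataclasses import dataclass
from enum import Enum

class Gate(Enum):
    AUTO = "auto"              # agent may execute directly
    GATED = "gated"            # queued for human approval
    HUMAN = "human_required"   # human executes; agent only recommends

@dataclass
class Policy:
    gates: dict                # action name -> Gate (illustrative mapping)
    amount_cap: float = 100.0  # placeholder financial cap per action
    max_tool_calls: int = 10   # per-run tool-call budget (item 10)
    kill_switch: bool = False  # global disable for tool execution (item 12)

@dataclass
class RunState:
    tool_calls: int = 0        # tool calls consumed by this run

def check(policy: Policy, state: RunState, action: str, amount: float = 0.0) -> str:
    """Pre-execution policy gate. Fails closed: any action not explicitly
    categorized is denied, and any failed check blocks execution (item 5)."""
    if policy.kill_switch:
        return "deny: kill switch engaged"
    if state.tool_calls >= policy.max_tool_calls:
        return "deny: per-run tool-call cap reached"
    gate = policy.gates.get(action)
    if gate is None:
        return "deny: unknown action (fail closed)"
    if gate is Gate.HUMAN:
        return "deny: human-required action"
    if amount > policy.amount_cap:
        return "escalate: amount exceeds cap"
    if gate is Gate.GATED:
        return "escalate: gated action needs approval"
    state.tool_calls += 1   # only allowed calls consume the budget
    return "allow"
```

The key design choice is that the gate sits outside the model: the agent proposes an action, and this check runs before any tool executes, so a prompt-injected or confused agent still cannot exceed its caps or invoke an uncategorized action.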