AGENT RELIABILITY LAUNCH CHECKLIST (2026)

Use this checklist to ship a tool-using agent that’s safe, measurable, and economically predictable.

1) DEFINE THE JOB
- Write a one-sentence “definition of done” for the agent’s primary workflow.
- List systems of record it will touch (e.g., Jira, GitHub, Salesforce, AWS).
- Identify irreversible actions (deletes, refunds, permission changes) and mark them as Tier 3–4.

2) CREATE A RISK TIER POLICY
- Tier 0: read-only. Tier 1: low-risk writes. Tier 2: business-impacting writes. Tier 3: prod/financial. Tier 4: regulatory/irreversible.
- For each tier, define required controls: allowlists, approvals, audit logs, rollback plans.

3) DESIGN THE EXECUTION ARCHITECTURE
- Separate planning from execution: planner proposes steps; executor runs tools.
- Require “preview/diff” output for any write action.
- Add deterministic validators (schema checks, regex/allowlists, policy engine like OPA).
- Set hard limits: max tool calls, max retries, max wall time.

4) PERMISSIONS AND IDENTITY
- Create a dedicated identity per agent (no shared API keys).
- Apply least privilege: only the minimal scopes/endpoints required.
- Store secrets in Vault/KMS; rotate on a schedule.

5) EVALUATION SUITE (BEFORE PRODUCTION)
- Build 50–200 representative tasks with expected outcomes.
- Include edge cases: partial data, ambiguous inputs, tool timeouts, permission errors.
- Define failure taxonomy: auth, tool error, parsing, planner error, policy violation.
- Gate releases on regression thresholds (e.g., Tier 1 success rate >85%, Tier 3 policy violations ~0%).

6) OBSERVABILITY AND AUDITABILITY
- Log: prompt/response (with redaction), tool inputs/outputs, latency per step, final state changes.
- Add tracing IDs so every run is replayable end-to-end.
- Build dashboards: success rate, p95 latency, cost per success, tool-call p95, top failure reasons.
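
The release gate from section 5 and the cost-per-success metric from section 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the `RunResult` record, its field names, and the `gate_release` function are hypothetical; the 85% threshold mirrors the example in section 5; and the "~0%" policy-violation target is interpreted here as exactly zero violations.

```python
from dataclasses import dataclass
from typing import Optional

# Failure taxonomy from section 5; the string labels are illustrative.
FAILURE_KINDS = {"auth", "tool_error", "parsing", "planner_error", "policy_violation"}


@dataclass
class RunResult:
    """One evaluation run (hypothetical record shape)."""
    task_id: str
    tier: int                      # risk tier 0-4 of the task
    success: bool
    cost_usd: float
    failure: Optional[str] = None  # one of FAILURE_KINDS when the run failed

    def __post_init__(self):
        if self.failure is not None and self.failure not in FAILURE_KINDS:
            raise ValueError(f"unknown failure kind: {self.failure}")


def success_rate(results, tier):
    """Fraction of successful runs at one tier, or None if the tier has no runs."""
    runs = [r for r in results if r.tier == tier]
    if not runs:
        return None
    return sum(r.success for r in runs) / len(runs)


def cost_per_success(results):
    """Total spend (failed runs included) divided by the number of successes."""
    successes = sum(r.success for r in results)
    if successes == 0:
        return None
    return sum(r.cost_usd for r in results) / successes


def gate_release(results, min_tier1_success=0.85, max_tier3_violations=0):
    """Return (ok, reasons). Thresholds are example values, not recommendations."""
    reasons = []
    rate = success_rate(results, tier=1)
    if rate is None:
        reasons.append("no Tier 1 tasks in the suite")
    elif rate <= min_tier1_success:
        reasons.append(
            f"Tier 1 success rate {rate:.0%} is not above {min_tier1_success:.0%}"
        )
    violations = sum(
        1 for r in results if r.tier >= 3 and r.failure == "policy_violation"
    )
    if violations > max_tier3_violations:
        reasons.append(f"{violations} Tier 3+ policy violation(s)")
    return (not reasons, reasons)
```

Counting failed runs in the numerator of cost per success is a deliberate choice in this sketch: it makes retries and abandoned runs visible in the metric instead of hiding them.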

7) COST AND LATENCY CONTROLS
- Implement model routing: small model for extraction/formatting, larger model for planning.
- Cache stable retrieval/tool results where freshness allows.
- Fail fast on repeated identical errors; don’t allow infinite retries.
- Track cost per successful task and cap worst-case run cost.

8) ROLLOUT PLAN
- Phase 1: shadow mode (agent proposes; human executes).
- Phase 2: automate Tier 0–1 with audit sampling.
- Phase 3: gated Tier 2+ with explicit approvals.
- Keep a kill switch and rollback mechanism for every workflow.

9) OPERATIONS
- Write runbooks for common failure modes (tool down, auth revoked, model drift).
- Define on-call ownership for the agent service.
- Hold postmortems for any incident involving agent actions.

Exit criteria: You can prove (with data) that the agent hits success targets, stays within cost/latency budgets, and cannot execute high-risk actions without the required policy gates and approvals.
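
Several of the controls above (hard limit on tool calls, worst-case cost cap, fail-fast on repeated identical errors, explicit approval for Tier 2+) can be enforced by one deterministic wrapper around every tool call. The following Python sketch is one possible shape, not a definitive implementation: `RunGuard`, the exception names, the default limits, and the caller-supplied `cost_usd` estimate are all hypothetical, and a wall-time limit is left out for brevity.

```python
from dataclasses import dataclass, field


class GuardStop(Exception):
    """Base class for errors that should end the run instead of being retried."""


class BudgetExceeded(GuardStop):
    pass


class RepeatedError(GuardStop):
    pass


class ApprovalRequired(GuardStop):
    pass


@dataclass
class RunGuard:
    """Per-run limits; the default values are illustrative only."""
    max_tool_calls: int = 20
    max_cost_usd: float = 1.00
    max_identical_errors: int = 2
    approval_tier: int = 2  # tiers at or above this need explicit approval
    tool_calls: int = field(default=0, init=False)
    cost_usd: float = field(default=0.0, init=False)
    _last_error: str = field(default="", init=False, repr=False)
    _error_streak: int = field(default=0, init=False, repr=False)

    def call(self, tool, *, tier, cost_usd, approved=False):
        """Run tool() if policy and budget allow; cost_usd is a caller estimate."""
        if tier >= self.approval_tier and not approved:
            raise ApprovalRequired(f"Tier {tier} action needs explicit approval")
        if self.tool_calls + 1 > self.max_tool_calls:
            raise BudgetExceeded("tool-call limit reached")
        if self.cost_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost cap would be exceeded")
        self.tool_calls += 1
        self.cost_usd += cost_usd
        try:
            result = tool()
        except Exception as exc:
            signature = f"{type(exc).__name__}: {exc}"
            if signature == self._last_error:
                self._error_streak += 1
            else:
                self._last_error, self._error_streak = signature, 1
            if self._error_streak >= self.max_identical_errors:
                # Same error again: stop instead of retrying indefinitely.
                raise RepeatedError(signature) from exc
            raise
        # A successful call resets the identical-error streak.
        self._last_error, self._error_streak = "", 0
        return result
```

In this sketch, a retry loop in the executor would catch ordinary tool exceptions and retry, but let any `GuardStop` propagate so the run ends. Because the checks live outside the model, they hold regardless of what the planner proposes.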