Agent Reliability Launch Checklist (2026)

Use this checklist to move from an agent prototype to a production deployment with measurable reliability. Target: one workflow shipped with clear autonomy boundaries in 30 days. (Illustrative Python sketches for each step follow the checklist.)

1) Define the workflow and success criteria
- Write a one-sentence “job”: what the agent does and where it operates (e.g., “Resolve order-status tickets and issue refunds up to $50”).
- Define Task Success Rate (TSR) with an objective outcome (ticket closed with correct tags; refund amount matches policy).
- Set non-functional SLOs: p95 latency (e.g., <= 5s), cost per task (e.g., <= $0.10), and max failure rate.

2) Scope tools and permissions (least privilege)
- Inventory required tools/APIs and classify by risk: read-only, reversible write, irreversible/regulatory.
- Create a dedicated agent identity (service principal) and issue scoped, short-lived credentials.
- Add allowlists for tools and destinations (domains, repos, projects) and hard rate limits.

3) Implement budgets and stop conditions
- Enforce max tokens, max tool calls, and max wall time per task.
- Add loop detection (repeated tool-call patterns, repeated failures) and a forced “escalate” outcome.
- Define “safe fallback” behaviors when budgets are exceeded (hand off to a human, respond with status-only output).

4) Add policy-as-code gates
- Express rules in a testable engine (OPA/Cedar or equivalent), not only in prompts.
- Required rules: action caps (e.g., refund <= $50), prohibited actions, PII handling, logging constraints.
- Ensure every decision is logged: policy version, rule ID, allow/deny, justification.

5) Build an evaluation suite (deployment gate)
- Curate 200–500 representative tasks from real history; remove sensitive data.
- Add 30–50 adversarial cases: prompt injection, ambiguous instructions, conflicting policies.
- Implement automated checks: schema validity, tool-argument correctness, policy compliance.
- Add human review for a sampled set and for all high-risk tiers.

6) Observability and audit readiness
- Emit structured traces for each run: run_id, prompt_version, model_version, tool calls, cost_usd, latency.
- Store replayable inputs/outputs (with redaction) so runs can be re-simulated.
- Create dashboards: TSR over time, cost/task, tool-call distribution, blocked actions, escalation rate.

7) Deployment process
- Run shadow mode: the new version sees live inputs but does not execute actions.
- Canary release: start with 5–10% of traffic and promote based on thresholds.
- Document rollback: how to revert prompt/model/policy versions within minutes.

8) Ongoing operations
- Weekly eval refresh: add new edge cases and failures to the test set.
- Monthly access review: confirm tool scopes and credentials are still minimal.
- Incident discipline: postmortems for policy violations, cost spikes, or TSR drops >2% week-over-week.

If you can’t measure TSR, cost/task, and policy violations daily, you’re not ready for autonomous execution; keep the agent in “assist” mode until you can.
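
The sketches below illustrate one way to implement several checklist items in Python. Every identifier, record shape, and threshold not quoted from the checklist is an illustrative assumption, not a prescribed implementation.

For step 1, a minimal sketch of computing TSR and checking the SLO targets over a batch of completed tasks. The `TaskRecord` fields are assumptions; the <= 5s and <= $0.10 targets come from the checklist:

```python
import math
from dataclasses import dataclass

@dataclass
class TaskRecord:
    succeeded: bool   # objective outcome met (ticket closed with correct tags, etc.)
    latency_s: float  # end-to-end wall time for the task
    cost_usd: float   # total model + tool cost for the task

def tsr(records: list[TaskRecord]) -> float:
    """Task Success Rate: fraction of tasks whose objective outcome was met."""
    return sum(r.succeeded for r in records) / len(records)

def p95(values: list[float]) -> float:
    """Nearest-rank p95; adequate for dashboard-scale samples."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def slo_report(records: list[TaskRecord]) -> dict:
    return {
        "tsr": tsr(records),
        "p95_latency_ok": p95([r.latency_s for r in records]) <= 5.0,              # target: <= 5s
        "mean_cost_ok": sum(r.cost_usd for r in records) / len(records) <= 0.10,   # target: <= $0.10/task
    }

if __name__ == "__main__":
    batch = [TaskRecord(True, 2.1, 0.04), TaskRecord(False, 6.3, 0.12), TaskRecord(True, 3.0, 0.05)]
    print(slo_report(batch))
```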
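
For step 2, a sketch of a least-privilege tool gate: explicit allowlists for tools and destinations plus a hard per-tool rate limit. The tool names, domain, and limits are assumptions:

```python
import time
from collections import defaultdict, deque

ALLOWED_TOOLS = {"get_order_status", "issue_refund"}   # inventoried, risk-classified tools
ALLOWED_DOMAINS = {"api.internal.example.com"}          # destination allowlist
RATE_LIMIT = 10          # max calls per tool...
RATE_WINDOW_S = 60.0     # ...per rolling 60-second window

_calls: dict[str, deque] = defaultdict(deque)  # timestamps of recent calls, per tool

def authorize_tool_call(tool: str, domain: str) -> None:
    """Raise PermissionError unless the call passes every allowlist and rate limit."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not on allowlist: {tool}")
    if domain not in ALLOWED_DOMAINS:
        raise PermissionError(f"destination not on allowlist: {domain}")
    now = time.monotonic()
    window = _calls[tool]
    while window and now - window[0] > RATE_WINDOW_S:
        window.popleft()                    # drop calls that fell outside the window
    if len(window) >= RATE_LIMIT:
        raise PermissionError(f"rate limit exceeded for {tool}")
    window.append(now)
```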
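
For step 3, a sketch of per-task budgets with loop detection and a forced escalation outcome. The budget values and the repeat threshold are assumptions:

```python
import time
from collections import Counter

class BudgetExceeded(Exception):
    """Signals the safe fallback: hand off to a human or reply status-only."""

class RunBudget:
    def __init__(self, max_tokens=20_000, max_tool_calls=15, max_wall_s=60.0, max_repeats=3):
        self.max_tokens, self.max_tool_calls = max_tokens, max_tool_calls
        self.max_wall_s, self.max_repeats = max_wall_s, max_repeats
        self.tokens = 0
        self.tool_calls = 0
        self.started = time.monotonic()
        self.call_signatures = Counter()   # (tool, args) pairs seen so far in this run

    def charge(self, tokens: int, tool: str | None = None, args: str = "") -> None:
        """Record usage for one step; raise BudgetExceeded when a limit trips."""
        self.tokens += tokens
        if tool is not None:
            self.tool_calls += 1
            self.call_signatures[(tool, args)] += 1
            # Loop detection: the same tool called with the same arguments more
            # than max_repeats times usually means the agent is stuck.
            if self.call_signatures[(tool, args)] > self.max_repeats:
                raise BudgetExceeded(f"loop detected on {tool}; escalate")
        if (self.tokens > self.max_tokens
                or self.tool_calls > self.max_tool_calls
                or time.monotonic() - self.started > self.max_wall_s):
            raise BudgetExceeded("budget exceeded; fall back to safe behavior")
```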
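
For step 4, a plain-Python stand-in for a policy-as-code gate; in production the checklist calls for OPA/Cedar or an equivalent engine. The rule IDs, policy version string, and the prohibited-action list are assumptions (the $50 cap is from the checklist), but the logged fields mirror the requirement: policy version, rule ID, allow/deny, justification:

```python
import json
import logging

POLICY_VERSION = "2026-01-15"   # hypothetical version tag for the active policy bundle
logging.basicConfig(level=logging.INFO)

def evaluate(action: dict) -> bool:
    """Return True to allow the action; log every decision either way."""
    decision, rule_id, why = True, "default-allow", "no deny rule matched"
    if action["name"] == "issue_refund" and action.get("amount_usd", 0) > 50:
        decision, rule_id, why = False, "REFUND-CAP-001", "refund exceeds $50 cap"
    elif action["name"] in {"delete_account", "export_pii"}:
        decision, rule_id, why = False, "PROHIBITED-002", "action is never allowed"
    logging.info(json.dumps({
        "policy_version": POLICY_VERSION,
        "rule_id": rule_id,
        "decision": "allow" if decision else "deny",
        "justification": why,
        "action": action["name"],
    }))
    return decision

if __name__ == "__main__":
    evaluate({"name": "issue_refund", "amount_usd": 75})   # denied by REFUND-CAP-001
```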
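
For step 5, a sketch of the automated layer of the eval suite, run against one recorded trace: schema validity, tool-argument correctness, and policy compliance. The expected output fields and the run-record shape are assumptions:

```python
def check_run(run: dict) -> list[str]:
    """Return failure reasons for one recorded run; an empty list means it passes."""
    failures = []
    # Schema validity: the final answer must carry the required fields.
    for field in ("ticket_id", "resolution", "tags"):
        if field not in run.get("output", {}):
            failures.append(f"missing output field: {field}")
    # Tool-argument correctness: refunds need a positive amount within the cap.
    for call in run.get("tool_calls", []):
        if call["name"] == "issue_refund":
            amount = call["args"].get("amount_usd", 0)
            if not (0 < amount <= 50):
                failures.append(f"refund amount out of policy range: {amount}")
    # Policy compliance: nothing the gate denied should have executed.
    if any(c.get("policy_decision") == "deny" and c.get("executed")
           for c in run.get("tool_calls", [])):
        failures.append("executed a policy-denied action")
    return failures
```

Adversarial cases from the suite feed the same checker; a run passes the deployment gate only when it produces no failures.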
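
For step 6, a sketch of a structured, replayable trace record. The field names mirror the checklist (run_id, prompt_version, model_version, tool calls, cost_usd, latency); the redaction helper is a deliberately rough assumption and real deployments need a proper scrubber:

```python
import json
import re
import time
import uuid

def redact(text: str) -> str:
    """Crude PII scrub (emails only) so stored inputs/outputs can be replayed."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)

def emit_trace(prompt_version, model_version, tool_calls, cost_usd, latency_s,
               user_input, output):
    record = {
        "run_id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_version": prompt_version,
        "model_version": model_version,
        "tool_calls": tool_calls,       # [{"name": ..., "args": ..., "result": ...}, ...]
        "cost_usd": cost_usd,
        "latency_s": latency_s,
        "input": redact(user_input),    # replayable, with redaction
        "output": redact(output),
    }
    print(json.dumps(record))           # stand-in for a real trace store / log pipeline
```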
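
For step 7, a sketch of a canary promotion check: promote only when the canary’s TSR has not meaningfully regressed against control and its SLOs still hold. The 2-point tolerance and the 200-task minimum are assumptions; the 5s latency target comes from step 1:

```python
def should_promote(canary_tsr: float, control_tsr: float,
                   canary_p95_s: float, canary_tasks: int) -> bool:
    """Gate for moving a canary from 5-10% of traffic to full rollout."""
    if canary_tasks < 200:               # don't promote on thin evidence
        return False
    if canary_tsr < control_tsr - 0.02:  # allow at most a 2-point TSR regression
        return False
    if canary_p95_s > 5.0:               # latency SLO from step 1 must still hold
        return False
    return True
```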
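
For step 8, a sketch of the incident trigger, reading the checklist’s “>2% week-over-week” as 2 percentage points of absolute TSR; the alerting hook is an assumption:

```python
def tsr_regression(this_week: float, last_week: float, threshold: float = 0.02) -> bool:
    """True when the week-over-week TSR drop exceeds the postmortem threshold."""
    return (last_week - this_week) > threshold

if tsr_regression(this_week=0.91, last_week=0.95):
    print("open an incident: TSR dropped >2% week-over-week")
```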