AGENT RELIABILITY LAUNCH CHECKLIST (2026)

Use this checklist to move from "agent demo" to "production system." Aim to complete Phase 0–1 before any external customer rollout.

1) SCOPE & BLAST RADIUS
- Define the agent's allowed actions (verbs) and forbidden actions in plain language.
- Classify each action by severity: P0 (money/admin), P1 (data access), P2 (customer comms), P3 (internal-only).
- Ensure high-severity actions are reversible (refund reversal policy, undo operations, or compensating transactions).

2) EVALUATIONS (OFFLINE)
- Create an initial scenario set (50–100 cases) with expected outcomes, including "must refuse" cases.
- Add adversarial tests: prompt injection, policy evasion, data-exfiltration attempts, and jailbreak variants.
- Track metrics by severity tier: pass rate, refusal accuracy, and unauthorized-tool-attempt rate.
- Run evals in CI for every change: model version, prompt template, tool schema, retrieval settings, policy rules.

3) OBSERVABILITY (RUNTIME)
- Log every run with: model/provider, model version, prompt/template hash, tool schema versions, retrieved doc IDs, tool inputs/outputs, policy decisions, and latency breakdown.
- Ensure runs are reproducible: store configuration snapshots and correlation IDs across services.
- Establish dashboards: p95 latency, run success rate, tool failure rate, blocked actions, retries per run, tokens per run.

4) POLICY-AS-CODE GUARDRAILS
- Implement deterministic authorization for each tool call (outside the model): role, tenant, amount thresholds, domains, time windows.
- Use scoped credentials per tool; never give the agent a "god token."
- Add step limits: max tool calls per run, max tokens per run, and retry budgets per tool.
- Create a "safe failure" path: when an action is denied, return an explanation plus an alternative (request approval, ask the user, or escalate).

5) HUMAN-IN-THE-LOOP (ONLY WHERE NEEDED)
- Define which actions require approval (e.g., refunds > $50, outreach to external domains, admin changes).
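The deterministic authorization and approval-threshold checks from sections 4–5 can be sketched as a small policy function that runs outside the model, before any tool call. This is a minimal sketch: the action catalog, the "operator" role, and the $50 refund threshold are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

# Hypothetical action catalog mapped to the severity tiers from section 1.
SEVERITY = {
    "issue_refund": "P0",
    "read_customer_record": "P1",
    "send_customer_email": "P2",
    "post_internal_note": "P3",
}
REFUND_APPROVAL_THRESHOLD = 50.00  # dollars; refunds above this need a human

@dataclass
class Decision:
    allowed: bool
    needs_approval: bool = False
    reason: str = ""

def authorize(action: str, params: dict, role: str) -> Decision:
    """Deterministic check performed outside the model, before the tool runs."""
    tier = SEVERITY.get(action)
    if tier is None:
        # Default-deny: an action missing from the catalog is never executed.
        return Decision(False, reason=f"unknown action: {action}")
    if tier == "P0" and role != "operator":
        return Decision(False, reason="P0 actions require operator role")
    if action == "issue_refund" and params.get("amount", 0) > REFUND_APPROVAL_THRESHOLD:
        # Safe-failure path: deny autonomous execution but offer the approval route.
        return Decision(False, needs_approval=True,
                        reason=f"refund over ${REFUND_APPROVAL_THRESHOLD:.2f} "
                               "requires human approval")
    return Decision(True)
```

The point of keeping this outside the model is that the check is auditable and cannot be talked out of its decision by a prompt injection.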
- Build an approval UI that shows: proposed action, parameters, evidence/citations, and policy reason.
- Track review load: after iteration, target fewer than 2–5% of total runs requiring human approval.

6) RELEASE MANAGEMENT
- Add canary rollouts (1–5% of traffic) for new prompts/models, with automatic rollback on regression.
- Maintain an incident runbook: how to disable a tool, rotate credentials, and freeze high-risk actions.
- Set SLOs and error budgets per workflow; page on unsafe execution, not just latency.

7) COST CONTROLS
- Set budgets: tokens per run, max steps, and monthly spend per tenant/workflow.
- Route cheaply: use small models for classification/routing; reserve large reasoning models for hard cases.
- Cache retrieval and tool results where safe; avoid repeated expensive calls.

EXIT CRITERIA (MINIMUM FOR EXTERNAL LAUNCH)
- 100% of runs traceable with versioned configs.
- Unsafe execution rate approximately zero; unsafe attempts are blocked and logged.
- Stable p95 latency target met for the workflow.
- Eval suite running in CI with clear pass thresholds by severity.
- Clear rollback and tool-disable procedures, tested at least once.
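The per-run step and token ceilings from sections 4 and 7 can be enforced with a simple counter charged on every agent step. A minimal sketch; the default limits and the `charge` interface are illustrative assumptions, not recommended values.

```python
class BudgetExceeded(Exception):
    """Raised when a run exhausts its step or token budget."""

class RunBudget:
    """Tracks per-run spend against hard ceilings (sections 4 and 7)."""

    def __init__(self, max_tool_calls: int = 10, max_tokens: int = 50_000):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.tool_calls = 0
        self.tokens = 0

    def charge(self, tool_calls: int = 0, tokens: int = 0) -> None:
        """Record spend for one step; raise if either ceiling is breached."""
        self.tool_calls += tool_calls
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool-call limit {self.max_tool_calls} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")
```

Calling `charge` before each tool invocation gives the agent loop a hard stop, so a runaway run fails into the same safe-failure path as a denied action instead of burning spend indefinitely.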