AGENT RELIABILITY LAUNCH CHECKLIST (2026)

Use this checklist to move from "agent demo" to "production system." Aim to complete Phase 0–1 before any external customer rollout.

1) SCOPE & BLAST RADIUS
- Define the agent's allowed actions (verbs) and forbidden actions in plain language.
- Classify each action by severity: P0 (money/admin), P1 (data access), P2 (customer comms), P3 (internal-only).
- Ensure high-severity actions are reversible (refund reversal policy, undo operations, or compensating transactions).

2) EVALUATIONS (OFFLINE)
- Create an initial scenario set (50–100 cases) with expected outcomes, including "must refuse" cases.
- Add adversarial tests: prompt injection, policy evasion, data-exfiltration attempts, and jailbreak variants.
- Track metrics by severity tier: pass rate, refusal accuracy, and unauthorized-tool-attempt rate.
- Run evals in CI for every change: model version, prompt template, tool schema, retrieval settings, policy rules.

3) OBSERVABILITY (RUNTIME)
- Log every run with: model/provider, model version, prompt/template hash, tool schema versions, retrieved doc IDs, tool inputs/outputs, policy decisions, and latency breakdown.
- Ensure runs are reproducible: store configuration snapshots and correlation IDs across services.
- Establish dashboards: p95 latency, run success rate, tool failure rate, blocked actions, retries per run, tokens per run.

4) POLICY-AS-CODE GUARDRAILS
- Implement deterministic authorization for each tool call (outside the model): role, tenant, amount thresholds, domains, time windows.
- Use scoped credentials per tool; never give the agent a "god token."
- Add step limits: max tool calls per run, max tokens per run, and retry budgets per tool.
- Create a "safe failure" path: when an action is denied, return an explanation plus an alternative (request approval, ask the user, or escalate).

5) HUMAN-IN-THE-LOOP (ONLY WHERE NEEDED)
- Define which actions require approval (e.g., refunds > $50, outreach to external domains, admin changes).
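The deterministic authorization and approval-threshold checks from sections 4–5 can be sketched as a small policy function that runs outside the model, before any tool call. This is a minimal sketch: the action catalog, the "operator" role, and the $50 refund threshold are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

# Hypothetical action catalog mapped to the severity tiers from section 1.
SEVERITY = {
    "issue_refund": "P0",
    "read_customer_record": "P1",
    "send_customer_email": "P2",
    "post_internal_note": "P3",
}
REFUND_APPROVAL_THRESHOLD = 50.00  # dollars; refunds above this need a human

@dataclass
class Decision:
    allowed: bool
    needs_approval: bool = False
    reason: str = ""

def authorize(action: str, params: dict, role: str) -> Decision:
    """Deterministic check performed outside the model, before the tool runs."""
    tier = SEVERITY.get(action)
    if tier is None:
        # Default-deny: an action missing from the catalog is never executed.
        return Decision(False, reason=f"unknown action: {action}")
    if tier == "P0" and role != "operator":
        return Decision(False, reason="P0 actions require operator role")
    if action == "issue_refund" and params.get("amount", 0) > REFUND_APPROVAL_THRESHOLD:
        # Safe-failure path: deny autonomous execution but offer the approval route.
        return Decision(False, needs_approval=True,
                        reason=f"refund over ${REFUND_APPROVAL_THRESHOLD:.2f} "
                               "requires human approval")
    return Decision(True)
```

The point of keeping this outside the model is that the check is auditable and cannot be talked out of its decision by a prompt injection.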
- Build an approval UI that shows: proposed action, parameters, evidence/citations, and policy reason.
- Track review load: after iteration, target fewer than 2–5% of total runs requiring human approval.

6) RELEASE MANAGEMENT
- Add canary rollouts (1–5% of traffic) for new prompts/models, with automatic rollback on regression.
- Maintain an incident runbook: how to disable a tool, rotate credentials, and freeze high-risk actions.
- Set SLOs and error budgets per workflow; page on unsafe execution, not just latency.

7) COST CONTROLS
- Set budgets: tokens per run, max steps, and monthly spend per tenant/workflow.
- Route cheaply: use small models for classification/routing; reserve large reasoning models for hard cases.
- Cache retrieval and tool results where safe; avoid repeated expensive calls.

EXIT CRITERIA (MINIMUM FOR EXTERNAL LAUNCH)
- 100% of runs traceable with versioned configs.
- Unsafe execution rate approximately zero; unsafe attempts are blocked and logged.
- Stable p95 latency target met for the workflow.
- Eval suite running in CI with clear pass thresholds by severity.
- Clear rollback and tool-disable procedures, tested at least once.
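The per-run step and token ceilings from sections 4 and 7 can be enforced with a simple counter charged on every agent step. A minimal sketch; the default limits and the `charge` interface are illustrative assumptions, not recommended values.

```python
class BudgetExceeded(Exception):
    """Raised when a run exhausts its step or token budget."""

class RunBudget:
    """Tracks per-run spend against hard ceilings (sections 4 and 7)."""

    def __init__(self, max_tool_calls: int = 10, max_tokens: int = 50_000):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.tool_calls = 0
        self.tokens = 0

    def charge(self, tool_calls: int = 0, tokens: int = 0) -> None:
        """Record spend for one step; raise if either ceiling is breached."""
        self.tool_calls += tool_calls
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool-call limit {self.max_tool_calls} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")
```

Calling `charge` before each tool invocation gives the agent loop a hard stop, so a runaway run fails into the same safe-failure path as a denied action instead of burning spend indefinitely.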