AGENT RELIABILITY LAUNCH CHECKLIST (30 DAYS)

Goal: Ship one production AI agent workflow with measurable quality, bounded risk, and an audit trail.

1) Scope & Risk (Days 1–3)
- Pick ONE workflow with clear ROI (e.g., support triage, CRM enrichment).
- Define the blast radius: systems touched, data classes (PII/PCI/PHI), money movement.
- Write 10–20 invariants ("must never" rules). Examples:
  * Never send external messages without approval.
  * Never access bulk export endpoints.
  * Never modify permissions or groups.
- Assign an autonomy tier (draft-only → auto for low-risk → human gate for high-risk).

2) Tooling & Permissions (Days 4–10)
- Create a tool registry: name, description, schema, owner, and risk tier.
- Enforce least privilege with a dedicated service account per agent/workflow.
- Add hard budget caps: max tool calls per task, max tokens, max wall-clock time.
- Implement threshold approvals (e.g., refunds > $25 require a human; > $250 require a manager).
- Build a kill switch: instantly disable all actions or downgrade to draft-only.

3) Policy & Guardrails (Days 11–15)
- Enforce policy checks outside the model (RBAC, allow-lists, OPA rules).
- Quarantine untrusted text: extract entities into structured fields before planning actions.
- Add deterministic validators per tool (entity existence, amount bounds, state checks).
- Log every proposed action, blocked action (with reason), and executed action.

4) Evaluation (Days 16–22)
- Assemble a golden set (200–500 real cases; scrub PII).
- Define metrics:
  * Task success rate
  * Tool-call semantic accuracy (right customer/object)
  * Policy violation rate
  * Cost per successful task
  * Escalation rate / containment rate
- Set initial targets and error budgets (what's acceptable to ship).
- Run shadow mode in production and compare agent output to human outcomes.

5) Release & Operations (Days 23–30)
- Progressive rollout: internal → 5% → 25% → 100%, with rollback criteria at each stage.
- Create an on-call rotation and runbook:
  * How to disable actions
  * How to revert the model, prompt, or tool schema
  * How to replay traces for root-cause analysis (RCA)
- Set alerts: spikes in blocked actions, calls to unusual tool endpoints, retry loops, latency regressions.
- Schedule a weekly review: top failures, policy gaps, test additions.

Definition of Done
- You can answer: What changed? What did quality do? What did it cost? What actions were taken? Who approved them? Where is the audit trail?
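The budget caps, threshold approvals, and audit logging from steps 2 and 3 can be combined in a single gate that every tool call passes through. A minimal sketch follows; all names (ToolGate, check, audit_log) are illustrative, not from any specific framework, and a real gate would also enforce RBAC and per-tool validators.

```python
# Illustrative sketch: a gate that enforces a hard tool-call budget,
# routes refund amounts to the right approver, and records every
# decision in an audit log. Names and thresholds are assumptions.
from dataclasses import dataclass, field

@dataclass
class ToolGate:
    max_tool_calls: int = 20          # hard budget cap per task
    human_threshold: float = 25.0     # refunds above this need a human
    manager_threshold: float = 250.0  # refunds above this need a manager
    calls_made: int = 0
    audit_log: list = field(default_factory=list)

    def check(self, tool: str, amount: float = 0.0) -> str:
        """Return 'allow', 'needs_human', 'needs_manager', or 'block'."""
        if self.calls_made >= self.max_tool_calls:
            decision = "block"  # budget exhausted: fail closed
        elif tool == "refund" and amount > self.manager_threshold:
            decision = "needs_manager"
        elif tool == "refund" and amount > self.human_threshold:
            decision = "needs_human"
        else:
            decision = "allow"
            self.calls_made += 1
        # Log proposed and blocked actions alike, with the reason.
        self.audit_log.append(
            {"tool": tool, "amount": amount, "decision": decision}
        )
        return decision

gate = ToolGate()
print(gate.check("lookup_customer"))      # allow
print(gate.check("refund", amount=40.0))  # needs_human
print(gate.check("refund", amount=300.0)) # needs_manager
```

Keeping this logic in plain code outside the model makes it deterministic and testable, and the audit log gives the "who approved what" trail the Definition of Done asks for.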
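The evaluation metrics in step 4 reduce to simple counts over the golden set. A sketch, assuming each eval record carries success, violation, cost, and escalation fields (the field names are hypothetical):

```python
# Illustrative sketch: computing step-4 metrics over golden-set results.
# Record fields (success, violations, cost_usd, escalated) are assumptions.
records = [
    {"success": True,  "violations": 0, "cost_usd": 0.04, "escalated": False},
    {"success": True,  "violations": 0, "cost_usd": 0.06, "escalated": True},
    {"success": False, "violations": 1, "cost_usd": 0.09, "escalated": True},
]

n = len(records)
successes = sum(r["success"] for r in records)

task_success_rate = successes / n
policy_violation_rate = sum(r["violations"] > 0 for r in records) / n
escalation_rate = sum(r["escalated"] for r in records) / n
# Cost is divided by successes, not attempts: failed tasks still cost money.
cost_per_successful_task = sum(r["cost_usd"] for r in records) / max(successes, 1)
```

Computing cost per successful task (rather than per attempt) keeps the metric honest when failure rates spike: a cheap agent that fails often is not cheap.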