AGENT RELIABILITY LAUNCH CHECKLIST (30 DAYS)

Goal: Ship one production AI agent workflow with measurable quality, bounded risk, and an audit trail.

1) Scope & Risk (Days 1–3)
- Pick ONE workflow with clear ROI (e.g., support triage, CRM enrichment).
- Define the blast radius: systems touched, data classes (PII/PCI/PHI), money movement.
- Write 10–20 invariants ("must never" rules). Examples:
  * Never send external messages without approval.
  * Never access bulk export endpoints.
  * Never modify permissions or groups.
- Assign an autonomy tier (draft-only → auto for low-risk → human gate for high-risk).

2) Tooling & Permissions (Days 4–10)
- Create a tool registry: name, description, schema, owner, and risk tier.
- Enforce least privilege with a dedicated service account per agent/workflow.
- Add hard budget caps: max tool calls per task, max tokens, max wall-clock time.
- Implement threshold approvals (e.g., refunds > $25 require a human; > $250 require a manager).
- Build a kill switch: instantly disable all actions or downgrade to draft-only.

3) Policy & Guardrails (Days 11–15)
- Enforce policy checks outside the model (RBAC, allow-lists, OPA rules).
- Quarantine untrusted text: extract entities into structured fields before planning actions.
- Add deterministic validators per tool (entity existence, amount bounds, state checks).
- Log every proposed action, blocked action (with reason), and executed action.

4) Evaluation (Days 16–22)
- Assemble a golden set (200–500 real cases; scrub PII).
- Define metrics:
  * Task success rate
  * Tool-call semantic accuracy (right customer/object)
  * Policy violation rate
  * Cost per successful task
  * Escalation rate / containment rate
- Set initial targets and error budgets (what's acceptable to ship).
- Run shadow mode in production and compare agent output to human outcomes.

5) Release & Operations (Days 23–30)
- Progressive rollout: internal → 5% → 25% → 100%, with rollback criteria at each stage.
- Create an on-call rotation and runbook:
  * How to disable actions
  * How to revert the model, prompt, or tool schema
  * How to replay traces for root-cause analysis (RCA)
- Set alerts: spikes in blocked actions, calls to unusual tool endpoints, retry loops, latency regressions.
- Schedule a weekly review: top failures, policy gaps, test additions.

Definition of Done
- You can answer: What changed? What did quality do? What did it cost? What actions were taken? Who approved them? Where is the audit trail?
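The budget caps, threshold approvals, and audit logging from steps 2 and 3 can be combined in a single gate that every tool call passes through. A minimal sketch follows; all names (ToolGate, check, audit_log) are illustrative, not from any specific framework, and a real gate would also enforce RBAC and per-tool validators.

```python
# Illustrative sketch: a gate that enforces a hard tool-call budget,
# routes refund amounts to the right approver, and records every
# decision in an audit log. Names and thresholds are assumptions.
from dataclasses import dataclass, field

@dataclass
class ToolGate:
    max_tool_calls: int = 20          # hard budget cap per task
    human_threshold: float = 25.0     # refunds above this need a human
    manager_threshold: float = 250.0  # refunds above this need a manager
    calls_made: int = 0
    audit_log: list = field(default_factory=list)

    def check(self, tool: str, amount: float = 0.0) -> str:
        """Return 'allow', 'needs_human', 'needs_manager', or 'block'."""
        if self.calls_made >= self.max_tool_calls:
            decision = "block"  # budget exhausted: fail closed
        elif tool == "refund" and amount > self.manager_threshold:
            decision = "needs_manager"
        elif tool == "refund" and amount > self.human_threshold:
            decision = "needs_human"
        else:
            decision = "allow"
            self.calls_made += 1
        # Log proposed and blocked actions alike, with the reason.
        self.audit_log.append(
            {"tool": tool, "amount": amount, "decision": decision}
        )
        return decision

gate = ToolGate()
print(gate.check("lookup_customer"))      # allow
print(gate.check("refund", amount=40.0))  # needs_human
print(gate.check("refund", amount=300.0)) # needs_manager
```

Keeping this logic in plain code outside the model makes it deterministic and testable, and the audit log gives the "who approved what" trail the Definition of Done asks for.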
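The evaluation metrics in step 4 reduce to simple counts over the golden set. A sketch, assuming each eval record carries success, violation, cost, and escalation fields (the field names are hypothetical):

```python
# Illustrative sketch: computing step-4 metrics over golden-set results.
# Record fields (success, violations, cost_usd, escalated) are assumptions.
records = [
    {"success": True,  "violations": 0, "cost_usd": 0.04, "escalated": False},
    {"success": True,  "violations": 0, "cost_usd": 0.06, "escalated": True},
    {"success": False, "violations": 1, "cost_usd": 0.09, "escalated": True},
]

n = len(records)
successes = sum(r["success"] for r in records)

task_success_rate = successes / n
policy_violation_rate = sum(r["violations"] > 0 for r in records) / n
escalation_rate = sum(r["escalated"] for r in records) / n
# Cost is divided by successes, not attempts: failed tasks still cost money.
cost_per_successful_task = sum(r["cost_usd"] for r in records) / max(successes, 1)
```

Computing cost per successful task (rather than per attempt) keeps the metric honest when failure rates spike: a cheap agent that fails often is not cheap.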