Production Agent Readiness Checklist (2026)

Use this checklist before you ship any agent that takes actions (writes to systems, sends customer messages, triggers refunds, changes records).

1) Scope & Ownership
- Define one workflow with a clear “done” state (e.g., ticket resolved, invoice generated).
- Name an accountable owner (PM/ops lead) and a technical owner (eng lead).
- Document what the agent is NOT allowed to do (explicit prohibitions).

2) Tools & Permissions
- List every tool/API the agent can call, with allowed methods and fields.
- Implement least-privilege access (separate service accounts; environment scoping).
- Set approval thresholds for irreversible actions (e.g., refunds > $200 require human).
- Add rate limits and “max tool calls per job” (typical: 3–10).

3) Data & Retrieval (if using RAG)
- Create a data contract per source: owner, refresh cadence, retention, access rules.
- Track freshness lag (update → index availability) and set an SLA.
- Use hybrid retrieval when proper nouns/IDs matter (sparse + dense).
- Require citations for policy decisions and customer-facing factual claims.

4) Observability & Audit
- Emit a trace per job: inputs, retrieved docs, tool calls, outputs, latency, cost.
- Store tool-call logs in an auditable system (SIEM-friendly if enterprise).
- Add a “job ledger” record: model_calls, tokens, tool_calls, wall_time, cost_usd, outcome.

5) Evals & Release Discipline
- Build an eval set of 200–1,000 real cases (include edge cases).
- Define pass/fail rubrics and label reasons (retrieval miss, tool error, policy violation).
- Run regression evals on every change: model, prompt, retrieval index, tool schema.
- Use canary releases (start 1–5% traffic) and maintain rollback capability.

6) Safety & Verification
- Add deterministic validators: JSON schema, regex, business rules, forbidden content filters.
- Implement a verifier step for high-risk actions (second model or rule-based checks).
- Define escalation rules: low confidence, missing citation, policy ambiguity → human.

7) Economics & KPIs
- Measure cost per completed outcome (not cost per token).
- Track completion rate, critical error rate, human escalation rate, and p95 latency.
- Model human review cost (fully loaded $/min) and include it in unit economics.
- Set “stop-ship” thresholds (e.g., critical error rate > 2% or latency p95 > 15s).

Go/No-Go Rule of Thumb
Ship to broader traffic only when you can show 2+ weeks of stable metrics on the canary cohort and you can explain every major failure mode with trace evidence.
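
Example: from job ledger to stop-ship check
To make the job ledger (section 4) and the KPIs and stop-ship thresholds (section 7) concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a reference implementation: the names JobLedgerRecord, summarize, and stop_ship_reasons, the outcome labels, the review_minutes field, and the nearest-rank percentile method are all hypothetical choices. Only the ledger field names and the example thresholds (2% critical error rate, 15s p95 latency) come from the checklist itself.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class JobLedgerRecord:
    """One ledger row per job; the first six fields follow the checklist."""
    model_calls: int
    tokens: int
    tool_calls: int
    wall_time: float             # seconds (unit is an assumption)
    cost_usd: float              # model and tool spend for the job
    outcome: str                 # assumed labels: "completed", "escalated", "critical_error"
    review_minutes: float = 0.0  # hypothetical extra field: human review time


def p95(values):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(records, review_usd_per_min):
    """Compute the section 7 KPIs from a batch of ledger records."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one ledger record")
    completed = sum(r.outcome == "completed" for r in records)
    agent_cost = sum(r.cost_usd for r in records)
    review_cost = review_usd_per_min * sum(r.review_minutes for r in records)
    return {
        "completion_rate": completed / n,
        "critical_error_rate": sum(r.outcome == "critical_error" for r in records) / n,
        "escalation_rate": sum(r.outcome == "escalated" for r in records) / n,
        "p95_latency_s": p95([r.wall_time for r in records]),
        # Cost per completed outcome, not per token: all spend, including
        # human review, divided by the jobs that actually finished.
        "cost_per_completed_usd": (
            (agent_cost + review_cost) / completed if completed else math.inf
        ),
    }


def stop_ship_reasons(kpis, max_critical_error_rate=0.02, max_p95_latency_s=15.0):
    """Return the breached thresholds; an empty list means no stop-ship."""
    reasons = []
    if kpis["critical_error_rate"] > max_critical_error_rate:
        reasons.append(
            f"critical error rate {kpis['critical_error_rate']:.1%} "
            f"exceeds {max_critical_error_rate:.1%}"
        )
    if kpis["p95_latency_s"] > max_p95_latency_s:
        reasons.append(
            f"p95 latency {kpis['p95_latency_s']:.1f}s exceeds {max_p95_latency_s:.1f}s"
        )
    return reasons


if __name__ == "__main__":
    # Made-up sample batch, for illustration only.
    batch = (
        [JobLedgerRecord(2, 1500, 3, 4.0, 0.05, "completed")] * 17
        + [JobLedgerRecord(3, 2200, 5, 6.0, 0.08, "escalated", review_minutes=3.0)] * 2
        + [JobLedgerRecord(4, 3000, 9, 20.0, 0.10, "critical_error")]
    )
    kpis = summarize(batch, review_usd_per_min=1.5)
    print(kpis)
    print(stop_ship_reasons(kpis) or "no stop-ship thresholds breached")
```

Folding review minutes into the same record keeps human cost in the unit economics by construction, so cost per completed outcome cannot quietly exclude it. In practice the thresholds would likely live in configuration and the check would run against the canary cohort's ledger before each traffic increase.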