AI Agent Production Readiness Checklist (2026)

Use this checklist as a launch gate for any tool-using (action-taking) agent. It's designed for founders, engineering leaders, and operators who need a system that is safe, measurable, and financially predictable.

1) Scope & ROI
- Define ONE workflow with a clear "done" state (e.g., refund eligibility decision, invoice match, lead enrichment).
- Define success metrics with thresholds (examples): 20% cycle-time reduction, 15% net ticket deflection after reopens, 95% correctness on golden tasks.
- Define a cost envelope: max $/task, max tool calls/task, max tokens/task, max latency/task.

2) Tooling Contracts
- List the minimum tools required; remove "nice-to-have" tools that expand the action space.
- For each tool: JSON schema, required fields, allowed ranges, and explicit error codes.
- Make tools idempotent where possible (safe retries). Add request IDs and trace IDs.

3) Policy & Permissions (Non-negotiable)
- Least-privilege credentials per tool; short-lived tokens (target TTL ≤ 1 hour).
- Policy-as-code rules (OPA/Cedar/etc.) for high-risk actions (money movement, data changes, external comms).
- Human-in-the-loop thresholds (e.g., auto-refund ≤ $50; require approval above).
- Add a kill switch plus a safe fallback path (human queue or read-only mode).

4) Memory & Data Governance
- Separate memory types: task memory (TTL), user preferences (consent + delete), org knowledge (access-controlled).
- Require citations for policy claims and numeric claims.
- Implement retrieval + reranking; add summarization/compression to reduce context size.
- Document retention and deletion behavior; align with customer contracts and compliance.

5) Observability
- Log 100% of tool calls with parameters (redacted if needed), latency, status, and policy decisions.
- Store per-run traces: model version, prompt template version, retrieved doc IDs/versions, cost estimate.
- Create dashboards: pass rate on golden tasks, $/successful outcome, escalation rate, reopen/rollback rate.

6) Evaluation & Release Process
- Build a golden-task suite (start with 50–200 tasks) and run it weekly and on every significant change.
- Track regressions explicitly (e.g., "refund policy: -3% since v3").
- Canary release (5% → 25% → 100%) with automated rollback triggers.
- Run quarterly "game days" to test the kill switch, escalation paths, and audit log integrity.

7) Unit Economics
- Measure cost per successful outcome (not per conversation).
- Implement a model cascade: cheap router + stronger reasoner only when needed.
- Add caching for retrieval and deterministic tool outputs where safe.
- Put hard budgets in the controller: max tool calls, max retries, max tokens.

Exit Criteria (Ready for wider rollout)
- ≥ 95% pass rate on golden tasks for the target workflow.
- Known failure modes are bounded by policy gates (no uncontrolled money/data actions).
- Cost per successful outcome is within the pre-set envelope for 2 consecutive weeks.
- An on-call runbook exists, including rollback steps and escalation ownership.
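The hard budgets from section 7 and the human-in-the-loop threshold from section 3 can be sketched together as a controller guard. This is a minimal illustration, not a specific framework's API: the names `Budget`, `Controller`, and `AUTO_REFUND_LIMIT` are assumptions, and a real system would route `BudgetExceeded` to the human queue or read-only fallback.

```python
from dataclasses import dataclass

# Illustrative per-run hard budgets (section 7).
@dataclass
class Budget:
    max_tool_calls: int = 10
    max_retries: int = 2
    max_tokens: int = 50_000

AUTO_REFUND_LIMIT = 50.00  # HITL threshold from section 3 (illustrative value)

def gate_refund(amount: float) -> str:
    """Policy gate: auto-approve small refunds, escalate everything above."""
    return "auto_approve" if amount <= AUTO_REFUND_LIMIT else "needs_human_approval"

class BudgetExceeded(Exception):
    """Raised when a run exhausts its budget; caller falls back to a human."""

class Controller:
    def __init__(self, budget: Budget):
        self.budget = budget
        self.tool_calls = 0
        self.tokens_used = 0

    def charge_tool_call(self, tokens: int) -> None:
        """Enforce hard limits before each tool call; fail closed."""
        self.tool_calls += 1
        self.tokens_used += tokens
        if self.tool_calls > self.budget.max_tool_calls:
            raise BudgetExceeded("max tool calls reached; route to human queue")
        if self.tokens_used > self.budget.max_tokens:
            raise BudgetExceeded("token budget reached; route to human queue")
```

The key design choice is that the limits live in the controller, outside the model's reach, so no prompt failure can spend past the envelope.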
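The per-call logging requirement in section 5 can be sketched as one structured record per tool call. The field names and the `REDACT` set are illustrative assumptions; the point is that every call carries a trace ID, latency, status, and the policy decision, with sensitive parameters redacted before the record leaves the process.

```python
import json
import time
import uuid

REDACT = {"card_number", "ssn"}  # illustrative list of sensitive parameter names

def log_tool_call(tool: str, params: dict, status: str, latency_ms: float,
                  policy_decision: str, trace_id: str = "") -> dict:
    """Emit one structured trace record per tool call (section 5)."""
    record = {
        "trace_id": trace_id or str(uuid.uuid4()),
        "ts": time.time(),
        "tool": tool,
        # Redact sensitive fields before the record is persisted.
        "params": {k: ("[REDACTED]" if k in REDACT else v)
                   for k, v in params.items()},
        "status": status,
        "latency_ms": latency_ms,
        "policy_decision": policy_decision,
    }
    print(json.dumps(record))  # in production, ship to your log pipeline instead
    return record
```

Because the record is plain JSON, the dashboards in section 5 (pass rate, $/successful outcome, escalation rate) can be computed directly from the log stream.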
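The golden-task gate from section 6 and the ≥ 95% exit criterion can be wired together as a simple release check. The boolean-results format and threshold wiring are assumptions for illustration; a real suite would carry per-task metadata for the regression tracking the checklist calls for.

```python
def pass_rate(results: list) -> float:
    """Fraction of golden tasks the agent completed correctly."""
    return sum(1 for r in results if r) / len(results) if results else 0.0

def release_gate(results: list, threshold: float = 0.95) -> bool:
    """Block canary promotion unless the golden suite meets the threshold."""
    return pass_rate(results) >= threshold
```

Running this gate on every significant change, and again at each canary stage (5% → 25% → 100%), gives the automated rollback trigger a concrete signal.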