Production AI Agent Readiness Checklist (2026)

Use this checklist before you roll an AI agent beyond a pilot. It's written for founders, engineering leads, and operators who need reliability, cost control, and auditability.

1) Scope & Success Criteria
- Define ONE initial intent (e.g., "refund under $50" or "reset password for SSO users").
- Specify success as a business metric: % auto-resolved, cost per resolved task, and p95 latency.
- Decide what the agent must never do (e.g., change billing plan, send external emails, access payroll).

2) Workflow Architecture
- Represent behavior as a workflow graph (states like intake → retrieve → plan → execute → verify → finalize); see the state-machine sketch after the checklist.
- Add explicit stop conditions and retry limits (no infinite loops).
- Implement fallbacks: a smaller model, reduced retrieval, or human escalation.

3) Tools, Permissions, and Schemas
- Wrap every tool behind a typed interface with strict JSON Schema validation (sketch below).
- Use least-privilege service accounts per agent; no shared admin tokens.
- Require step-up authorization for high-risk actions (refunds, account changes, security actions).

4) Budgeting & Model Tiering
- Set a hard maximum cost per run and per user per day, counting both tokens and tool calls (see the budget-guard sketch below).
- Route cheap tasks to smaller models; reserve premium models for complex planning.
- Track cost per successful outcome, not tokens per message.

5) Retrieval & Data Access
- Permission-aware retrieval: ensure the agent can only fetch documents the requesting user can access (sketch below).
- Store retrieval IDs and citations; require citations for policy/knowledge answers.
- Redact or tokenize sensitive fields (PII, secrets) before sending anything to a model provider.

6) Runtime Verification
- Add a verifier stage that checks constraints (citations present, correct account, totals match); see the verifier sketch below.
- Prefer deterministic checks where possible (schema validation, SQL recomputation, checksum comparisons).
- Define escalation rules: low confidence, failed verification, tool errors, or budget exceeded.

7) Evaluation Harness
- Build a golden set of 100–300 real historical tasks (sanitized) with expected outputs and tool calls; a minimal harness is sketched below.
- Track: success rate, hallucination rate, escalation rate, and regressions after changes.
- Version prompts, tools, and indices; treat prompt changes like deployments.

8) Observability & Operations
- Dashboard: volume, success, escalation, p50/p95 latency, tool error rate, and cost per success.
- Store replayable traces (with redaction) for incident reproduction.
- Define SLOs and an incident playbook (mitigations: disable risky tools, force the fallback model, reduce retrieval breadth).

9) Compliance & Audit
- Log every tool call (parameters + response hash), retrieval IDs, and final outputs; see the audit-log sketch below.
- Set a retention policy (e.g., full traces for 30–90 days; hashed summaries retained longer).
- Document data flows for legal/security review: what data is sent to which providers, and why.

10) Launch Plan (30 days)
- Shadow mode first: generate recommendations without taking action, and compare them to human decisions (sketch below).
- Roll out to 1–5% of traffic with step-up approvals.
- Expand scope only after SLOs hold for 7 consecutive days and regressions are gated by evals.

If you can't answer these three questions, you're not ready to scale:
(1) What is the maximum cost per run?
(2) What is the measurable success rate on a golden set?
(3) Can you produce an audit trail of what the agent accessed and did?
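
For item 2, here is a minimal sketch of a workflow graph with explicit stop conditions and retry limits. The state names mirror the checklist; the `handlers` mapping, `MAX_STEPS`, and `MAX_RETRIES` values are illustrative assumptions, not a specific framework.

```python
# Sketch: a workflow loop that can never run forever. Each state handler is a
# hypothetical callable returning (next_state, ok); real handlers would call
# models, tools, or retrieval.
from enum import Enum, auto


class State(Enum):
    INTAKE = auto()
    RETRIEVE = auto()
    PLAN = auto()
    EXECUTE = auto()
    VERIFY = auto()
    FINALIZE = auto()
    ESCALATE = auto()  # terminal fallback: hand the task to a human


MAX_STEPS = 20    # hard stop: no run may exceed this many transitions (assumed)
MAX_RETRIES = 2   # per-state retry limit (assumed)


def run_workflow(task, handlers):
    """Drive `task` through the graph until a terminal state is reached."""
    state, steps = State.INTAKE, 0
    retries = {s: 0 for s in State}
    while state not in (State.FINALIZE, State.ESCALATE):
        if steps >= MAX_STEPS:
            return State.ESCALATE        # stop condition: step budget exhausted
        next_state, ok = handlers[state](task)
        steps += 1
        if not ok:
            retries[state] += 1
            if retries[state] > MAX_RETRIES:
                return State.ESCALATE    # retry limit hit: escalate, never loop
            continue                     # retry the same state once more
        state = next_state
    return state
```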
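
For item 3, this sketch wraps a tool behind strict JSON Schema validation using the third-party `jsonschema` package. The refund tool, its schema, and the $50 cap (matching the example intent in item 1) are assumptions for illustration.

```python
# Sketch: validate model-proposed arguments before any side effect runs.
# Requires: pip install jsonschema
from jsonschema import validate, ValidationError

REFUND_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string", "pattern": "^ord_[A-Za-z0-9]+$"},
        "amount_cents": {"type": "integer", "minimum": 1, "maximum": 5000},
        "reason": {"type": "string", "maxLength": 500},
    },
    "required": ["order_id", "amount_cents", "reason"],
    "additionalProperties": False,  # strict: reject unexpected fields
}


def call_refund_tool(args: dict) -> dict:
    """Gate the real tool behind schema validation."""
    try:
        validate(instance=args, schema=REFUND_SCHEMA)
    except ValidationError as e:
        # Reject the call; surface the error to the planner or escalate.
        return {"ok": False, "error": f"schema violation: {e.message}"}
    # ... invoke the real payment API here, under a least-privilege token ...
    return {"ok": True, "refunded_cents": args["amount_cents"]}
```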
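
For item 4, here is one way to enforce a hard cost cap per run that charges for both tokens and tool calls. The price table, the $0.50 ceiling, and the `BudgetExceeded` policy are assumed examples; plug in your real rates.

```python
# Sketch: a per-run budget guard. The caller catches BudgetExceeded and
# triggers the fallback or escalation path from item 2.
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    pass


@dataclass
class RunBudget:
    max_usd: float = 0.50          # hard ceiling per run (assumed)
    spent_usd: float = 0.0
    prices: dict = field(default_factory=lambda: {
        "small_model_per_1k_tokens": 0.0005,   # assumed example prices
        "large_model_per_1k_tokens": 0.01,
        "tool_call": 0.002,
    })

    def charge(self, item: str, qty: float = 1.0) -> None:
        self.spent_usd += self.prices[item] * qty
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} > cap ${self.max_usd}")

    def pick_model(self, task_is_complex: bool) -> str:
        # Tiering: reserve the premium model for complex planning steps.
        return "large_model" if task_is_complex else "small_model"
```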
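
For item 5, this sketch filters retrieval results by the requesting user's ACL and redacts PII before anything reaches a model provider. The ACL shape and the two regex patterns are assumptions; production redaction needs a far more complete detector.

```python
# Sketch: permission-aware retrieval plus basic redaction. Retrieval IDs are
# kept alongside text so answers can cite their sources (item 5, bullet 2).
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def permission_filter(docs: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only documents whose ACL intersects the user's groups."""
    return [d for d in docs if user_groups & set(d["acl_groups"])]


def redact(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


def build_context(docs: list[dict], user_groups: set[str]) -> list[dict]:
    allowed = permission_filter(docs, user_groups)
    return [{"id": d["id"], "text": redact(d["text"])} for d in allowed]
```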
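
For item 6, here is a minimal deterministic verifier that gates finalization. The draft-answer fields (`citations`, `account_id`, `total_cents`) are assumed shapes for illustration; the three checks map to the constraints named in the checklist.

```python
# Sketch: deterministic verification before any action is finalized.
# Any failure routes to a human (item 6, bullet 3), never a silent retry loop.
def verify(draft: dict, task: dict) -> tuple[bool, list[str]]:
    failures = []
    # Citations must exist and be drawn from what was actually retrieved.
    if not draft.get("citations"):
        failures.append("no citations")
    elif not set(draft["citations"]) <= set(task["retrieved_ids"]):
        failures.append("citation not in retrieved set")
    # The action must target the account the task was opened for.
    if draft.get("account_id") != task["account_id"]:
        failures.append("account mismatch")
    # Recompute totals deterministically instead of trusting the model.
    expected = sum(item["amount_cents"] for item in task["line_items"])
    if draft.get("total_cents") != expected:
        failures.append("total mismatch")
    return (not failures, failures)
```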
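
For item 7, this sketch scores an agent build against a golden set and flags regressions so a deploy can be gated. The JSON Lines file layout, `run_agent` callable, and record fields are assumptions.

```python
# Sketch: a minimal eval harness. One sanitized historical task per line,
# each with the action a human actually took.
import json


def evaluate(golden_path: str, run_agent, baseline_success: float) -> dict:
    with open(golden_path) as f:
        records = [json.loads(line) for line in f]
    successes = escalations = 0
    for rec in records:
        out = run_agent(rec["input"])
        if out["action"] == "escalate":
            escalations += 1
        elif out["action"] == rec["expected_action"]:
            successes += 1
    n = len(records)
    return {
        "success_rate": successes / n,
        "escalation_rate": escalations / n,
        "regressed": successes / n < baseline_success,  # gate deploys on this
    }
```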
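
For item 9, here is one shape an audit record per tool call could take: parameters plus a hash of the response, so bulky payloads need not be retained verbatim while the trail stays verifiable. The file layout and field names are assumptions.

```python
# Sketch: an append-only JSON Lines audit log. Parameters are assumed to be
# redacted upstream (item 5) before they are written here.
import hashlib
import json
import time


def audit_tool_call(run_id: str, tool: str, params: dict, response: bytes,
                    path: str = "audit.jsonl") -> None:
    entry = {
        "ts": time.time(),
        "run_id": run_id,
        "tool": tool,
        "params": params,
        "response_sha256": hashlib.sha256(response).hexdigest(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```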
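
Finally, for item 10, a sketch of shadow mode: the agent proposes, takes no action, and its proposal is logged next to the human's eventual decision. The `propose` callable and ticket fields are hypothetical.

```python
# Sketch: shadow-mode comparison. Agreement rate over shadow traffic is the
# go/no-go signal for the 1–5% rollout.
def shadow_run(ticket: dict, propose, log) -> None:
    proposal = propose(ticket)    # agent output; side effects disabled
    log({
        "ticket_id": ticket["id"],
        "agent_action": proposal["action"],
        "human_action": ticket["resolved_action"],  # filled in by the human
        "agreed": proposal["action"] == ticket["resolved_action"],
    })
```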