Production AI Agent Readiness Checklist (2026)

Use this checklist before you roll an AI agent beyond a pilot. It's written for founders, engineering leads, and operators who need reliability, cost control, and auditability.

1) Scope & Success Criteria
- Define ONE initial intent (e.g., "refund under $50" or "reset password for SSO users").
- Specify success as a business metric: % auto-resolved, cost per resolved task, and p95 latency.
- Decide what the agent must never do (e.g., change billing plan, send external emails, access payroll).

2) Workflow Architecture
- Represent behavior as a workflow graph (states like intake → retrieve → plan → execute → verify → finalize); see the state-machine sketch after the checklist.
- Add explicit stop conditions and retry limits (no infinite loops).
- Implement fallbacks: a smaller model, reduced retrieval, or human escalation.

3) Tools, Permissions, and Schemas
- Wrap every tool behind a typed interface with strict JSON Schema validation (sketch below).
- Use least-privilege service accounts per agent; no shared admin tokens.
- Require step-up authorization for high-risk actions (refunds, account changes, security actions).

4) Budgeting & Model Tiering
- Set a hard maximum cost per run and per user per day, counting both tokens and tool calls (see the budget-guard sketch below).
- Route cheap tasks to smaller models; reserve premium models for complex planning.
- Track cost per successful outcome, not tokens per message.

5) Retrieval & Data Access
- Permission-aware retrieval: ensure the agent can only fetch documents the requesting user can access (sketch below).
- Store retrieval IDs and citations; require citations for policy/knowledge answers.
- Redact or tokenize sensitive fields (PII, secrets) before sending anything to a model provider.

6) Runtime Verification
- Add a verifier stage that checks constraints (citations present, correct account, totals match); see the verifier sketch below.
- Prefer deterministic checks where possible (schema validation, SQL recomputation, checksum comparisons).
- Define escalation rules: low confidence, failed verification, tool errors, or budget exceeded.

7) Evaluation Harness
- Build a golden set of 100–300 real historical tasks (sanitized) with expected outputs and tool calls; a minimal harness is sketched below.
- Track: success rate, hallucination rate, escalation rate, and regressions after changes.
- Version prompts, tools, and indices; treat prompt changes like deployments.

8) Observability & Operations
- Dashboard: volume, success, escalation, p50/p95 latency, tool error rate, and cost per success.
- Store replayable traces (with redaction) for incident reproduction.
- Define SLOs and an incident playbook (mitigations: disable risky tools, force the fallback model, reduce retrieval breadth).

9) Compliance & Audit
- Log every tool call (parameters + response hash), retrieval IDs, and final outputs; see the audit-log sketch below.
- Set a retention policy (e.g., full traces for 30–90 days; hashed summaries retained longer).
- Document data flows for legal/security review: what data is sent to which providers, and why.

10) Launch Plan (30 days)
- Shadow mode first: generate recommendations without taking action, and compare them to human decisions (sketch below).
- Roll out to 1–5% of traffic with step-up approvals.
- Expand scope only after SLOs hold for 7 consecutive days and regressions are gated by evals.

If you can't answer these three questions, you're not ready to scale:
(1) What is the maximum cost per run?
(2) What is the measurable success rate on a golden set?
(3) Can you produce an audit trail of what the agent accessed and did?
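
For item 2, here is a minimal sketch of a workflow graph with explicit stop conditions and retry limits. The state names mirror the checklist; the `handlers` mapping, `MAX_STEPS`, and `MAX_RETRIES` values are illustrative assumptions, not a specific framework.

```python
# Sketch: a workflow loop that can never run forever. Each state handler is a
# hypothetical callable returning (next_state, ok); real handlers would call
# models, tools, or retrieval.
from enum import Enum, auto


class State(Enum):
    INTAKE = auto()
    RETRIEVE = auto()
    PLAN = auto()
    EXECUTE = auto()
    VERIFY = auto()
    FINALIZE = auto()
    ESCALATE = auto()  # terminal fallback: hand the task to a human


MAX_STEPS = 20    # hard stop: no run may exceed this many transitions (assumed)
MAX_RETRIES = 2   # per-state retry limit (assumed)


def run_workflow(task, handlers):
    """Drive `task` through the graph until a terminal state is reached."""
    state, steps = State.INTAKE, 0
    retries = {s: 0 for s in State}
    while state not in (State.FINALIZE, State.ESCALATE):
        if steps >= MAX_STEPS:
            return State.ESCALATE        # stop condition: step budget exhausted
        next_state, ok = handlers[state](task)
        steps += 1
        if not ok:
            retries[state] += 1
            if retries[state] > MAX_RETRIES:
                return State.ESCALATE    # retry limit hit: escalate, never loop
            continue                     # retry the same state once more
        state = next_state
    return state
```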
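
For item 3, this sketch wraps a tool behind strict JSON Schema validation using the third-party `jsonschema` package. The refund tool, its schema, and the $50 cap (matching the example intent in item 1) are assumptions for illustration.

```python
# Sketch: validate model-proposed arguments before any side effect runs.
# Requires: pip install jsonschema
from jsonschema import validate, ValidationError

REFUND_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string", "pattern": "^ord_[A-Za-z0-9]+$"},
        "amount_cents": {"type": "integer", "minimum": 1, "maximum": 5000},
        "reason": {"type": "string", "maxLength": 500},
    },
    "required": ["order_id", "amount_cents", "reason"],
    "additionalProperties": False,  # strict: reject unexpected fields
}


def call_refund_tool(args: dict) -> dict:
    """Gate the real tool behind schema validation."""
    try:
        validate(instance=args, schema=REFUND_SCHEMA)
    except ValidationError as e:
        # Reject the call; surface the error to the planner or escalate.
        return {"ok": False, "error": f"schema violation: {e.message}"}
    # ... invoke the real payment API here, under a least-privilege token ...
    return {"ok": True, "refunded_cents": args["amount_cents"]}
```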
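
For item 4, here is one way to enforce a hard cost cap per run that charges for both tokens and tool calls. The price table, the $0.50 ceiling, and the `BudgetExceeded` policy are assumed examples; plug in your real rates.

```python
# Sketch: a per-run budget guard. The caller catches BudgetExceeded and
# triggers the fallback or escalation path from item 2.
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    pass


@dataclass
class RunBudget:
    max_usd: float = 0.50          # hard ceiling per run (assumed)
    spent_usd: float = 0.0
    prices: dict = field(default_factory=lambda: {
        "small_model_per_1k_tokens": 0.0005,   # assumed example prices
        "large_model_per_1k_tokens": 0.01,
        "tool_call": 0.002,
    })

    def charge(self, item: str, qty: float = 1.0) -> None:
        self.spent_usd += self.prices[item] * qty
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} > cap ${self.max_usd}")

    def pick_model(self, task_is_complex: bool) -> str:
        # Tiering: reserve the premium model for complex planning steps.
        return "large_model" if task_is_complex else "small_model"
```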
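
For item 5, this sketch filters retrieval results by the requesting user's ACL and redacts PII before anything reaches a model provider. The ACL shape and the two regex patterns are assumptions; production redaction needs a far more complete detector.

```python
# Sketch: permission-aware retrieval plus basic redaction. Retrieval IDs are
# kept alongside text so answers can cite their sources (item 5, bullet 2).
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def permission_filter(docs: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only documents whose ACL intersects the user's groups."""
    return [d for d in docs if user_groups & set(d["acl_groups"])]


def redact(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


def build_context(docs: list[dict], user_groups: set[str]) -> list[dict]:
    allowed = permission_filter(docs, user_groups)
    return [{"id": d["id"], "text": redact(d["text"])} for d in allowed]
```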
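
For item 6, here is a minimal deterministic verifier that gates finalization. The draft-answer fields (`citations`, `account_id`, `total_cents`) are assumed shapes for illustration; the three checks map to the constraints named in the checklist.

```python
# Sketch: deterministic verification before any action is finalized.
# Any failure routes to a human (item 6, bullet 3), never a silent retry loop.
def verify(draft: dict, task: dict) -> tuple[bool, list[str]]:
    failures = []
    # Citations must exist and be drawn from what was actually retrieved.
    if not draft.get("citations"):
        failures.append("no citations")
    elif not set(draft["citations"]) <= set(task["retrieved_ids"]):
        failures.append("citation not in retrieved set")
    # The action must target the account the task was opened for.
    if draft.get("account_id") != task["account_id"]:
        failures.append("account mismatch")
    # Recompute totals deterministically instead of trusting the model.
    expected = sum(item["amount_cents"] for item in task["line_items"])
    if draft.get("total_cents") != expected:
        failures.append("total mismatch")
    return (not failures, failures)
```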
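
For item 7, this sketch scores an agent build against a golden set and flags regressions so a deploy can be gated. The JSON Lines file layout, `run_agent` callable, and record fields are assumptions.

```python
# Sketch: a minimal eval harness. One sanitized historical task per line,
# each with the action a human actually took.
import json


def evaluate(golden_path: str, run_agent, baseline_success: float) -> dict:
    with open(golden_path) as f:
        records = [json.loads(line) for line in f]
    successes = escalations = 0
    for rec in records:
        out = run_agent(rec["input"])
        if out["action"] == "escalate":
            escalations += 1
        elif out["action"] == rec["expected_action"]:
            successes += 1
    n = len(records)
    return {
        "success_rate": successes / n,
        "escalation_rate": escalations / n,
        "regressed": successes / n < baseline_success,  # gate deploys on this
    }
```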
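
For item 9, here is one shape an audit record per tool call could take: parameters plus a hash of the response, so bulky payloads need not be retained verbatim while the trail stays verifiable. The file layout and field names are assumptions.

```python
# Sketch: an append-only JSON Lines audit log. Parameters are assumed to be
# redacted upstream (item 5) before they are written here.
import hashlib
import json
import time


def audit_tool_call(run_id: str, tool: str, params: dict, response: bytes,
                    path: str = "audit.jsonl") -> None:
    entry = {
        "ts": time.time(),
        "run_id": run_id,
        "tool": tool,
        "params": params,
        "response_sha256": hashlib.sha256(response).hexdigest(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```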
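
Finally, for item 10, a sketch of shadow mode: the agent proposes, takes no action, and its proposal is logged next to the human's eventual decision. The `propose` callable and ticket fields are hypothetical.

```python
# Sketch: shadow-mode comparison. Agreement rate over shadow traffic is the
# go/no-go signal for the 1–5% rollout.
def shadow_run(ticket: dict, propose, log) -> None:
    proposal = propose(ticket)    # agent output; side effects disabled
    log({
        "ticket_id": ticket["id"],
        "agent_action": proposal["action"],
        "human_action": ticket["resolved_action"],  # filled in by the human
        "agreed": proposal["action"] == ticket["resolved_action"],
    })
```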