AI Agent Production Readiness Checklist (2026)

Use this checklist as a launch gate for any tool-using (action-taking) agent. It's designed for founders, engineering leaders, and operators who need a system that is safe, measurable, and financially predictable.

1) Scope & ROI
- Define ONE workflow with a clear "done" state (e.g., refund eligibility decision, invoice match, lead enrichment).
- Define success metrics with thresholds (examples): 20% cycle-time reduction, 15% net ticket deflection after reopens, 95% correctness on golden tasks.
- Define a cost envelope: max $/task, max tool calls/task, max tokens/task, max latency/task.

2) Tooling Contracts
- List the minimum tools required; remove "nice-to-have" tools that expand the action space.
- For each tool: JSON schema, required fields, allowed ranges, and explicit error codes.
- Make tools idempotent where possible (safe retries). Add request IDs and trace IDs.

3) Policy & Permissions (Non-negotiable)
- Least-privilege credentials per tool; short-lived tokens (target TTL ≤ 1 hour).
- Policy-as-code rules (OPA/Cedar/etc.) for high-risk actions (money movement, data changes, external comms).
- Human-in-the-loop thresholds (e.g., auto-refund ≤ $50; require approval above).
- Add a kill switch plus a safe fallback path (human queue or read-only mode).

4) Memory & Data Governance
- Separate memory types: task memory (TTL), user preferences (consent + delete), org knowledge (access-controlled).
- Require citations for policy claims and numeric claims.
- Implement retrieval + reranking; add summarization/compression to reduce context size.
- Document retention and deletion behavior; align with customer contracts and compliance.

5) Observability
- Log 100% of tool calls with parameters (redacted if needed), latency, status, and policy decisions.
- Store per-run traces: model version, prompt template version, retrieved doc IDs/versions, cost estimate.
- Create dashboards: pass rate on golden tasks, $/successful outcome, escalation rate, reopen/rollback rate.

6) Evaluation & Release Process
- Build a golden-task suite (start with 50–200 tasks) and run it weekly and on every significant change.
- Track regressions explicitly (e.g., "refund policy: -3% since v3").
- Canary release (5% → 25% → 100%) with automated rollback triggers.
- Run quarterly "game days" to test the kill switch, escalation paths, and audit log integrity.

7) Unit Economics
- Measure cost per successful outcome (not per conversation).
- Implement a model cascade: cheap router + stronger reasoner only when needed.
- Add caching for retrieval and deterministic tool outputs where safe.
- Put hard budgets in the controller: max tool calls, max retries, max tokens.

Exit Criteria (Ready for wider rollout)
- ≥ 95% pass rate on golden tasks for the target workflow.
- Known failure modes are bounded by policy gates (no uncontrolled money/data actions).
- Cost per successful outcome is within the pre-set envelope for 2 consecutive weeks.
- An on-call runbook exists, including rollback steps and escalation ownership.
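The hard budgets from section 7 and the human-in-the-loop threshold from section 3 can be sketched together as a controller guard. This is a minimal illustration, not a specific framework's API: the names `Budget`, `Controller`, and `AUTO_REFUND_LIMIT` are assumptions, and a real system would route `BudgetExceeded` to the human queue or read-only fallback.

```python
from dataclasses import dataclass

# Illustrative per-run hard budgets (section 7).
@dataclass
class Budget:
    max_tool_calls: int = 10
    max_retries: int = 2
    max_tokens: int = 50_000

AUTO_REFUND_LIMIT = 50.00  # HITL threshold from section 3 (illustrative value)

def gate_refund(amount: float) -> str:
    """Policy gate: auto-approve small refunds, escalate everything above."""
    return "auto_approve" if amount <= AUTO_REFUND_LIMIT else "needs_human_approval"

class BudgetExceeded(Exception):
    """Raised when a run exhausts its budget; caller falls back to a human."""

class Controller:
    def __init__(self, budget: Budget):
        self.budget = budget
        self.tool_calls = 0
        self.tokens_used = 0

    def charge_tool_call(self, tokens: int) -> None:
        """Enforce hard limits before each tool call; fail closed."""
        self.tool_calls += 1
        self.tokens_used += tokens
        if self.tool_calls > self.budget.max_tool_calls:
            raise BudgetExceeded("max tool calls reached; route to human queue")
        if self.tokens_used > self.budget.max_tokens:
            raise BudgetExceeded("token budget reached; route to human queue")
```

The key design choice is that the limits live in the controller, outside the model's reach, so no prompt failure can spend past the envelope.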
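The per-call logging requirement in section 5 can be sketched as one structured record per tool call. The field names and the `REDACT` set are illustrative assumptions; the point is that every call carries a trace ID, latency, status, and the policy decision, with sensitive parameters redacted before the record leaves the process.

```python
import json
import time
import uuid

REDACT = {"card_number", "ssn"}  # illustrative list of sensitive parameter names

def log_tool_call(tool: str, params: dict, status: str, latency_ms: float,
                  policy_decision: str, trace_id: str = "") -> dict:
    """Emit one structured trace record per tool call (section 5)."""
    record = {
        "trace_id": trace_id or str(uuid.uuid4()),
        "ts": time.time(),
        "tool": tool,
        # Redact sensitive fields before the record is persisted.
        "params": {k: ("[REDACTED]" if k in REDACT else v)
                   for k, v in params.items()},
        "status": status,
        "latency_ms": latency_ms,
        "policy_decision": policy_decision,
    }
    print(json.dumps(record))  # in production, ship to your log pipeline instead
    return record
```

Because the record is plain JSON, the dashboards in section 5 (pass rate, $/successful outcome, escalation rate) can be computed directly from the log stream.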
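The golden-task gate from section 6 and the ≥ 95% exit criterion can be wired together as a simple release check. The boolean-results format and threshold wiring are assumptions for illustration; a real suite would carry per-task metadata for the regression tracking the checklist calls for.

```python
def pass_rate(results: list) -> float:
    """Fraction of golden tasks the agent completed correctly."""
    return sum(1 for r in results if r) / len(results) if results else 0.0

def release_gate(results: list, threshold: float = 0.95) -> bool:
    """Block canary promotion unless the golden suite meets the threshold."""
    return pass_rate(results) >= threshold
```

Running this gate on every significant change, and again at each canary stage (5% → 25% → 100%), gives the automated rollback trigger a concrete signal.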