Production Agent Readiness Checklist (2026)

Use this checklist before you let an AI agent touch real customer workflows or production systems. The goal is not “zero risk,” but bounded risk with measurable outcomes.

1) Scope & Success Criteria

- Define one workflow with a clear start/end state (e.g., “draft refund response under $100”).
- Write success criteria tied to outcomes: handle time, MTTR, SLA, CSAT, approval rate.
- Create a “truth signal” for evaluation (human decision, known correct output, downstream metric).

2) Tooling & Interfaces

- Tools are idempotent where possible (idempotency keys for writes).
- Every write-capable tool supports dry-run and returns a diff/preview.
- Tool schemas are strict: typed fields, enums for reason codes, no free-form identifiers.
- Separate read-only credentials from write credentials; write keys only available in a controlled runner.

3) Memory & Data Handling

- Split memory into: session state, long-term preferences, organizational docs.
- Set retention/TTL per memory type; document deletion mechanisms.
- Redact secrets (API keys, tokens) before logs and before model context.
- For customer data: document tenant isolation, access controls, and audit needs.

4) Guardrails & Policy

- Implement action tiering (read-only / reversible / irreversible) with explicit gates.
- High-risk actions require step-up controls: human approval and/or deterministic policy checks.
- Policies are versioned and testable (policy-as-code); prompt changes are reviewed like code.

5) Evaluation & Testing

- Build a test set of at least 100 real examples (redacted) plus 10–20 adversarial cases.
- Track: task success rate, citation coverage, tool call success, loop/retry rate, regret rate.
- Run offline evaluation before production; run canary in production with human gating.

6) Observability & Operations

- One trace links: user input, retrieval docs, model outputs, tool calls, and final action.
- Alerts exist for: tool failure spikes, runaway loops, cost per task anomalies.
- Rollback plan: ability to disable tools, pin model versions, and revert prompts/policies.
- Ownership is clear: on-call rotation or named operators for the agent service.

7) Economics & Rollout

- Track cost per successful task (model + tools + human review time).
- Start with 5–10% traffic; expand only when regret rate and tool success meet targets.
- Define the autonomy ladder: what the agent can do today vs. after 2–4 weeks of stable metrics.

If you can’t check at least 80% of the items above, ship with tighter gates (read-only or draft-only) until the platform is ready.
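The idempotency and dry-run requirements in section 2 can be sketched as follows. This is a minimal in-memory example, not a real tool framework; the names (`apply_write`, `WRITES`) and the dict-based "state" are illustrative assumptions, and a production runner would back the idempotency store with a database so retries survive restarts.

```python
import hashlib
import json

# In-memory stand-in for a durable idempotency store (assumption: a real
# runner would persist this). Keyed by a hash of tool name + arguments.
WRITES = {}

def idempotency_key(tool_name, args):
    """Derive a stable key from the tool name and canonicalized arguments."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def apply_write(tool_name, args, state, dry_run=False):
    """Apply a write at most once; dry_run returns the diff without mutating."""
    key = idempotency_key(tool_name, args)
    if key in WRITES:  # duplicate call: return the prior result, do nothing
        return WRITES[key]
    # Diff of fields that would change: (old value, new value).
    diff = {k: (state.get(k), v) for k, v in args.items() if state.get(k) != v}
    if dry_run:
        return {"applied": False, "diff": diff}
    state.update(args)
    result = {"applied": True, "diff": diff}
    WRITES[key] = result  # record so retries become no-ops
    return result
```

A caller can preview the diff with `dry_run=True`, then commit; a retried commit with the same arguments returns the recorded result instead of writing twice.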
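The per-type retention rule in section 3 can be made concrete with a small sweep function. The TTL values below are placeholder assumptions, not recommendations, and `sweep` stands in for whatever deletion mechanism you document.

```python
import time

# Illustrative retention policy per memory type (values are assumptions).
# None means "retained until explicitly deleted".
TTL_SECONDS = {
    "session": 60 * 60,             # session state: 1 hour
    "preferences": 90 * 24 * 3600,  # long-term preferences: 90 days
    "org_docs": None,               # organizational docs: explicit deletion only
}

def expired(memory_type, written_at, now=None):
    """True if a record of this memory type is past its retention window."""
    ttl = TTL_SECONDS[memory_type]
    if ttl is None:
        return False
    now = time.time() if now is None else now
    return now - written_at > ttl

def sweep(records, now=None):
    """Drop expired records: the deletion mechanism the checklist asks you to document."""
    return [r for r in records if not expired(r["type"], r["written_at"], now)]
```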
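Section 4's action tiering with explicit gates can be sketched as a deterministic check in front of every tool call. The tool names and tier assignments are hypothetical; in practice the mapping would be versioned alongside your policy-as-code.

```python
from enum import Enum

class Tier(Enum):
    READ_ONLY = 0
    REVERSIBLE = 1
    IRREVERSIBLE = 2

# Hypothetical tool-to-tier mapping (assumption: maintained as versioned policy).
TOOL_TIERS = {
    "search_orders": Tier.READ_ONLY,
    "draft_reply": Tier.REVERSIBLE,
    "issue_refund": Tier.IRREVERSIBLE,
}

def gate(tool_name, human_approved=False):
    """Deterministic gate: irreversible actions require step-up human approval."""
    tier = TOOL_TIERS[tool_name]
    if tier is Tier.IRREVERSIBLE and not human_approved:
        return "blocked: human approval required"
    return "allowed"
```

Because the gate is plain code rather than a prompt instruction, it can be unit-tested and reviewed like any other change.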
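The metrics in section 5 can be aggregated from per-run evaluation records along these lines. The record fields and the retry threshold for counting a run as a loop are assumptions for illustration.

```python
def summarize(runs):
    """Aggregate per-run eval records into the checklist's tracked rates."""
    n = len(runs)
    tool_calls = sum(r["tool_calls"] for r in runs)
    tool_ok = sum(r["tool_calls_ok"] for r in runs)
    return {
        "task_success_rate": sum(r["success"] for r in runs) / n,
        "tool_call_success": tool_ok / tool_calls if tool_calls else 1.0,
        # Assumption: more than 2 retries counts as a loop for this sketch.
        "loop_rate": sum(r["retries"] > 2 for r in runs) / n,
        "regret_rate": sum(r["regretted"] for r in runs) / n,
    }
```

Running this over the offline test set first, then over canary traffic, gives you the same numbers in both places so the production gate is comparable to the offline baseline.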
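For section 6, a single trace record linking the five artifacts, plus one of the alert conditions, might look like the sketch below. The field names are illustrative, not a tracing-library schema, and the 20% failure threshold is an assumed default.

```python
import uuid

def new_trace(user_input):
    """One trace id links every artifact section 6 lists for a single task."""
    return {
        "trace_id": uuid.uuid4().hex,
        "user_input": user_input,
        "retrieval_docs": [],
        "model_outputs": [],
        "tool_calls": [],
        "final_action": None,
    }

def tool_failure_alert(window, threshold=0.2):
    """Fire when the failed fraction of recent tool calls exceeds threshold (assumed 20%)."""
    failures = sum(1 for ok in window if not ok)
    return failures / len(window) > threshold
```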
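Section 7's cost tracking and expansion gate reduce to simple arithmetic. The function names and the regret/tool-success thresholds below are assumptions; the checklist leaves the actual targets to you.

```python
def cost_per_successful_task(model_cost, tool_cost, review_minutes, review_rate_per_hour, successes):
    """Total spend (model + tools + human review time) divided by successful tasks."""
    total = model_cost + tool_cost + review_minutes / 60 * review_rate_per_hour
    return total / successes

def may_expand_traffic(regret_rate, tool_success, max_regret=0.02, min_tool_success=0.98):
    """Expand beyond the initial 5-10% slice only when both targets are met (thresholds assumed)."""
    return regret_rate <= max_regret and tool_success >= min_tool_success
```

For example, $40 of model spend, $10 of tool spend, and 120 minutes of review at $30/hour across 55 successful tasks works out to $2.00 per successful task.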