Production Agent Readiness Checklist (2026)

Use this checklist before you ship any agent that takes actions (writes to systems, sends customer messages, triggers refunds, changes records).

1) Scope & Ownership
- Define one workflow with a clear “done” state (e.g., ticket resolved, invoice generated).
- Name an accountable owner (PM/ops lead) and a technical owner (eng lead).
- Document what the agent is NOT allowed to do (explicit prohibitions).

2) Tools & Permissions
- List every tool/API the agent can call, with allowed methods and fields.
- Implement least-privilege access (separate service accounts; environment scoping).
- Set approval thresholds for irreversible actions (e.g., refunds > $200 require human approval).
- Add rate limits and a “max tool calls per job” cap (typical: 3–10).
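
A minimal sketch of how an approval threshold and a per-job tool-call cap might be enforced in code. The names `JobBudget` and `refund_decision` are hypothetical; the $200 threshold and the cap of 10 simply echo the examples in the list above and should be replaced with your own limits.

```python
from dataclasses import dataclass

# Illustrative limits taken from the checklist examples, not recommendations.
REFUND_APPROVAL_THRESHOLD_USD = 200.0
MAX_TOOL_CALLS_PER_JOB = 10


@dataclass
class JobBudget:
    """Counts tool calls for one job and enforces the per-job cap."""
    max_calls: int = MAX_TOOL_CALLS_PER_JOB
    used: int = 0

    def charge(self) -> None:
        # Refuse the call once the cap is reached, before any side effect happens.
        if self.used >= self.max_calls:
            raise RuntimeError("tool-call budget exhausted for this job")
        self.used += 1


def refund_decision(amount_usd: float, budget: JobBudget) -> str:
    """Return 'auto' if the agent may refund directly, else 'needs_human'."""
    budget.charge()
    if amount_usd > REFUND_APPROVAL_THRESHOLD_USD:
        return "needs_human"
    return "auto"
```

Charging the budget before the threshold check means a request that gets routed to a human still counts against the job's cap, which keeps a looping agent from flooding the approval queue.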

3) Data & Retrieval (if using RAG)
- Create a data contract per source: owner, refresh cadence, retention, access rules.
- Track freshness lag (time from a source update to its availability in the index) and set an SLA for it.
- Use hybrid retrieval when proper nouns/IDs matter (sparse + dense).
- Require citations for policy decisions and customer-facing factual claims.
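
Freshness lag can be computed from two timestamps per document. The sketch below assumes you can record when a source document changed and when that change became searchable; the 30-minute SLA and the function names are placeholders for illustration.

```python
from datetime import datetime, timedelta

# Assumed SLA for illustration; the checklist does not prescribe a value.
FRESHNESS_SLA = timedelta(minutes=30)


def freshness_lag(updated_at: datetime, indexed_at: datetime) -> timedelta:
    """Time between a source update and the moment it became searchable."""
    return indexed_at - updated_at


def sla_breaches(
    events: list[tuple[str, datetime, datetime]],
    sla: timedelta = FRESHNESS_SLA,
) -> list[str]:
    """Return IDs of documents whose freshness lag exceeded the SLA.

    Each event is (doc_id, updated_at, indexed_at).
    """
    return [
        doc_id
        for doc_id, updated_at, indexed_at in events
        if freshness_lag(updated_at, indexed_at) > sla
    ]
```

Reporting breaches per document rather than only an average makes it easier to trace a stale answer back to the specific source that lagged.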

4) Observability & Audit
- Emit a trace per job: inputs, retrieved docs, tool calls, outputs, latency, cost.
- Store tool-call logs in an auditable system (SIEM-compatible in enterprise settings).
- Add a “job ledger” record: model_calls, tokens, tool_calls, wall_time, cost_usd, outcome.
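
One possible shape for the job ledger record, using the field names listed above. The `job_id` field, the choice of seconds for `wall_time`, and the JSON-lines serialization are assumptions added for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JobLedgerRecord:
    job_id: str        # hypothetical identifier, not named in the checklist
    model_calls: int
    tokens: int
    tool_calls: int
    wall_time: float   # assumed to be seconds
    cost_usd: float
    outcome: str       # e.g. "completed", "escalated", "failed"


def to_json_line(record: JobLedgerRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

One line per job keeps the ledger append-only and easy to load into whatever analytics or audit store you already use.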

5) Evals & Release Discipline
- Build an eval set of 200–1,000 real cases (include edge cases).
- Define pass/fail rubrics and label reasons (retrieval miss, tool error, policy violation).
- Run regression evals on every change: model, prompt, retrieval index, tool schema.
- Use canary releases (start 1–5% traffic) and maintain rollback capability.
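
A sketch of deterministic canary assignment, assuming each job has a stable string ID. Hash-based bucketing is one common approach, not the only one; `in_canary` is a hypothetical helper name.

```python
import hashlib


def in_canary(job_id: str, percent: float) -> bool:
    """Route roughly `percent` of job IDs to the canary, deterministically."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0.01% each).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Because the assignment depends only on the ID, the same job always lands in the same cohort, and rolling back is a matter of setting `percent` to 0.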

6) Safety & Verification
- Add deterministic validators: JSON schema, regex, business rules, forbidden content filters.
- Implement a verifier step for high-risk actions (second model or rule-based checks).
- Define escalation rules: low confidence, a missing citation, or policy ambiguity routes the job to a human.
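
The escalation rules above can be expressed as a small deterministic check that runs before any action is taken. This is a sketch under stated assumptions: the `Draft` fields, the 0.8 confidence floor, and the forbidden-phrase pattern are all hypothetical placeholders.

```python
import re
from dataclasses import dataclass, field

CONFIDENCE_FLOOR = 0.8  # assumed threshold, tune against your eval set
# Hypothetical forbidden-content pattern for illustration only.
FORBIDDEN = re.compile(r"\b(guarantee|legal advice)\b", re.IGNORECASE)


@dataclass
class Draft:
    text: str
    confidence: float
    citations: list[str] = field(default_factory=list)
    policy_ambiguous: bool = False


def escalation_reasons(draft: Draft) -> list[str]:
    """Return every reason this draft must go to a human; empty means proceed."""
    reasons = []
    if draft.confidence < CONFIDENCE_FLOOR:
        reasons.append("low_confidence")
    if not draft.citations:
        reasons.append("missing_citation")
    if draft.policy_ambiguous:
        reasons.append("policy_ambiguity")
    if FORBIDDEN.search(draft.text):
        reasons.append("forbidden_content")
    return reasons
```

Returning all reasons rather than stopping at the first gives reviewers and eval labels a fuller picture of why a job was escalated.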

7) Economics & KPIs
- Measure cost per completed outcome (not cost per token).
- Track completion rate, critical error rate, human escalation rate, and p95 latency.
- Model human review cost (fully loaded $/min) and include it in unit economics.
- Set “stop-ship” thresholds (e.g., critical error rate > 2% or p95 latency > 15 s).
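
A rough sketch of the unit-economics arithmetic and a stop-ship check. The function names and parameters are hypothetical; the default thresholds (2% critical error rate, 15 s p95 latency) mirror the examples in the list above.

```python
def cost_per_completed_outcome(
    model_cost_usd: float,
    review_minutes: float,
    review_cost_per_min_usd: float,
    completed: int,
) -> float:
    """Total spend (model plus human review) divided by completed outcomes."""
    if completed <= 0:
        raise ValueError("need at least one completed outcome")
    total = model_cost_usd + review_minutes * review_cost_per_min_usd
    return total / completed


def stop_ship(
    critical_error_rate: float,
    p95_latency_s: float,
    max_error_rate: float = 0.02,
    max_p95_s: float = 15.0,
) -> bool:
    """True if either metric exceeds its stop-ship threshold."""
    return critical_error_rate > max_error_rate or p95_latency_s > max_p95_s
```

For instance, $120 of model spend plus 90 review minutes at $1.00 per minute across 400 completed jobs works out to (120 + 90) / 400 = $0.525 per completed outcome.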

Go/No-Go Rule of Thumb
Ship to broader traffic only when you can show at least two weeks of stable metrics on the canary cohort and can explain every major failure mode with trace evidence.