Agent Production Readiness Checklist (2026)

Use this checklist before you let an agent take real actions (write to systems, email customers, change records, execute jobs). Score each item Pass/Needs Work and assign an owner.

1) Define "Success" Precisely
- Write a strict success definition for each workflow (what state must change, what fields must match, what is forbidden).
- Decide how you will measure success@1 and success@k (k=2 or 3).
- Map the workflow metric to a business KPI (e.g., cost per resolved ticket, time-to-close, revenue influenced).

2) Build and Maintain Evaluation Data
- Create an initial gold dataset of 200–500 scenarios: 40% happy path, 40% edge cases, 20% red-team/policy tests.
- Establish a monthly refresh process fed by real failures and escalations.
- Define a labeling rubric (strict correctness rules) and require two-person review for high-risk tasks.

3) Instrumentation and Tracing
- Log every run: model version, prompts, tool calls, tool outputs (with redaction), step latencies, and final outcome.
- Define retention and access controls for traces (PII handling, who can view, how long stored).
- Build dashboards for: success@1, escalation rate, policy violation rate, p95 latency, and effective cost per successful task.

4) Tool Permissions and Policy Gates
- Use least-privilege credentials per tool; avoid broad shared tokens.
- Create explicit allowlists for tools and actions; deny by default.
- Add pre-execution policy checks (scope, amount limits, entity ownership, required approvals).
- Add post-execution verification (did the intended state change happen correctly?).

5) Reliability Controls
- Enforce caps: max steps, max tool retries, max replans.
- Add deterministic validation: JSON schema checks, constraint checks (totals, dates, currency), and reconciliation diffs.
- Define safe fallbacks: escalate to a human with a structured handoff summary and full context.
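The "deny by default" gate in section 4 can be sketched in a few lines. This is a minimal illustration, not a specific framework: the `ToolCall` shape, the `POLICY` table, and the rule fields are all assumed names for this example.

```python
# Minimal pre-execution policy gate: allowlist + limits, deny by default.
# ToolCall, POLICY, and the rule fields are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    action: str
    amount: float = 0.0
    owner: str = ""   # who owns the target entity in the system of record
    actor: str = ""   # who the agent is acting on behalf of

# Only tool/action pairs listed here may execute at all.
POLICY = {
    ("crm", "update_record"):     {"max_amount": 0.0,   "owner_must_match": True},
    ("billing", "issue_refund"):  {"max_amount": 200.0, "owner_must_match": True},
}

def gate(call: ToolCall) -> tuple[bool, str]:
    """Return (allowed, reason). Anything not explicitly allowed is denied."""
    rule = POLICY.get((call.tool, call.action))
    if rule is None:
        return False, f"denied: {call.tool}.{call.action} not on allowlist"
    if call.amount > rule["max_amount"]:
        return False, f"denied: amount {call.amount} exceeds {rule['max_amount']}"
    if rule["owner_must_match"] and call.actor != call.owner:
        return False, "denied: actor does not own the target entity"
    return True, "allowed"
```

The point of the structure is that a new tool or action is blocked until someone deliberately adds it to the policy table, which is the opposite of the broad-shared-token failure mode the checklist warns against.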
6) Cost and Latency Budgets
- Set a target cost per successful task and a max cost per attempt.
- Track effective cost per success (including retries and tool calls).
- Alert when p95 latency or cost per success drifts beyond thresholds (e.g., +20% week-over-week).

7) Deployment and Change Management
- Require offline evals in CI for any change to prompts, tools, policies, or model versions.
- Roll out with progressive exposure (1% → 10% → 50% → 100%) and clear rollback triggers.
- Keep a changelog of model/prompt/policy versions tied to eval results.

8) Ongoing Operations
- Run weekly incident-style reviews of top failures; assign root causes (retrieval, tool errors, policy gaps, prompt ambiguity).
- Convert failures into new eval cases and add regression tests.
- Reassess scope quarterly: expand autonomy only when success@1 and human-minutes-per-success improve.

If you can't pass sections 1–4, you're not shipping an agent; you're shipping a demo. The fastest way to scale is to earn trust through measurably reliable, auditable autonomy in a narrow workflow.
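The cost metric and alert in section 6 reduce to two small functions. A minimal sketch, assuming per-attempt records with `cost_usd` and `success` fields (field names and the 20% threshold are illustrative):

```python
# Effective cost per success: total spend across ALL attempts (failures,
# retries, and tool calls included) divided by the count of successes.
def effective_cost_per_success(attempts: list[dict]) -> float:
    total_cost = sum(a["cost_usd"] for a in attempts)
    successes = sum(1 for a in attempts if a["success"])
    if successes == 0:
        return float("inf")  # spending with nothing to show for it
    return total_cost / successes

# Fire an alert when the metric rises more than `threshold` week-over-week.
def drift_alert(this_week: float, last_week: float, threshold: float = 0.20) -> bool:
    return this_week > last_week * (1 + threshold)
```

Dividing by successes rather than attempts is what makes the metric honest: a cheap model that fails half the time can cost more per successful task than an expensive model that rarely retries.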