Agent Production Readiness Checklist (2026)

Use this checklist before you let an agent take real actions (write to systems, email customers, change records, execute jobs). Score each item Pass/Needs Work and assign an owner.

1) Define "Success" Precisely
- Write a strict success definition for each workflow (what state must change, what fields must match, what is forbidden).
- Decide how you will measure success@1 and success@k (k=2 or 3).
- Map the workflow metric to a business KPI (e.g., cost per resolved ticket, time-to-close, revenue influenced).

2) Build and Maintain Evaluation Data
- Create an initial gold dataset of 200–500 scenarios: 40% happy path, 40% edge cases, 20% red-team/policy tests.
- Establish a monthly refresh process fed by real failures and escalations.
- Define a labeling rubric (strict correctness rules) and require two-person review for high-risk tasks.

3) Instrumentation and Tracing
- Log every run: model version, prompts, tool calls, tool outputs (with redaction), step latencies, and final outcome.
- Define retention and access controls for traces (PII handling, who can view, how long stored).
- Build dashboards for: success@1, escalation rate, policy violation rate, p95 latency, and effective cost per successful task.

4) Tool Permissions and Policy Gates
- Use least-privilege credentials per tool; avoid broad shared tokens.
- Create explicit allowlists for tools and actions; deny by default.
- Add pre-execution policy checks (scope, amount limits, entity ownership, required approvals).
- Add post-execution verification (did the intended state change happen correctly?).

5) Reliability Controls
- Enforce caps: max steps, max tool retries, max replans.
- Add deterministic validation: JSON schema checks, constraint checks (totals, dates, currency), and reconciliation diffs.
- Define safe fallbacks: escalate to a human with a structured handoff summary and full context.
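The "deny by default" gate in section 4 can be sketched in a few lines. This is a minimal illustration, not a specific framework: the `ToolCall` shape, the `POLICY` table, and the rule fields are all assumed names for this example.

```python
# Minimal pre-execution policy gate: allowlist + limits, deny by default.
# ToolCall, POLICY, and the rule fields are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    action: str
    amount: float = 0.0
    owner: str = ""   # who owns the target entity in the system of record
    actor: str = ""   # who the agent is acting on behalf of

# Only tool/action pairs listed here may execute at all.
POLICY = {
    ("crm", "update_record"):     {"max_amount": 0.0,   "owner_must_match": True},
    ("billing", "issue_refund"):  {"max_amount": 200.0, "owner_must_match": True},
}

def gate(call: ToolCall) -> tuple[bool, str]:
    """Return (allowed, reason). Anything not explicitly allowed is denied."""
    rule = POLICY.get((call.tool, call.action))
    if rule is None:
        return False, f"denied: {call.tool}.{call.action} not on allowlist"
    if call.amount > rule["max_amount"]:
        return False, f"denied: amount {call.amount} exceeds {rule['max_amount']}"
    if rule["owner_must_match"] and call.actor != call.owner:
        return False, "denied: actor does not own the target entity"
    return True, "allowed"
```

The point of the structure is that a new tool or action is blocked until someone deliberately adds it to the policy table, which is the opposite of the broad-shared-token failure mode the checklist warns against.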
6) Cost and Latency Budgets
- Set a target cost per successful task and a max cost per attempt.
- Track effective cost per success (including retries and tool calls).
- Alert when p95 latency or cost per success drifts beyond thresholds (e.g., +20% week-over-week).

7) Deployment and Change Management
- Require offline evals in CI for any change to prompts, tools, policies, or model versions.
- Roll out with progressive exposure (1% → 10% → 50% → 100%) and clear rollback triggers.
- Keep a changelog of model/prompt/policy versions tied to eval results.

8) Ongoing Operations
- Run weekly incident-style reviews of top failures; assign root causes (retrieval, tool errors, policy gaps, prompt ambiguity).
- Convert failures into new eval cases and add regression tests.
- Reassess scope quarterly: expand autonomy only when success@1 and human-minutes-per-success improve.

If you can't pass sections 1–4, you're not shipping an agent; you're shipping a demo. The fastest way to scale is to earn trust through measurably reliable, auditable autonomy in a narrow workflow.
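The cost metric and alert in section 6 reduce to two small functions. A minimal sketch, assuming per-attempt records with `cost_usd` and `success` fields (field names and the 20% threshold are illustrative):

```python
# Effective cost per success: total spend across ALL attempts (failures,
# retries, and tool calls included) divided by the count of successes.
def effective_cost_per_success(attempts: list[dict]) -> float:
    total_cost = sum(a["cost_usd"] for a in attempts)
    successes = sum(1 for a in attempts if a["success"])
    if successes == 0:
        return float("inf")  # spending with nothing to show for it
    return total_cost / successes

# Fire an alert when the metric rises more than `threshold` week-over-week.
def drift_alert(this_week: float, last_week: float, threshold: float = 0.20) -> bool:
    return this_week > last_week * (1 + threshold)
```

Dividing by successes rather than attempts is what makes the metric honest: a cheap model that fails half the time can cost more per successful task than an expensive model that rarely retries.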