Agent Reliability Readiness Checklist (2026)

Use this checklist to move from "agent demo" to "production operator" with bounded autonomy, predictable cost, and auditable behavior. Minimal Python sketches for several of the steps follow the checklist.

1) Pick one workflow with a denominator
- Define the unit: per ticket, per invoice, per lead, per incident.
- Write a single-sentence definition of success and failure.

2) Define the action envelope (bounded autonomy)
- List allowed tools and disallowed tools.
- Decide read-only vs. write-enabled phases.
- Set a maximum number of tool calls per task (e.g., 10) and a maximum number of model calls (e.g., 20).

3) Establish identity + least privilege
- Use scoped OAuth/service accounts per environment.
- Implement allowlists (domains, projects, repositories, ticket queues).
- Require idempotency keys for any write action.

4) Set a cost budget per completed task
- Choose a p95 target (example: $0.25 per support resolution; $2.00 per internal ops task).
- Track retries and tool costs; don't measure tokens alone.
- Add hard caps: stop and escalate if the budget is exceeded.

5) Instrument end-to-end tracing before shipping
- Generate a trace_id per task.
- Record: prompts, retrieved-document metadata, tool calls (args + results), and final actions.
- Ensure every tool execution is linked to a trace_id (block writes without one).

6) Build an initial eval suite
- Curate 50–100 "gold" tasks from real historical cases.
- Add adversarial tests: prompt injection, data-exfiltration attempts, and refusal edge cases.
- Define metrics: task success, escalation correctness, undo rate, p95 latency, p95 cost.

7) Add schema validation for structured outputs
- Use JSON Schema/Pydantic validation.
- Reject invalid tool arguments and force a corrected retry.
- Prefer narrow tools (purpose-built endpoints) over general "execute anything" tools.

8) Add verification + gating before any write
- Policy-as-code checks (tenant, data class, destination, amount caps).
- Consistency checks (e.g., refund amount <= cap; email contains the required footer).
- Optional second-pass critique for high-stakes steps.

9) Implement human fallback paths
- Define who gets escalations and the SLA (e.g., <5 minutes internal).
- Provide "dry-run diffs" and approvals during early launch.
- Track undo actions and require root-cause notes for reversals.

10) Stage rollout with feature flags
- Start with internal users, then 1–5% of customers, then expand.
- Gate expansion on metrics: success rate, undo rate, and p95 cost.
- Keep kill switches for tool writes, external messaging, and spending actions.

11) Run postmortems like an SRE team
- For any high-severity error: reconstruct the trace, replay it, and document the root cause.
- Track incident rate per 1,000 tasks.
- Add the failure case to the eval suite to prevent regressions.

12) Plan for provider/model portability
- Separate orchestration and evaluation from model providers.
- Maintain a routing strategy (a cheap model for extraction; a strong model for synthesis).
- Re-run the eval suite before changing any model, prompt, tool, or retrieval index.

If you can't confidently pass steps 1–6, you're not ready for autonomous writes. Treat reliability as the product: budgets, policies, tracing, and evals are what make agents shippable in 2026.
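Step 2's limits are easiest to enforce as a single object the agent loop must consult before every call. Here is a minimal sketch, assuming an in-process loop; `ActionEnvelope` and `EnvelopeExceeded` are hypothetical names, not a library API:

```python
from dataclasses import dataclass

class EnvelopeExceeded(Exception):
    """Raised when a task hits a bounded-autonomy limit."""

@dataclass
class ActionEnvelope:
    allowed_tools: frozenset          # tool allowlist for this workflow
    write_enabled: bool = False       # read-only phase by default
    max_tool_calls: int = 10
    max_model_calls: int = 20
    tool_calls: int = 0
    model_calls: int = 0

    def before_tool_call(self, tool: str, is_write: bool) -> None:
        if tool not in self.allowed_tools:
            raise EnvelopeExceeded(f"tool not on allowlist: {tool}")
        if is_write and not self.write_enabled:
            raise EnvelopeExceeded(f"write tool called in read-only phase: {tool}")
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise EnvelopeExceeded("tool-call budget exhausted; escalate to a human")

    def before_model_call(self) -> None:
        self.model_calls += 1
        if self.model_calls > self.max_model_calls:
            raise EnvelopeExceeded("model-call budget exhausted; escalate to a human")
```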
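For step 3's idempotency keys, one common approach is to derive the key deterministically from the trace ID plus the canonical action payload, so a retried write carries the same key and downstream deduplication (e.g., an Idempotency-Key header, as Stripe-style APIs support) prevents double execution. The helper below is a hypothetical sketch of that derivation:

```python
import hashlib
import json

def idempotency_key(trace_id: str, action: str, payload: dict) -> str:
    """Deterministic key: same trace + action + arguments => same key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    raw = f"{trace_id}:{action}:{canonical}".encode()
    return hashlib.sha256(raw).hexdigest()

# A retry of the same refund produces the same key, so the API can dedupe it.
key = idempotency_key("trace-123", "issue_refund", {"ticket": "T-9", "usd": 12.5})
```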
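Step 4's hard cap only works if every model and tool call is charged against the task as it happens. A minimal sketch, assuming you can price each call as it completes; the names and prices are illustrative:

```python
class BudgetExceeded(Exception):
    """Stop the task and escalate; never silently keep spending."""

class CostMeter:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.line_items: list[tuple[str, float]] = []  # kept for the trace

    def charge(self, label: str, usd: float) -> None:
        self.line_items.append((label, usd))
        self.spent_usd += usd
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} against cap ${self.cap_usd:.2f}"
            )

meter = CostMeter(cap_usd=0.25)         # p95 target for a support resolution
meter.charge("model:extract", 0.004)    # every model call is charged...
meter.charge("tool:crm_lookup", 0.010)  # ...and every tool call, retries included
```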
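For step 5, the load-bearing rule is that a write without a trace_id never executes. A sketch using one JSON line per event, assuming structured logs are enough for a first version; the `TOOLS` registry and the tool in it are hypothetical:

```python
import json
import time
import uuid

TOOLS = {"crm_lookup": lambda ticket_id: {"status": "open", "ticket_id": ticket_id}}

def new_trace_id() -> str:
    return uuid.uuid4().hex

def log_event(trace_id: str, kind: str, **fields) -> None:
    # One JSON line per event: prompts, retrieval metadata, tool calls, final actions.
    print(json.dumps({"trace_id": trace_id, "ts": time.time(), "kind": kind, **fields}))

def execute_tool(trace_id: str | None, tool: str, args: dict, is_write: bool = False):
    if is_write and not trace_id:
        raise RuntimeError("write blocked: no trace_id")  # unauditable writes never run
    log_event(trace_id, "tool_call", tool=tool, args=args)
    result = TOOLS[tool](**args)
    log_event(trace_id, "tool_result", tool=tool, result=result)
    return result

execute_tool(new_trace_id(), "crm_lookup", {"ticket_id": "T-9"})
```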
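For step 6's metrics, even a small harness over the gold tasks can compute the gate numbers directly. A sketch assuming each eval run yields one record per task; the record fields are illustrative:

```python
from statistics import quantiles

def p95(values: list[float]) -> float:
    return quantiles(values, n=20)[-1]  # 95th percentile

runs = [
    {"success": True,  "escalated_correctly": True, "undone": False, "latency_s": 3.1, "cost_usd": 0.11},
    {"success": False, "escalated_correctly": True, "undone": False, "latency_s": 8.0, "cost_usd": 0.31},
    # ... one record per gold task
]

n = len(runs)
report = {
    "task_success": sum(r["success"] for r in runs) / n,
    "escalation_correctness": sum(r["escalated_correctly"] for r in runs) / n,
    "undo_rate": sum(r["undone"] for r in runs) / n,
    "p95_latency_s": p95([r["latency_s"] for r in runs]),
    "p95_cost_usd": p95([r["cost_usd"] for r in runs]),
}
```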
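Step 7 names Pydantic; a minimal sketch of the reject-and-retry loop follows. `call_model` is a hypothetical stand-in for your model client, and the refund cap is illustrative:

```python
from typing import Callable

from pydantic import BaseModel, Field, ValidationError

class RefundArgs(BaseModel):
    ticket_id: str
    amount_usd: float = Field(gt=0, le=50)  # illustrative cap; see step 8
    reason: str = Field(min_length=10)

def parse_with_retry(
    prompt: str, call_model: Callable[[str], str], max_attempts: int = 3
) -> RefundArgs:
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            return RefundArgs.model_validate_json(raw)
        except ValidationError as err:
            # Feed the machine-readable errors back and force a corrected retry.
            feedback = f"\n\nYour last output failed validation:\n{err.json()}"
    # Retries exhausted: escalate rather than guessing at arguments.
    raise RuntimeError("schema validation failed after retries; escalate to a human")
```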
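Step 8's policy-as-code checks can start as plain predicates evaluated before every write; a dedicated policy engine (e.g., OPA) can replace them later without changing the call site. A hypothetical sketch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WriteRequest:
    tenant: str
    data_class: str     # e.g. "public", "internal", "pii"
    destination: str    # e.g. "internal", "external"
    amount_usd: float

POLICIES = [
    ("tenant on allowlist", lambda r: r.tenant in {"acme", "internal"}),
    ("pii stays internal",  lambda r: not (r.data_class == "pii" and r.destination == "external")),
    ("amount under cap",    lambda r: r.amount_usd <= 50.0),
]

def violations(req: WriteRequest) -> list[str]:
    """Names of violated policies; the write proceeds only if this is empty."""
    return [name for name, check in POLICIES if not check(req)]

req = WriteRequest("acme", "pii", "external", 12.0)
assert violations(req) == ["pii stays internal"]  # blocked; escalate with the reason
```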
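Step 10's kill switches are only trustworthy if every risky action checks a flag that defaults to off and can be flipped without a deploy. A sketch using an in-memory dict as a stand-in for your flag service; all names are hypothetical:

```python
# Default-off flags; in production these live in a flag service so they can be
# flipped instantly, without a deploy.
FLAGS = {
    "tool_writes": False,
    "external_messaging": False,
    "spending_actions": False,
}

def require_flag(name: str) -> None:
    if not FLAGS.get(name, False):  # unknown flags are treated as off
        raise PermissionError(f"kill switch engaged: {name}")

def send_customer_email(body: str) -> None:
    require_flag("external_messaging")  # checked at the action site, not in the planner
    ...  # hand off to the mailer
```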
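Step 12's routing strategy can be a thin indirection layer so orchestration code never names a provider directly; swapping models then touches one table, after which the eval suite is re-run. The client interface and model names below are assumptions:

```python
from typing import Protocol

class ModelClient(Protocol):
    def complete(self, prompt: str) -> str: ...

# One table decides which model handles which kind of step.
ROUTES = {
    "extract": "cheap-model",      # structured extraction: cheap and fast
    "synthesize": "strong-model",  # customer-facing synthesis: strongest model
}

def route(task_kind: str, clients: dict[str, ModelClient], prompt: str) -> str:
    return clients[ROUTES[task_kind]].complete(prompt)
```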