Agent Reliability Readiness Checklist (2026)

Use this checklist before shipping any agent that calls tools, touches customer data, or triggers business actions.

1) Define the workflow and blast radius
- Write the agent’s job as a bounded spec: inputs, outputs, and what “success” means.
- List every tool/action the agent can take (read and write separately).
- Identify “high-risk actions” (money movement, identity changes, data deletion, security settings) and set explicit thresholds (e.g., refunds > $250 require approval).
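
One way to make the tool inventory and thresholds above concrete is a small registry in code. This is a minimal sketch, not a prescribed design: the tool names, the ToolSpec type, and requires_approval are hypothetical, and only the $250 refund threshold comes from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolSpec:
    """One entry in the agent's tool inventory (hypothetical shape)."""
    name: str
    writes: bool                       # read and write tools are listed separately
    high_risk: bool = False
    approval_threshold_usd: Optional[float] = None  # None = no amount-based threshold


# Hypothetical inventory; replace with the tools your agent actually has.
TOOLS = {
    "lookup_order": ToolSpec("lookup_order", writes=False),
    "issue_refund": ToolSpec("issue_refund", writes=True, high_risk=True,
                             approval_threshold_usd=250.0),
    "delete_account": ToolSpec("delete_account", writes=True, high_risk=True),
}


def requires_approval(tool_name: str, amount_usd: float = 0.0) -> bool:
    """Return True when a call must be approved before it runs."""
    spec = TOOLS[tool_name]  # KeyError for tools outside the inventory
    if not spec.high_risk:
        return False
    if spec.approval_threshold_usd is None:
        return True  # assumption: high-risk tools without a threshold always need approval
    return amount_usd > spec.approval_threshold_usd
```

Keeping the inventory in one place means the blast radius can be reviewed as data rather than inferred from prompts.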

2) Put budgets in code
- Set hard limits: max tool calls per run, max wall-clock time, max retries, and max cost per attempt.
- Decide escalation behavior when budgets are hit (hand off to a human, respond with a partial answer, or schedule an async follow-up).
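
The budgets above could be enforced with a small counter object that every tool call passes through. The sketch below is illustrative only: RunBudget, BudgetExceeded, the default limits, and the ESCALATION mapping are all assumptions, and it tracks cost per run rather than per attempt for simplicity.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    def __init__(self, limit: str):
        super().__init__(f"budget exceeded: {limit}")
        self.limit = limit


@dataclass
class RunBudget:
    # Hard limits (placeholder values, not recommendations).
    max_tool_calls: int = 20
    max_seconds: float = 60.0
    max_retries: int = 3
    max_cost_usd: float = 0.50
    # Running totals for the current run.
    tool_calls: int = 0
    retries: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float = 0.0, is_retry: bool = False) -> None:
        """Record one tool call and raise as soon as any limit is crossed."""
        self.tool_calls += 1
        self.cost_usd += cost_usd
        if is_retry:
            self.retries += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool_calls")
        if self.retries > self.max_retries:
            raise BudgetExceeded("retries")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("wall_clock")


# Hypothetical mapping from the limit that was hit to an escalation behavior.
ESCALATION = {
    "tool_calls": "handoff_to_human",
    "retries": "handoff_to_human",
    "cost": "partial_answer",
    "wall_clock": "schedule_async_followup",
}


def run_with_budget(steps, budget: RunBudget) -> str:
    """steps: iterable of (cost_usd, is_retry) pairs standing in for real tool calls."""
    try:
        for cost_usd, is_retry in steps:
            budget.charge(cost_usd, is_retry)
    except BudgetExceeded as exc:
        return ESCALATION[exc.limit]
    return "completed"
```

The point of the design is that the limit lives outside the model's control: the agent loop cannot talk its way past a raised exception.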

3) Lock down tool contracts
- Implement strict schemas (JSON schema or typed models) for every tool.
- Require idempotency keys for write operations so retries don’t duplicate actions.
- Log every tool call with: user/session ID, tool name, parameters, response, latency, and error class.

4) Add policy-as-code gates
- Encode business policies outside prompts (refund rules, privacy constraints, access control).
- Enforce allowlists for fields the agent can read/write in systems like Salesforce, Zendesk, Stripe, or internal admin panels.
- For high-risk actions, require “draft then execute” with policy validation and optional human approval.
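
A "draft then execute" gate might look roughly like the sketch below. The allowlist contents, the Draft type, and the validate/execute helpers are hypothetical; the sketch also assumes the policy layer, not the agent, sets needs_approval on a draft.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical allowlist: object type -> fields the agent may write.
WRITABLE_FIELDS = {
    "ticket": {"status", "tags"},
}


@dataclass
class Draft:
    """A proposed action. Nothing is applied until execute() accepts it."""
    action: str
    object_type: str
    changes: dict
    needs_approval: bool = False       # set by policy, not by the agent
    approved_by: Optional[str] = None


def validate(draft: Draft) -> list:
    """Return a list of policy violations; an empty list means the draft may run."""
    violations = []
    allowed = WRITABLE_FIELDS.get(draft.object_type, set())
    for name in draft.changes:
        if name not in allowed:
            violations.append(f"field not allowlisted: {draft.object_type}.{name}")
    if draft.needs_approval and draft.approved_by is None:
        violations.append("action requires human approval")
    return violations


def execute(draft: Draft, apply: Callable[[Draft], None]) -> dict:
    """Run policy validation first; call apply() only when there are no violations."""
    violations = validate(draft)
    if violations:
        return {"executed": False, "violations": violations}
    apply(draft)
    return {"executed": True, "violations": []}
```

Because the rules live in validate rather than in a prompt, a model or prompt change cannot silently relax them.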

5) Build an evaluation suite (your AI CI)
- Collect 200–1,000 representative historical cases (real user requests).
- Add 50–200 adversarial/edge cases (ambiguous identifiers, missing fields, prompt-injection text in retrieved docs).
- Track at least 5 metrics: task success rate, valid tool-call rate, policy violations per 1,000 runs, p95 latency, and cost per successful run.
- Within 48 hours of every production incident, add a regression test that reproduces it.
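
The five metrics named above can be computed from per-run records with a few lines of code. This is a sketch under assumptions: the RunResult fields are hypothetical, p95 uses the nearest-rank method (one of several common percentile definitions), and summarize expects at least one run.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one evaluation run (hypothetical record shape)."""
    succeeded: bool
    tool_calls: int
    valid_tool_calls: int
    policy_violations: int
    latency_s: float
    cost_usd: float


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(runs):
    n = len(runs)
    successes = sum(r.succeeded for r in runs)
    total_calls = sum(r.tool_calls for r in runs)
    valid_calls = sum(r.valid_tool_calls for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "task_success_rate": successes / n,
        "valid_tool_call_rate": valid_calls / total_calls if total_calls else 1.0,
        "policy_violations_per_1000_runs": 1000 * sum(r.policy_violations for r in runs) / n,
        "p95_latency_s": p95([r.latency_s for r in runs]),
        # Failed runs still cost money, so total spend is divided by successes only.
        "cost_per_successful_run_usd": total_cost / successes if successes else float("inf"),
    }
```

Running summarize on every candidate model or prompt change, and failing the build when a metric regresses past a chosen threshold, is what turns the suite into CI.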

6) Production observability and incident response
- Dashboards: tool error rate, retries per run, policy blocks, escalations, customer reopens, and spend.
- Runbooks: how to disable write tools, force read-only mode, roll back model versions, and route to humans.
- On-call ownership: name the team responsible for P0/P1 incidents and define severity criteria.
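
The runbook actions above are easier to execute under pressure if they map to switches the agent runtime checks on every tool call. A minimal sketch, assuming hypothetical tool names and a RuntimeControls object that would normally be loaded from a config or flag service:

```python
from dataclasses import dataclass, field

# Hypothetical set of tools that modify state.
WRITE_TOOLS = {"issue_refund", "update_ticket"}


@dataclass
class RuntimeControls:
    """Switches an on-call engineer can flip during an incident."""
    read_only: bool = False
    route_all_to_humans: bool = False
    disabled_tools: set = field(default_factory=set)


def decide(tool_name: str, controls: RuntimeControls) -> str:
    """Return 'allowed', 'blocked', or 'route_to_human' for a pending tool call."""
    if controls.route_all_to_humans:
        return "route_to_human"
    if tool_name in controls.disabled_tools:
        return "blocked"
    if controls.read_only and tool_name in WRITE_TOOLS:
        return "blocked"
    return "allowed"
```

For example, decide("issue_refund", RuntimeControls(read_only=True)) returns "blocked" while read tools keep working.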

7) Launch plan
- Start with a limited cohort (e.g., 1–5% traffic) and a clear rollback trigger.
- Use feature flags to switch models, prompts, and tool permissions without redeploying.
- Hold a weekly review of the top failure modes, top costs, and top policy blocks, then ship fixes.
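
A limited cohort and a rollback trigger can both be expressed as small pure functions. The sketch below assumes deterministic hash-based bucketing and a policy-violation rate as the trigger; the salt, the thresholds, and the function names are placeholders rather than recommendations.

```python
import hashlib


def in_cohort(user_id: str, percent: float, salt: str = "agent-launch-v1") -> bool:
    """Deterministically place a user in the launch cohort (percent is 0 to 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999
    return bucket < int(percent * 100)


def should_roll_back(policy_violations: int, runs: int,
                     max_per_1000: float = 5.0, min_runs: int = 200) -> bool:
    """Trigger a rollback when the violation rate exceeds the agreed limit.

    min_runs avoids reacting to noise before enough traffic has been observed.
    """
    if runs < min_runs:
        return False
    return 1000 * policy_violations / runs > max_per_1000
```

Hashing the user ID with a salt keeps each user in the same cohort across sessions, and changing the salt reshuffles the cohort for the next launch.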

If you can’t confidently answer who owns each metric and what happens when it fails, you’re not ready to scale the agent.