Agent Reliability Launch Checklist (2026)

Use this checklist to move from an agent prototype to a production deployment with measurable reliability. Target: one workflow shipped with clear autonomy boundaries in 30 days. (Illustrative Python sketches for each step follow the checklist.)

1) Define the workflow and success criteria
- Write a one-sentence “job”: what the agent does and where it operates (e.g., “Resolve order-status tickets and issue refunds up to $50”).
- Define Task Success Rate (TSR) with an objective outcome (ticket closed with correct tags; refund amount matches policy).
- Set non-functional SLOs: p95 latency (e.g., <= 5s), cost per task (e.g., <= $0.10), and max failure rate.

2) Scope tools and permissions (least privilege)
- Inventory required tools/APIs and classify by risk: read-only, reversible write, irreversible/regulatory.
- Create a dedicated agent identity (service principal) and issue scoped, short-lived credentials.
- Add allowlists for tools and destinations (domains, repos, projects) and hard rate limits.

3) Implement budgets and stop conditions
- Enforce max tokens, max tool calls, and max wall time per task.
- Add loop detection (repeated tool-call patterns, repeated failures) and a forced “escalate” outcome.
- Define “safe fallback” behaviors when budgets are exceeded (hand off to a human, respond with status-only output).

4) Add policy-as-code gates
- Express rules in a testable engine (OPA/Cedar or equivalent), not only in prompts.
- Required rules: action caps (e.g., refund <= $50), prohibited actions, PII handling, logging constraints.
- Ensure every decision is logged: policy version, rule ID, allow/deny, justification.

5) Build an evaluation suite (deployment gate)
- Curate 200–500 representative tasks from real history; remove sensitive data.
- Add 30–50 adversarial cases: prompt injection, ambiguous instructions, conflicting policies.
- Implement automated checks: schema validity, tool-argument correctness, policy compliance.
- Add human review for a sampled set and for all high-risk tiers.

6) Observability and audit readiness
- Emit structured traces for each run: run_id, prompt_version, model_version, tool calls, cost_usd, latency.
- Store replayable inputs/outputs (with redaction) so runs can be re-simulated.
- Create dashboards: TSR over time, cost/task, tool-call distribution, blocked actions, escalation rate.

7) Deployment process
- Run shadow mode: the new version sees live inputs but does not execute actions.
- Canary release: start with 5–10% of traffic and promote based on thresholds.
- Document rollback: how to revert prompt/model/policy versions within minutes.

8) Ongoing operations
- Weekly eval refresh: add new edge cases and failures to the test set.
- Monthly access review: confirm tool scopes and credentials are still minimal.
- Incident discipline: postmortems for policy violations, cost spikes, or TSR drops >2% week-over-week.

If you can’t measure TSR, cost/task, and policy violations daily, you’re not ready for autonomous execution; keep the agent in “assist” mode until you can.
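
The sketches below illustrate one way to implement several checklist items in Python. Every identifier, record shape, and threshold not quoted from the checklist is an illustrative assumption, not a prescribed implementation.

For step 1, a minimal sketch of computing TSR and checking the SLO targets over a batch of completed tasks. The `TaskRecord` fields are assumptions; the <= 5s and <= $0.10 targets come from the checklist:

```python
import math
from dataclasses import dataclass

@dataclass
class TaskRecord:
    succeeded: bool   # objective outcome met (ticket closed with correct tags, etc.)
    latency_s: float  # end-to-end wall time for the task
    cost_usd: float   # total model + tool cost for the task

def tsr(records: list[TaskRecord]) -> float:
    """Task Success Rate: fraction of tasks whose objective outcome was met."""
    return sum(r.succeeded for r in records) / len(records)

def p95(values: list[float]) -> float:
    """Nearest-rank p95; adequate for dashboard-scale samples."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def slo_report(records: list[TaskRecord]) -> dict:
    return {
        "tsr": tsr(records),
        "p95_latency_ok": p95([r.latency_s for r in records]) <= 5.0,              # target: <= 5s
        "mean_cost_ok": sum(r.cost_usd for r in records) / len(records) <= 0.10,   # target: <= $0.10/task
    }

if __name__ == "__main__":
    batch = [TaskRecord(True, 2.1, 0.04), TaskRecord(False, 6.3, 0.12), TaskRecord(True, 3.0, 0.05)]
    print(slo_report(batch))
```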
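
For step 2, a sketch of a least-privilege tool gate: explicit allowlists for tools and destinations plus a hard per-tool rate limit. The tool names, domain, and limits are assumptions:

```python
import time
from collections import defaultdict, deque

ALLOWED_TOOLS = {"get_order_status", "issue_refund"}   # inventoried, risk-classified tools
ALLOWED_DOMAINS = {"api.internal.example.com"}          # destination allowlist
RATE_LIMIT = 10          # max calls per tool...
RATE_WINDOW_S = 60.0     # ...per rolling 60-second window

_calls: dict[str, deque] = defaultdict(deque)  # timestamps of recent calls, per tool

def authorize_tool_call(tool: str, domain: str) -> None:
    """Raise PermissionError unless the call passes every allowlist and rate limit."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not on allowlist: {tool}")
    if domain not in ALLOWED_DOMAINS:
        raise PermissionError(f"destination not on allowlist: {domain}")
    now = time.monotonic()
    window = _calls[tool]
    while window and now - window[0] > RATE_WINDOW_S:
        window.popleft()                    # drop calls that fell outside the window
    if len(window) >= RATE_LIMIT:
        raise PermissionError(f"rate limit exceeded for {tool}")
    window.append(now)
```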
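
For step 3, a sketch of per-task budgets with loop detection and a forced escalation outcome. The budget values and the repeat threshold are assumptions:

```python
import time
from collections import Counter

class BudgetExceeded(Exception):
    """Signals the safe fallback: hand off to a human or reply status-only."""

class RunBudget:
    def __init__(self, max_tokens=20_000, max_tool_calls=15, max_wall_s=60.0, max_repeats=3):
        self.max_tokens, self.max_tool_calls = max_tokens, max_tool_calls
        self.max_wall_s, self.max_repeats = max_wall_s, max_repeats
        self.tokens = 0
        self.tool_calls = 0
        self.started = time.monotonic()
        self.call_signatures = Counter()   # (tool, args) pairs seen so far in this run

    def charge(self, tokens: int, tool: str | None = None, args: str = "") -> None:
        """Record usage for one step; raise BudgetExceeded when a limit trips."""
        self.tokens += tokens
        if tool is not None:
            self.tool_calls += 1
            self.call_signatures[(tool, args)] += 1
            # Loop detection: the same tool called with the same arguments more
            # than max_repeats times usually means the agent is stuck.
            if self.call_signatures[(tool, args)] > self.max_repeats:
                raise BudgetExceeded(f"loop detected on {tool}; escalate")
        if (self.tokens > self.max_tokens
                or self.tool_calls > self.max_tool_calls
                or time.monotonic() - self.started > self.max_wall_s):
            raise BudgetExceeded("budget exceeded; fall back to safe behavior")
```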
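
For step 4, a plain-Python stand-in for a policy-as-code gate; in production the checklist calls for OPA/Cedar or an equivalent engine. The rule IDs, policy version string, and the prohibited-action list are assumptions (the $50 cap is from the checklist), but the logged fields mirror the requirement: policy version, rule ID, allow/deny, justification:

```python
import json
import logging

POLICY_VERSION = "2026-01-15"   # hypothetical version tag for the active policy bundle
logging.basicConfig(level=logging.INFO)

def evaluate(action: dict) -> bool:
    """Return True to allow the action; log every decision either way."""
    decision, rule_id, why = True, "default-allow", "no deny rule matched"
    if action["name"] == "issue_refund" and action.get("amount_usd", 0) > 50:
        decision, rule_id, why = False, "REFUND-CAP-001", "refund exceeds $50 cap"
    elif action["name"] in {"delete_account", "export_pii"}:
        decision, rule_id, why = False, "PROHIBITED-002", "action is never allowed"
    logging.info(json.dumps({
        "policy_version": POLICY_VERSION,
        "rule_id": rule_id,
        "decision": "allow" if decision else "deny",
        "justification": why,
        "action": action["name"],
    }))
    return decision

if __name__ == "__main__":
    evaluate({"name": "issue_refund", "amount_usd": 75})   # denied by REFUND-CAP-001
```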
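
For step 5, a sketch of the automated layer of the eval suite, run against one recorded trace: schema validity, tool-argument correctness, and policy compliance. The expected output fields and the run-record shape are assumptions:

```python
def check_run(run: dict) -> list[str]:
    """Return failure reasons for one recorded run; an empty list means it passes."""
    failures = []
    # Schema validity: the final answer must carry the required fields.
    for field in ("ticket_id", "resolution", "tags"):
        if field not in run.get("output", {}):
            failures.append(f"missing output field: {field}")
    # Tool-argument correctness: refunds need a positive amount within the cap.
    for call in run.get("tool_calls", []):
        if call["name"] == "issue_refund":
            amount = call["args"].get("amount_usd", 0)
            if not (0 < amount <= 50):
                failures.append(f"refund amount out of policy range: {amount}")
    # Policy compliance: nothing the gate denied should have executed.
    if any(c.get("policy_decision") == "deny" and c.get("executed")
           for c in run.get("tool_calls", [])):
        failures.append("executed a policy-denied action")
    return failures
```

Adversarial cases from the suite feed the same checker; a run passes the deployment gate only when it produces no failures.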
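
For step 6, a sketch of a structured, replayable trace record. The field names mirror the checklist (run_id, prompt_version, model_version, tool calls, cost_usd, latency); the redaction helper is a deliberately rough assumption and real deployments need a proper scrubber:

```python
import json
import re
import time
import uuid

def redact(text: str) -> str:
    """Crude PII scrub (emails only) so stored inputs/outputs can be replayed."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)

def emit_trace(prompt_version, model_version, tool_calls, cost_usd, latency_s,
               user_input, output):
    record = {
        "run_id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_version": prompt_version,
        "model_version": model_version,
        "tool_calls": tool_calls,       # [{"name": ..., "args": ..., "result": ...}, ...]
        "cost_usd": cost_usd,
        "latency_s": latency_s,
        "input": redact(user_input),    # replayable, with redaction
        "output": redact(output),
    }
    print(json.dumps(record))           # stand-in for a real trace store / log pipeline
```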
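
For step 7, a sketch of a canary promotion check: promote only when the canary’s TSR has not meaningfully regressed against control and its SLOs still hold. The 2-point tolerance and the 200-task minimum are assumptions; the 5s latency target comes from step 1:

```python
def should_promote(canary_tsr: float, control_tsr: float,
                   canary_p95_s: float, canary_tasks: int) -> bool:
    """Gate for moving a canary from 5-10% of traffic to full rollout."""
    if canary_tasks < 200:               # don't promote on thin evidence
        return False
    if canary_tsr < control_tsr - 0.02:  # allow at most a 2-point TSR regression
        return False
    if canary_p95_s > 5.0:               # latency SLO from step 1 must still hold
        return False
    return True
```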
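
For step 8, a sketch of the incident trigger, reading the checklist’s “>2% week-over-week” as 2 percentage points of absolute TSR; the alerting hook is an assumption:

```python
def tsr_regression(this_week: float, last_week: float, threshold: float = 0.02) -> bool:
    """True when the week-over-week TSR drop exceeds the postmortem threshold."""
    return (last_week - this_week) > threshold

if tsr_regression(this_week=0.91, last_week=0.95):
    print("open an incident: TSR dropped >2% week-over-week")
```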