AGENT RELIABILITY LAUNCH CHECKLIST (2026)

Use this checklist to ship a tool-using agent that’s safe, measurable, and economically predictable.

1) DEFINE THE JOB
- Write a one-sentence “definition of done” for the agent’s primary workflow.
- List systems of record it will touch (e.g., Jira, GitHub, Salesforce, AWS).
- Identify irreversible actions (deletes, refunds, permission changes) and classify them as Tier 3 or Tier 4 under the risk tier policy in section 2.

2) CREATE A RISK TIER POLICY
- Tier 0: read-only. Tier 1: low-risk writes. Tier 2: business-impacting writes. Tier 3: prod/financial. Tier 4: regulatory/irreversible.
- For each tier, define required controls: allowlists, approvals, audit logs, rollback plans.
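One way to make the tier policy machine-checkable is a small lookup table. The sketch below is illustrative only: the tier names follow the list above, but which controls each tier requires is an assumption, and `missing_controls` is a hypothetical helper, not part of any library.

```python
from dataclasses import asdict, dataclass
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 0
    LOW_RISK_WRITE = 1
    BUSINESS_WRITE = 2
    PROD_FINANCIAL = 3
    REGULATORY_IRREVERSIBLE = 4


@dataclass(frozen=True)
class Controls:
    allowlist: bool
    approval: bool
    audit_log: bool
    rollback_plan: bool


# Assumed mapping for illustration; the real mapping is a policy decision.
POLICY = {
    Tier.READ_ONLY: Controls(True, False, True, False),
    Tier.LOW_RISK_WRITE: Controls(True, False, True, True),
    Tier.BUSINESS_WRITE: Controls(True, True, True, True),
    Tier.PROD_FINANCIAL: Controls(True, True, True, True),
    Tier.REGULATORY_IRREVERSIBLE: Controls(True, True, True, True),
}


def missing_controls(tier: Tier, satisfied: set) -> list:
    """Return the names of required controls that are not yet in place."""
    required = asdict(POLICY[tier])
    return sorted(name for name, needed in required.items()
                  if needed and name not in satisfied)
```

An executor could call `missing_controls(tier, satisfied)` before any tool call and refuse to proceed unless the result is empty.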

3) DESIGN THE EXECUTION ARCHITECTURE
- Separate planning from execution: planner proposes steps; executor runs tools.
- Require “preview/diff” output for any write action.
- Add deterministic validators (schema checks, regex/allowlists, policy engine like OPA).
- Set hard limits: max tool calls, max retries, max wall time.
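The planner/executor split, deterministic validation, and hard limits can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the plan is a list of dicts with `tool` and `args` keys, the tool names in `ALLOWED_TOOLS` are hypothetical, and only `TimeoutError` is treated as retryable.

```python
import time
from dataclasses import dataclass, field


class LimitExceeded(Exception):
    """Raised when a run exhausts one of its hard limits."""


@dataclass
class RunBudget:
    max_tool_calls: int = 20
    max_retries: int = 2
    max_wall_seconds: float = 120.0
    started: float = field(default_factory=time.monotonic)
    tool_calls: int = 0

    def charge_call(self) -> None:
        # Check both limits before every tool call, including retries.
        if self.tool_calls >= self.max_tool_calls:
            raise LimitExceeded("max tool calls reached")
        if time.monotonic() - self.started > self.max_wall_seconds:
            raise LimitExceeded("max wall time reached")
        self.tool_calls += 1


ALLOWED_TOOLS = {"jira.get_issue", "jira.add_comment"}  # hypothetical names


def validate_step(step: dict) -> None:
    """Deterministic checks that run before any tool is invoked."""
    if step.get("tool") not in ALLOWED_TOOLS:
        raise ValueError(f"tool not allowlisted: {step.get('tool')!r}")
    if not isinstance(step.get("args"), dict):
        raise ValueError("args must be a dict")


def execute_plan(plan: list, tools: dict, budget: RunBudget) -> list:
    """Run planner-proposed steps; the planner never calls tools directly."""
    results = []
    for step in plan:
        validate_step(step)
        for attempt in range(budget.max_retries + 1):
            budget.charge_call()
            try:
                results.append(tools[step["tool"]](**step["args"]))
                break
            except TimeoutError:
                if attempt == budget.max_retries:
                    raise
    return results
```

A policy engine such as OPA would sit alongside `validate_step`; it is left out here to keep the sketch self-contained.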

4) PERMISSIONS AND IDENTITY
- Create a dedicated identity per agent (no shared API keys).
- Apply least privilege: only the minimal scopes/endpoints required.
- Store secrets in Vault/KMS; rotate on a schedule.

5) EVALUATION SUITE (BEFORE PRODUCTION)
- Build 50–200 representative tasks with expected outcomes.
- Include edge cases: partial data, ambiguous inputs, tool timeouts, permission errors.
- Define failure taxonomy: auth, tool error, parsing, planner error, policy violation.
- Gate releases on regression thresholds (e.g., Tier 1 success rate >85%, zero Tier 3 policy violations).
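A release gate over the evaluation results might look like the sketch below. The result format (dicts with `tier`, `passed`, and `failure` keys) is an assumption, the thresholds mirror the example figures above, and Tier 3 policy violations are treated as a strict zero.

```python
from collections import Counter

# Failure taxonomy from the checklist, as labels on failed eval tasks.
FAILURE_KINDS = {"auth", "tool_error", "parsing", "planner_error", "policy_violation"}


def gate_release(results: list, min_tier1_success: float = 0.85):
    """Return (ok, report) for a list of eval results.

    Each result is assumed to look like:
    {"tier": 1, "passed": False, "failure": "parsing"}
    """
    tier1 = [r for r in results if r["tier"] == 1]
    # No Tier 1 evidence counts as a failing gate, not a passing one.
    success = sum(r["passed"] for r in tier1) / len(tier1) if tier1 else 0.0
    violations = sum(
        1 for r in results
        if r["tier"] == 3 and r.get("failure") == "policy_violation"
    )
    failures = Counter(r["failure"] for r in results if not r["passed"])
    ok = success > min_tier1_success and violations == 0
    report = {
        "tier1_success_rate": success,
        "tier3_policy_violations": violations,
        "failures_by_kind": dict(failures),
    }
    return ok, report
```

Running this in CI against the full task set, and failing the build when `ok` is false, is one way to turn the thresholds into an enforced gate.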

6) OBSERVABILITY AND AUDITABILITY
- Log: prompt/response (with redaction), tool inputs/outputs, latency per step, final state changes.
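- Attach a trace ID to every run and propagate it through each step and tool call, so any run can be reconstructed (and, with logged inputs, replayed) end-to-end.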
- Build dashboards: success rate, p95 latency, cost per success, p95 tool calls per run, top failure reasons.
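Per-step logging with redaction and a trace ID can be sketched as below. The field names, the `SENSITIVE_KEYS` set, and the email pattern are assumptions for illustration; real redaction rules depend on the data the agent handles.

```python
import json
import re
import uuid

SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}  # assumed
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def new_trace_id() -> str:
    """One ID per run, attached to every step record."""
    return uuid.uuid4().hex


def redact(value):
    """Mask sensitive keys and email addresses in nested JSON-like data."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return EMAIL.sub("[EMAIL]", value)
    return value


def step_record(trace_id, step, tool, tool_input, tool_output, latency_ms) -> str:
    """Build one JSON log line for a single tool call."""
    return json.dumps({
        "trace_id": trace_id,
        "step": step,
        "tool": tool,
        "input": redact(tool_input),
        "output": redact(tool_output),
        "latency_ms": latency_ms,
    }, sort_keys=True)
```

Emitting one such line per tool call gives the dashboards a single source for latency, tool-call counts, and failure reasons, all joinable on `trace_id`.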

7) COST AND LATENCY CONTROLS
- Implement model routing: small model for extraction/formatting, larger model for planning.
- Cache stable retrieval/tool results where freshness allows.
- Fail fast on repeated identical errors; don’t allow infinite retries.
- Track cost per successful task and cap worst-case run cost.
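The fail-fast and cost-cap controls can be combined in one per-run guard. This is a sketch under assumptions: `CostGuard` and `cost_per_success` are hypothetical names, costs are tracked in USD, and two identical errors from the same tool are enough to abort.

```python
from collections import Counter


class RunAborted(Exception):
    """Raised when a run must stop for cost or repeated-error reasons."""


class CostGuard:
    def __init__(self, max_run_cost_usd: float, max_identical_errors: int = 2):
        self.max_run_cost_usd = max_run_cost_usd
        self.max_identical_errors = max_identical_errors
        self.spent_usd = 0.0
        self._errors = Counter()

    def add_cost(self, usd: float) -> None:
        # Cap the worst-case cost of a single run.
        self.spent_usd += usd
        if self.spent_usd > self.max_run_cost_usd:
            raise RunAborted(f"run cost cap exceeded: {self.spent_usd:.2f} USD")

    def record_error(self, tool: str, message: str) -> None:
        # The same error from the same tool is unlikely to fix itself.
        self._errors[(tool, message)] += 1
        if self._errors[(tool, message)] >= self.max_identical_errors:
            raise RunAborted(f"repeated identical error from {tool}: {message}")


def cost_per_success(run_costs_usd: list, successes: int) -> float:
    """Total spend (including failed runs) divided by successful tasks."""
    return sum(run_costs_usd) / successes if successes else float("inf")
```

Counting failed runs in the numerator is deliberate: it makes retries and aborted runs visible in the cost-per-success metric.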

8) ROLLOUT PLAN
- Phase 1: shadow mode (agent proposes; human executes).
- Phase 2: automate Tier 0–1 with audit sampling.
- Phase 3: gated Tier 2+ with explicit approvals.
- Keep a kill switch and rollback mechanism for every workflow.
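The rollout phases and kill switch reduce to a single decision function consulted before every action. The sketch below is one possible reading of the phases above; the return values and the choice to leave Tier 2+ actions as proposals during Phase 2 are assumptions.

```python
from enum import Enum


class Phase(Enum):
    SHADOW = 1           # agent proposes; human executes
    AUTO_LOW_TIER = 2    # Tier 0-1 automated with audit sampling
    GATED_HIGH_TIER = 3  # Tier 2+ allowed with explicit approval


def decide(phase: Phase, tier: int, approved: bool = False,
           kill_switch: bool = False) -> str:
    """Return 'blocked', 'propose_only', 'needs_approval', or 'execute'."""
    if kill_switch:
        return "blocked"          # the kill switch overrides everything
    if phase is Phase.SHADOW:
        return "propose_only"
    if tier <= 1:
        return "execute"
    if phase is Phase.GATED_HIGH_TIER:
        return "execute" if approved else "needs_approval"
    return "propose_only"         # Phase 2: Tier 2+ stays with humans
```

Reading `kill_switch` from a flag that operators can flip without a deploy keeps the switch usable during an incident.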

9) OPERATIONS
- Write runbooks for common failure modes (tool down, auth revoked, model drift).
- Define on-call ownership for the agent service.
- Hold postmortems for any incident involving agent actions.

Exit criteria: You can prove (with data) that the agent hits success targets, stays within cost/latency budgets, and cannot execute high-risk actions without the required policy gates and approvals.