Agent Reliability Launch Checklist (2026)

Use this checklist to ship an agent that executes tools safely, stays within budget, and can be audited after incidents.

1) Define the workflow contract
- Write a one-sentence job: “The agent will ____.”
- Define success metrics (example): <1% unsafe action attempts, >90% task completion, p95 latency <8s.
- Define acceptable failure outcomes (example): “Ask a human,” “Create a ticket,” “Return ‘unknown’.”
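
One way to make the contract checkable is to write it down as data. The sketch below is illustrative only: `WorkflowContract`, `Fallback`, and the billing job sentence are hypothetical names, and it assumes the unsafe-action metric is measured as a fraction of runs. The thresholds mirror the example metrics above.

```python
from dataclasses import dataclass
from enum import Enum


class Fallback(Enum):
    """Acceptable failure outcomes (hypothetical names for the examples above)."""
    ASK_HUMAN = "ask_human"
    CREATE_TICKET = "create_ticket"
    RETURN_UNKNOWN = "return_unknown"


@dataclass(frozen=True)
class WorkflowContract:
    job: str
    max_unsafe_rate: float       # assumed: fraction of runs with an unsafe action attempt
    min_completion_rate: float   # fraction of tasks completed
    max_p95_latency_s: float
    allowed_fallbacks: frozenset

    def violations(self, unsafe_rate, completion_rate, p95_latency_s):
        """Return the names of the metrics that miss their targets."""
        failed = []
        if unsafe_rate >= self.max_unsafe_rate:
            failed.append("unsafe_rate")
        if completion_rate <= self.min_completion_rate:
            failed.append("completion_rate")
        if p95_latency_s >= self.max_p95_latency_s:
            failed.append("p95_latency_s")
        return failed


# Hypothetical job sentence; the numbers are the example targets from the checklist.
contract = WorkflowContract(
    job="The agent will resolve billing questions for existing customers.",
    max_unsafe_rate=0.01,
    min_completion_rate=0.90,
    max_p95_latency_s=8.0,
    allowed_fallbacks=frozenset(Fallback),
)
```

Calling `contract.violations(...)` with measured values returns an empty list when every target is met, which makes it easy to wire into a release gate.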

2) Map blast radius
- List every tool/action the agent can take.
- Tag each action as Read / Write / Irreversible.
- For each Write/Irreversible action, define required approvals and maximum allowed impact (e.g., refund cap $50).
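
The blast-radius map can live in code as a small registry, so that missing approvals or limits are caught by a test instead of a review. This is a minimal sketch under assumptions: the action names, approver roles, and the `unbounded_actions` helper are all hypothetical, and impact is modeled as a dollar cap only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Risk(Enum):
    READ = "read"
    WRITE = "write"
    IRREVERSIBLE = "irreversible"


@dataclass(frozen=True)
class ActionSpec:
    name: str
    risk: Risk
    approver: Optional[str] = None          # role that must approve; None means no approval defined
    max_impact_usd: Optional[float] = None  # None means no impact cap defined


# Hypothetical registry; delete_account deliberately lacks an impact cap.
REGISTRY = [
    ActionSpec("lookup_order", Risk.READ),
    ActionSpec("issue_refund", Risk.WRITE, approver="support_lead", max_impact_usd=50.0),
    ActionSpec("delete_account", Risk.IRREVERSIBLE, approver="account_admin"),
]


def unbounded_actions(registry):
    """Names of Write/Irreversible actions missing an approver or an impact cap."""
    return [
        a.name
        for a in registry
        if a.risk is not Risk.READ and (a.approver is None or a.max_impact_usd is None)
    ]
```

Here `unbounded_actions(REGISTRY)` flags `delete_account`, which has an approver but no defined limit. A real registry would likely need non-monetary impact measures as well.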

3) Build typed tools (before prompts)
- Use strict schemas (JSON Schema or function signatures).
- Enforce enums for status codes and reason codes.
- Make tools idempotent (safe retries) and add rollback where possible.
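
As a rough illustration of these three points, the sketch below uses only the standard library: a frozen dataclass as the strict schema, an `Enum` for reason codes, and an idempotency key so a retried call does not execute twice. `RefundRequest`, `RefundTool`, and the reason codes are hypothetical, and the in-memory result cache stands in for whatever durable store a real tool would use.

```python
from dataclasses import dataclass
from enum import Enum


class ReasonCode(Enum):
    DUPLICATE_CHARGE = "duplicate_charge"
    DAMAGED_ITEM = "damaged_item"


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int
    reason: ReasonCode
    idempotency_key: str

    @classmethod
    def parse(cls, raw: dict) -> "RefundRequest":
        """Validate model output strictly: exact fields, typed values, known enum."""
        expected = {"order_id", "amount_cents", "reason", "idempotency_key"}
        if set(raw) != expected:
            raise ValueError(f"unexpected or missing fields: {sorted(set(raw) ^ expected)}")
        amount = raw["amount_cents"]
        if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
            raise ValueError("amount_cents must be a positive integer")
        # ReasonCode(...) raises ValueError for values outside the enum.
        return cls(str(raw["order_id"]), amount, ReasonCode(raw["reason"]),
                   str(raw["idempotency_key"]))


class RefundTool:
    """Replays the stored result when the same idempotency key is seen again."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def run(self, req: RefundRequest) -> dict:
        if req.idempotency_key in self._results:
            return self._results[req.idempotency_key]
        self.executions += 1  # the real side effect would happen here
        result = {"status": "refunded", "order_id": req.order_id,
                  "amount_cents": req.amount_cents}
        self._results[req.idempotency_key] = result
        return result
```

Rejecting unknown fields and out-of-enum values at the tool boundary means a malformed call fails loudly before any side effect occurs.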

4) Add policy enforcement
- Implement least-privilege credentials per tool.
- Add a policy gate (Open Policy Agent or Cedar) that checks: user identity, data sensitivity tier, action type, limits.
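
To show the shape of the decision such a gate makes, here is a plain-Python stand-in. It is not OPA or Cedar and does not use their APIs; the role table, tier names, and `decide` function are assumptions made for illustration. The point is the deny-by-default structure over the four inputs listed above.

```python
from dataclasses import dataclass

# Sensitivity tiers ordered from least to most sensitive (assumed ordering).
TIER_RANK = {"public": 0, "internal": 1, "confidential": 2, "regulated": 3}


@dataclass(frozen=True)
class Request:
    user_role: str
    action_type: str      # "read", "write", or "irreversible"
    data_tier: str
    amount_usd: float = 0.0


# Hypothetical role table: highest tier, permitted action types, spend limit.
ROLE_POLICY = {
    "support_agent": {"max_tier": "internal", "actions": {"read", "write"}, "limit_usd": 50.0},
    "auditor": {"max_tier": "regulated", "actions": {"read"}, "limit_usd": 0.0},
}


def decide(req: Request):
    """Return (allowed, reason). Anything not explicitly permitted is denied."""
    policy = ROLE_POLICY.get(req.user_role)
    if policy is None:
        return False, "unknown role"
    if req.action_type not in policy["actions"]:
        return False, "action type not permitted"
    if req.data_tier not in TIER_RANK:
        return False, "unknown data tier"
    if TIER_RANK[req.data_tier] > TIER_RANK[policy["max_tier"]]:
        return False, "data tier too sensitive"
    if req.amount_usd > policy["limit_usd"]:
        return False, "amount exceeds limit"
    return True, "ok"
```

Returning a reason alongside the decision gives the audit log something to record for every allow and deny.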

5) Govern retrieval (RAG)
- Use permission-aware indexing (document ACLs).
- Add source allowlists and sensitivity tiers (Public / Internal / Confidential / Regulated).
- Store retrieval receipts: doc IDs, snippets, timestamps, identity used.
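
The following sketch shows how ACL filtering, tier limits, and receipts might fit together. It makes several simplifying assumptions: a substring match stands in for real vector or keyword search, filtering happens at query time, and `Doc`, `Receipt`, and `retrieve` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

TIER_RANK = {"public": 0, "internal": 1, "confidential": 2, "regulated": 3}


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    tier: str
    allowed_groups: frozenset   # document ACL


@dataclass(frozen=True)
class Receipt:
    doc_id: str
    snippet: str
    retrieved_at: str   # ISO 8601, UTC
    identity: str


def retrieve(query, docs, identity, groups, max_tier, receipts):
    """Return docs the caller may see that match the query; record a receipt per hit."""
    hits = []
    for doc in docs:
        if not (doc.allowed_groups & groups):
            continue  # caller is not on the document ACL
        if TIER_RANK[doc.tier] > TIER_RANK[max_tier]:
            continue  # document is more sensitive than this workflow allows
        if query.lower() in doc.text.lower():  # stand-in for real search
            hits.append(doc)
            receipts.append(Receipt(
                doc_id=doc.doc_id,
                snippet=doc.text[:80],
                retrieved_at=datetime.now(timezone.utc).isoformat(),
                identity=identity,
            ))
    return hits
```

Because the ACL and tier checks run before matching, a document the caller cannot see never reaches the model's context or the receipt log.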

6) Create an evaluation set
- Collect 200–500 real traces (anonymized) representing your core workflow.
- Label outcomes: correct, incorrect, unsafe, needs-human, ambiguous.
- Add adversarial tests: prompt injection, conflicting instructions, partial tool failures.
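
A small audit function can confirm that the evaluation set meets these criteria before anyone relies on it. This is a sketch with assumed names (`EvalCase`, `audit_eval_set`, and the adversarial tag strings); the size bounds and label set come from the bullets above.

```python
from collections import Counter
from dataclasses import dataclass

LABELS = {"correct", "incorrect", "unsafe", "needs-human", "ambiguous"}
ADVERSARIAL_TAGS = {"prompt_injection", "conflicting_instructions", "partial_tool_failure"}


@dataclass(frozen=True)
class EvalCase:
    trace_id: str
    label: str
    tags: frozenset = frozenset()


def audit_eval_set(cases, min_size=200, max_size=500):
    """Return (problems, label_counts) for a list of labeled traces."""
    problems = []
    if not min_size <= len(cases) <= max_size:
        problems.append(f"size {len(cases)} outside {min_size}-{max_size}")
    unknown = sorted({c.label for c in cases} - LABELS)
    if unknown:
        problems.append(f"unknown labels: {unknown}")
    covered = set().union(*(c.tags for c in cases))
    missing = sorted(ADVERSARIAL_TAGS - covered)
    if missing:
        problems.append(f"missing adversarial tests: {missing}")
    return problems, Counter(c.label for c in cases)
```

The label counts are returned as well, since a set that is almost entirely "correct" may pass the audit but still say little about unsafe or ambiguous behavior.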

7) Build the EvalOps pipeline
- Run regression evals in CI on every prompt, tool, model, or retrieval change.
- Track: tool-call validity, action error rate, escalation rate, cost per task, p95 latency.
- Add drift monitoring (weekly): compare current metrics to baseline.
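
The same comparison can serve both the CI regression gate and the weekly drift check: compare current metrics to a stored baseline and fail when any metric gets worse by more than a tolerance. The tolerances and metric key names below are placeholders, and treating a lower escalation rate as better is an assumption that may not hold for every workflow.

```python
# metric name -> (which direction is better, allowed worsening before failing)
METRICS = {
    "tool_call_validity": ("higher", 0.01),
    "action_error_rate": ("lower", 0.005),
    "escalation_rate": ("lower", 0.02),     # assumed: fewer escalations is better
    "cost_per_task_usd": ("lower", 0.01),
    "p95_latency_s": ("lower", 0.5),
}


def regressions(baseline, current):
    """Return {metric: delta} for metrics that worsened beyond their tolerance."""
    failed = {}
    for name, (direction, tolerance) in METRICS.items():
        delta = current[name] - baseline[name]
        worsening = -delta if direction == "higher" else delta
        if worsening > tolerance:
            failed[name] = round(delta, 6)
    return failed


baseline = {"tool_call_validity": 0.98, "action_error_rate": 0.02,
            "escalation_rate": 0.10, "cost_per_task_usd": 0.04, "p95_latency_s": 6.0}
current = {"tool_call_validity": 0.99, "action_error_rate": 0.03,
           "escalation_rate": 0.10, "cost_per_task_usd": 0.04, "p95_latency_s": 6.2}
```

In CI, a non-empty result from `regressions(baseline, current)` would fail the build; in the weekly job, it would open an alert.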

8) Budget controls
- Set max tool calls per task, max total tokens, and timeouts.
- Add a router rule: try the cheap model first; escalate to the expensive model only when the cheap model's confidence falls below a threshold.
- Create alerts for spend anomalies (daily and monthly envelopes).
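
A per-task budget and a confidence-based router might look like the sketch below. It assumes each model is a callable returning `(answer, confidence)`, which is a hypothetical interface, not any vendor's API; `TaskBudget`, `BudgetExceeded`, and `route` are likewise illustrative names.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a task goes over one of its limits."""


@dataclass
class TaskBudget:
    max_tool_calls: int
    max_tokens: int
    timeout_s: float
    tool_calls: int = 0
    tokens: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tool_calls=0, tokens=0):
        """Record usage, then raise if any limit is now exceeded."""
        self.tool_calls += tool_calls
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool call limit")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token limit")
        if time.monotonic() - self.started_at > self.timeout_s:
            raise BudgetExceeded("timeout")


def route(task, cheap_model, expensive_model, threshold=0.8):
    """Try the cheap model first; escalate only when its confidence is below threshold."""
    answer, confidence = cheap_model(task)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = expensive_model(task)
    return answer, "expensive"
```

The agent loop would call `budget.charge(...)` after every model or tool call and treat `BudgetExceeded` as one of the acceptable failure outcomes, such as escalating to a human.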

9) Observability + incident readiness
- Require trace IDs for every run.
- Log: prompts, structured outputs, tool parameters, policy decisions, retrieval receipts.
- Ship a kill switch: disable tools, downgrade to read-only mode, or route to humans.
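
The sketch below combines a per-run trace ID, structured log records, and a kill switch with the three degraded modes listed above. All names (`Mode`, `KillSwitch`, `run_tool`) are hypothetical, the log sink is a plain list standing in for a real log pipeline, and the tool execution itself is a placeholder string.

```python
import json
import uuid
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    READ_ONLY = "read_only"
    TOOLS_DISABLED = "tools_disabled"
    HUMAN_ONLY = "human_only"


class KillSwitch:
    def __init__(self):
        self.mode = Mode.NORMAL

    def allows(self, action_type):
        """Only NORMAL allows everything; READ_ONLY allows reads; other modes block tools."""
        if self.mode is Mode.NORMAL:
            return True
        if self.mode is Mode.READ_ONLY:
            return action_type == "read"
        return False


def new_trace_id():
    return str(uuid.uuid4())


def log_event(sink, trace_id, kind, payload):
    """Append one JSON record; every record carries the run's trace ID."""
    sink.append(json.dumps({"trace_id": trace_id, "kind": kind, "payload": payload},
                           sort_keys=True))


def run_tool(switch, sink, trace_id, name, action_type, params):
    allowed = switch.allows(action_type)
    log_event(sink, trace_id, "policy_decision",
              {"tool": name, "allowed": allowed, "mode": switch.mode.value})
    if not allowed:
        return None  # caller routes the task to a human instead
    log_event(sink, trace_id, "tool_call", {"tool": name, "params": params})
    return f"{name} executed"  # placeholder for the real tool call
```

Logging the decision before the call means that blocked attempts are as visible in an incident review as successful ones.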

10) Rollout plan
- Start with a shadow mode (agent proposes actions; humans execute).
- Move to gated mode (agent executes low-risk actions; humans approve high-risk).
- Only then enable full automation with SLOs and on-call ownership.
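
The three stages can be encoded as a single dispatch rule so that the current stage is a configuration value instead of scattered conditionals. This is a minimal sketch; `Stage`, `disposition`, and the two-level "low"/"high" risk split are assumptions made for illustration.

```python
from enum import Enum


class Stage(Enum):
    SHADOW = 1   # agent proposes, humans execute
    GATED = 2    # agent executes low-risk, humans approve high-risk
    FULL = 3     # full automation


def disposition(stage, risk):
    """Decide who acts on a proposed action. risk is "low" or "high"."""
    if stage is Stage.SHADOW:
        return "human_executes"
    if stage is Stage.GATED:
        return "agent_executes" if risk == "low" else "human_approves"
    return "agent_executes"
```

Moving from one stage to the next then becomes a one-line change that can itself be reviewed, logged, and rolled back.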

If you can’t produce evidence for steps 2, 6, 7, and 9, you don’t have a production agent—you have a demo.