Agentic Reliability Stack Checklist (2026)

Use this checklist before you grant an AI agent tool access in production. The goal is simple: predictable outcomes, controlled blast radius, and audit-ready operations.

1) Define the job and success metric
- Write a one-sentence job statement (e.g., “triage inbound support tickets”).
- Define a task success metric (e.g., % correctly routed) and an error budget (e.g., <1% critical policy violations).

2) Map autonomy boundaries
- List every tool/action the agent can take.
- Categorize each action as Auto / Gated / Human-required based on financial impact, reversibility, and customer visibility.

3) Design constrained tools
- Prefer narrow, typed tools (purpose-built endpoints) over generic “run SQL” or “send email” tools.
- Version tool schemas; treat breaking changes like API changes.

4) Establish agent identity
- Choose agent-as-service-account vs. agent-on-behalf-of-user.
- Enforce least privilege and short-lived credentials (minutes, not days).

5) Add policy checks outside the model
- Implement pre-execution policy gates (role, tenant, amount caps, time windows, sanctions).
- Fail closed: if any check fails, do not execute.

6) Build an eval harness
- Create a golden dataset of real tasks (50–500 to start).
- Score end-to-end success, policy compliance, and tool-call correctness.

7) Add red-team scenarios
- Include prompt injection, data exfiltration attempts, and cross-tenant access tests.
- Track exploit rate over time; require improvements before expanding permissions.

8) Instrument tracing and replay
- Log prompts, retrieved sources, tool calls, inputs/outputs, approvals, and final actions.
- Ensure you can replay any run for debugging within 24 hours.

9) Set production SLOs
- Define latency SLOs (median and p95), cost per run, cost per success, and success rate.
- Alert on regressions after model, prompt, or tool changes.

10) Control cost proactively
- Use model routing (a small model for extraction/routing; a large model only when needed).
- Cap loops, retries, and maximum tool calls per run.

11) Roll out safely
- Run in shadow mode (agent recommends; humans execute) for 2–4 weeks.
- Expand behind a feature flag: 1% → 10% → 50% → 100% as evals and SLOs hold.

12) Prepare incident response
- Assign on-call ownership.
- Implement a kill switch that disables tool execution instantly.
- Document rollback procedures for every write action.

If you can’t check off at least 10 of the 12 items, your agent is still an experiment—treat it accordingly.
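Several items on the list compose naturally in code: the fail-closed policy gate (item 5), the per-run tool-call cap (item 10), and the kill switch (item 12). Below is a minimal sketch in Python; the names `Gate`, `Policy`, `RunState`, and `check` are illustrative, not from any particular framework, and the thresholds are placeholder values.

```python
from dataclasses import dataclass
from enum import Enum

class Gate(Enum):
    AUTO = "auto"              # agent may execute directly
    GATED = "gated"            # queued for human approval
    HUMAN = "human_required"   # human executes; agent only recommends

@dataclass
class Policy:
    gates: dict                # action name -> Gate (illustrative mapping)
    amount_cap: float = 100.0  # placeholder financial cap per action
    max_tool_calls: int = 10   # per-run tool-call budget (item 10)
    kill_switch: bool = False  # global disable for tool execution (item 12)

@dataclass
class RunState:
    tool_calls: int = 0        # tool calls consumed by this run

def check(policy: Policy, state: RunState, action: str, amount: float = 0.0) -> str:
    """Pre-execution policy gate. Fails closed: any action not explicitly
    categorized is denied, and any failed check blocks execution (item 5)."""
    if policy.kill_switch:
        return "deny: kill switch engaged"
    if state.tool_calls >= policy.max_tool_calls:
        return "deny: per-run tool-call cap reached"
    gate = policy.gates.get(action)
    if gate is None:
        return "deny: unknown action (fail closed)"
    if gate is Gate.HUMAN:
        return "deny: human-required action"
    if amount > policy.amount_cap:
        return "escalate: amount exceeds cap"
    if gate is Gate.GATED:
        return "escalate: gated action needs approval"
    state.tool_calls += 1   # only allowed calls consume the budget
    return "allow"
```

The key design choice is that the gate sits outside the model: the agent proposes an action, and this check runs before any tool executes, so a prompt-injected or confused agent still cannot exceed its caps or invoke an uncategorized action.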