Agentic Ops Readiness Checklist (2026)

Use this checklist before you allow any AI agent to execute real actions (emails, refunds, deploys, data updates). Score each item as: Not started / In progress / Done.

1) Scope & success criteria
- Define one workflow with a measurable outcome (e.g., “resolve tier-1 tickets,” “open Jira issues with correct routing”).
- Write a success definition that includes quality + time + cost (e.g., Task Success Rate target, Human Intervention Rate target, Cost per Successful Task ceiling).
- Enumerate “never do” actions (e.g., delete customer data, email external without approval).

2) Identity & permissions
- Issue the agent a distinct identity (not a shared human token).
- Enforce least-privilege scopes per tool (separate roles for read-only vs write).
- Use short-lived credentials where possible; rotate secrets; log all auth failures.
- Add a tiered action policy (read-only, draft, low-risk write, high-risk write, production control).

3) Tool contracts & safety
- Define typed tool schemas (inputs/outputs) and validate all outputs.
- Make side-effect tools idempotent (idempotency keys) and support dry-run.
- Add deterministic post-checks (e.g., refund amount limits, allowed recipients, environment locks).
- Implement approval gates for high-risk actions (thresholds, two-person rule).

4) Evals & testing
- Build an offline eval set with at least 50–200 representative cases from real work.
- Add adversarial cases: ambiguous requests, missing data, policy violations, prompt injection.
- Run shadow mode on real traffic for 2–4 weeks; compare to human outcomes.
- Version prompts, tools, and policies so you can reproduce results.

5) Observability & replay
- Capture structured traces: model version, prompts, retrieved docs (or hashes), tool calls, tool responses, policy decisions, and final actions.
- Redact sensitive fields (PII, secrets) in logs; set retention (e.g., 30–365 days) by risk tier.
- Implement replay for incidents: re-run with the same context to debug.
- Monitor: Task Success Rate, Human Intervention Rate, Policy Blocks, and Cost per Successful Task.

6) Cost controls
- Set budgets by workflow (daily/weekly spend caps) and alerts for anomalies.
- Implement graceful degradation: smaller model routing, reduced retrieval depth, or require human approval under load.
- Attribute spend to outcomes (cost per resolution/merge/refund handled correctly).

7) Launch gates
- Start with Tier 0–1 autonomy (read-only/draft) before Tier 2+.
- Require SLOs before expanding scope (e.g., TSR ≥ 90%, stable CPST, declining HIR).
- Run an incident tabletop exercise (wrong email, wrong refund, infinite loop, data exposure).
- Document an “agent kill switch” and on-call ownership.

If you cannot answer: “What did the agent do, who approved it, and what did it cost?” you’re not ready for production autonomy.