Agentic Ops Readiness Checklist (2026)

Use this checklist before you allow any AI agent to execute real actions (emails, refunds, deploys, data updates). Score each item as: Not started / In progress / Done.

1) Scope & success criteria
- Define one workflow with a measurable outcome (e.g., “resolve tier-1 tickets,” “open Jira issues with correct routing”).
- Write a success definition that covers quality, time, and cost (e.g., a Task Success Rate target, a Human Intervention Rate target, a Cost per Successful Task ceiling).
- Enumerate “never do” actions (e.g., delete customer data, email external recipients without approval).

2) Identity & permissions
- Issue the agent a distinct identity (not a shared human token).
- Enforce least-privilege scopes per tool (separate roles for read-only vs. write).
- Use short-lived credentials where possible; rotate secrets; log all auth failures.
- Add a tiered action policy (read-only, draft, low-risk write, high-risk write, production control).

3) Tool contracts & safety
- Define typed tool schemas (inputs/outputs) and validate all outputs against them.
- Make side-effect tools idempotent (idempotency keys) and support dry-run.
- Add deterministic post-checks (e.g., refund amount limits, allowed recipients, environment locks).
- Implement approval gates for high-risk actions (thresholds, two-person rule).

4) Evals & testing
- Build an offline eval set of 50–200 representative cases drawn from real work.
- Add adversarial cases: ambiguous requests, missing data, policy violations, prompt injection.
- Run shadow mode on real traffic for 2–4 weeks; compare to human outcomes.
- Version prompts, tools, and policies so you can reproduce results.

5) Observability & replay
- Capture structured traces: model version, prompts, retrieved docs (or hashes), tool calls, tool responses, policy decisions, and final actions.
- Redact sensitive fields (PII, secrets) in logs; set retention (e.g., 30–365 days) by risk tier.
- Implement replay for incidents: re-run with the same context to debug.
- Monitor: Task Success Rate (TSR), Human Intervention Rate (HIR), Policy Blocks, and Cost per Successful Task (CPST).

6) Cost controls
- Set budgets by workflow (daily/weekly spend caps) and alerts for anomalies.
- Implement graceful degradation: route to a smaller model, reduce retrieval depth, or require human approval under load.
- Attribute spend to outcomes (cost per resolution/merge/refund handled correctly).

7) Launch gates
- Start with Tier 0–1 autonomy (read-only/draft) before Tier 2+.
- Require SLOs before expanding scope (e.g., TSR ≥ 90%, stable CPST, declining HIR).
- Run an incident tabletop exercise (wrong email, wrong refund, infinite loop, data exposure).
- Document an “agent kill switch” and on-call ownership.

If you cannot answer “What did the agent do, who approved it, and what did it cost?”, you’re not ready for production autonomy.
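
One way to make that closing question answerable is to route every proposed action through a single gate that applies the tier policy, the “never do” list, and idempotency keys, and writes one audit record per attempt. The Python sketch below is illustrative only: `Tier`, `ActionGate`, `AuditRecord`, `summarize`, and the action names are hypothetical, the tier numbering is an assumption extrapolated from “Tier 0–1 (read-only/draft)”, and a real deployment would persist the log and the seen keys rather than hold them in memory.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    # Assumed numbering: the checklist only states that Tier 0-1 is read-only/draft.
    READ_ONLY = 0
    DRAFT = 1
    LOW_RISK_WRITE = 2
    HIGH_RISK_WRITE = 3
    PRODUCTION_CONTROL = 4


@dataclass
class AuditRecord:
    action: str                 # what the agent tried to do
    tier: Tier
    decision: str               # "executed", "blocked", or "duplicate"
    approved_by: Optional[str]  # who approved it, if anyone
    cost_usd: float             # spend attributed to this attempt


@dataclass
class ActionGate:
    max_autonomous_tier: Tier = Tier.DRAFT
    never_do: frozenset = frozenset({"delete_customer_data"})  # hypothetical list
    log: list = field(default_factory=list)
    _seen_keys: set = field(default_factory=set)

    def submit(self, action, tier, idempotency_key, cost_usd, approved_by=None):
        """Decide whether an action may run and record the decision."""
        if action in self.never_do:
            decision = "blocked"        # never allowed, even with approval
        elif idempotency_key in self._seen_keys:
            decision = "duplicate"      # already executed once; do not repeat
        elif tier > self.max_autonomous_tier and approved_by is None:
            decision = "blocked"        # above the autonomy ceiling, no approver
        else:
            decision = "executed"
            self._seen_keys.add(idempotency_key)
        record = AuditRecord(action, tier, decision, approved_by, cost_usd)
        self.log.append(record)
        return record


def summarize(log):
    """Aggregate the audit log into counts and total spend."""
    return {
        "executed": sum(r.decision == "executed" for r in log),
        "blocked": sum(r.decision == "blocked" for r in log),
        "total_cost_usd": sum(r.cost_usd for r in log),
    }


if __name__ == "__main__":
    gate = ActionGate()
    gate.submit("draft_reply", Tier.DRAFT, "ticket-1", 0.02)
    gate.submit("issue_refund", Tier.HIGH_RISK_WRITE, "refund-7", 0.05)
    gate.submit("issue_refund", Tier.HIGH_RISK_WRITE, "refund-7", 0.05,
                approved_by="on-call-lead")
    for rec in gate.log:
        print(rec)
    print(summarize(gate.log))
```

Blocked and duplicate attempts are logged as well as executed ones, so the record can feed the Policy Blocks metric and spend attribution from the same source.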