AgentOps Readiness Checklist (2026)

Use this checklist before you increase autonomy (e.g., allowing write actions, customer-facing decisions, or unattended execution). Score each section as: Not Started / Partial / Done. If any “Safety & Governance” item is Not Started, do not expand autonomy.

1) Workflow & ROI Definition
- Define one primary workflow (not a general assistant). Document start and end states.
- Choose 2–3 business KPIs (e.g., deflection %, handle-time reduction in minutes, CSAT delta).
- Define “high-severity error” examples with real business impact (privacy leak, wrong refund, wrong account change).
- Set an initial error budget (e.g., <=1% high-severity failures; <=5% medium-severity failures).

2) Data & Grounding
- Identify authoritative sources (owned docs, contracts, product specs). Name the “source of truth.”
- Implement citations or evidence links for any factual or policy claim.
- Add content-freshness rules (e.g., do not use docs older than X months for pricing or policy).
- Document what data is explicitly excluded (HR, regulated data, PII categories).

3) Tooling & Permissions (Least Privilege)
- Separate read-only tools from write tools.
- Enforce structured tool inputs (JSON Schema) and server-side validation.
- Require idempotency keys for write actions to prevent duplicate execution.
- Implement scoped, short-lived credentials; map access to IAM roles.
- Add approval tokens for high-impact actions (bulk email, refunds over $X, account deletions).

4) Evaluation & Testing
- Create a golden set of 100–500 labeled tasks from real traffic.
- Add adversarial tests: prompt injection, jailbreak attempts, tool misuse, ambiguous instructions.
- Track metrics weekly: task success rate, citation correctness, policy adherence, escalation precision.
- Gate releases in CI: do not ship if regression exceeds your threshold (e.g., -2% task success).
5) Observability & Incident Response
- Implement tracing across: prompt → retrieval → model calls → tool calls → final response.
- Log: model/version, prompt-template version, retrieved doc IDs, tool parameters (redacted), latency, cost estimate.
- Add replay tooling for failed runs (same inputs plus same tool outputs when possible).
- Define incident runbooks: rollback, disable write tools, switch to safe mode, notify security.

6) Cost & Latency Control
- Set budgets: cost per task (e.g., <$0.25) and latency (e.g., P95 < 10s for interactive use).
- Implement routing: a cheap model for extraction/classification; an expensive model for complex reasoning.
- Add caching where safe (retrieval results, embeddings, deterministic transforms).
- Monitor usage anomalies (sudden token spikes, tool-call loops, repeated retries).

7) Human-in-the-Loop Operations
- Define escalation triggers (low confidence, missing citations, policy conflict, tool failure).
- Provide operators with a clear UI: traces, sources, tool calls, and a “why” justification.
- Create feedback capture: thumbs up/down + reason codes + corrected answer.
- Schedule audits (e.g., 50 random cases/week) and track the audit pass rate.

Exit Criteria (Ready to Expand Autonomy)
- The eval suite covers top intents and top risks, passing at target thresholds for 2 consecutive weeks.
- Observability supports end-to-end replay and root-cause analysis within 30 minutes.
- Write actions are protected by policy checks and approvals; audit logs are complete.
- Cost/latency dashboards exist; routing is live; budgets are consistently met.

If you meet the exit criteria, expand autonomy in one dimension at a time (scope, permissions, or volume), never all three at once.
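The escalation triggers in section 7 can be expressed as a single predicate that gates whether a drafted response ships or goes to a human, with a reason code that feeds the audit and feedback loops. A minimal sketch, assuming a response dict with "confidence", "citations", "policy_conflict", and "tool_error" fields and a 0.6 confidence threshold; all of these names and values are illustrative.

```python
def should_escalate(response: dict) -> tuple[bool, str]:
    """Return (escalate?, reason code) for a drafted agent response.

    Checks are ordered by severity; the first matching trigger wins.
    """
    if response.get("tool_error"):
        return True, "tool_failure"
    if response.get("policy_conflict"):
        return True, "policy_conflict"
    if not response.get("citations"):
        return True, "missing_citations"
    # A missing confidence score is treated as low confidence (assumption).
    if response.get("confidence", 0.0) < 0.6:
        return True, "low_confidence"
    return False, "ok"
```

Returning a reason code (rather than a bare boolean) lets you measure escalation precision per trigger, one of the weekly metrics in section 4.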