AgentOps Readiness Checklist (2026)

Use this checklist before you increase autonomy (e.g., allowing write actions, customer-facing decisions, or unattended execution). Score each section as: Not Started / Partial / Done. If any “Safety & Governance” item is Not Started, do not expand autonomy.

1) Workflow & ROI Definition
- Define one primary workflow (not a general assistant). Document start and end states.
- Choose 2–3 business KPIs (e.g., deflection %, handle-time reduction in minutes, CSAT delta).
- Define “high-severity error” examples with real business impact (privacy leak, wrong refund, wrong account change).
- Set an initial error budget (e.g., <=1% high-severity failures; <=5% medium-severity failures).

2) Data & Grounding
- Identify authoritative sources (owned docs, contracts, product specs). Name the “source of truth.”
- Implement citations or evidence links for any factual or policy claim.
- Add content-freshness rules (e.g., do not use docs older than X months for pricing or policy).
- Document what data is explicitly excluded (HR, regulated data, PII categories).

3) Tooling & Permissions (Least Privilege)
- Separate read-only tools from write tools.
- Enforce structured tool inputs (JSON Schema) and server-side validation.
- Require idempotency keys for write actions to prevent duplicate execution.
- Implement scoped, short-lived credentials; map access to IAM roles.
- Add approval tokens for high-impact actions (bulk email, refunds over $X, account deletions).

4) Evaluation & Testing
- Create a golden set of 100–500 labeled tasks from real traffic.
- Add adversarial tests: prompt injection, jailbreak attempts, tool misuse, ambiguous instructions.
- Track metrics weekly: task success rate, citation correctness, policy adherence, escalation precision.
- Gate releases in CI: do not ship if regression exceeds your threshold (e.g., -2% task success).
5) Observability & Incident Response
- Implement tracing across: prompt → retrieval → model calls → tool calls → final response.
- Log: model/version, prompt-template version, retrieved doc IDs, tool parameters (redacted), latency, cost estimate.
- Add replay tooling for failed runs (same inputs plus same tool outputs when possible).
- Define incident runbooks: rollback, disable write tools, switch to safe mode, notify security.

6) Cost & Latency Control
- Set budgets: cost per task (e.g., <$0.25) and latency (e.g., P95 < 10s for interactive use).
- Implement routing: a cheap model for extraction/classification; an expensive model for complex reasoning.
- Add caching where safe (retrieval results, embeddings, deterministic transforms).
- Monitor usage anomalies (sudden token spikes, tool-call loops, repeated retries).

7) Human-in-the-Loop Operations
- Define escalation triggers (low confidence, missing citations, policy conflict, tool failure).
- Provide operators with a clear UI: traces, sources, tool calls, and a “why” justification.
- Create feedback capture: thumbs up/down + reason codes + corrected answer.
- Schedule audits (e.g., 50 random cases/week) and track the audit pass rate.

Exit Criteria (Ready to Expand Autonomy)
- The eval suite covers top intents and top risks, passing at target thresholds for 2 consecutive weeks.
- Observability supports end-to-end replay and root-cause analysis within 30 minutes.
- Write actions are protected by policy checks and approvals; audit logs are complete.
- Cost/latency dashboards exist; routing is live; budgets are consistently met.

If you meet the exit criteria, expand autonomy in one dimension at a time (scope, permissions, or volume), never all three at once.
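The escalation triggers in section 7 can be expressed as a single predicate that gates whether a drafted response ships or goes to a human, with a reason code that feeds the audit and feedback loops. A minimal sketch, assuming a response dict with "confidence", "citations", "policy_conflict", and "tool_error" fields and a 0.6 confidence threshold; all of these names and values are illustrative.

```python
def should_escalate(response: dict) -> tuple[bool, str]:
    """Return (escalate?, reason code) for a drafted agent response.

    Checks are ordered by severity; the first matching trigger wins.
    """
    if response.get("tool_error"):
        return True, "tool_failure"
    if response.get("policy_conflict"):
        return True, "policy_conflict"
    if not response.get("citations"):
        return True, "missing_citations"
    # A missing confidence score is treated as low confidence (assumption).
    if response.get("confidence", 0.0) < 0.6:
        return True, "low_confidence"
    return False, "ok"
```

Returning a reason code (rather than a bare boolean) lets you measure escalation precision per trigger, one of the weekly metrics in section 4.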