AgentOps Readiness Checklist (2026)

Use this checklist to ship one production agent workflow safely. Score each line as: Not Started / In Progress / Done.

1) Workflow Definition
- Define the workflow boundary in one sentence (input, output, and system of record).
- List “must-never-happen” failures (e.g., refund > $500, emailing wrong customer, deleting data).
- Define a human fallback path and the maximum time before escalation (e.g., 60 seconds).

2) Tooling & Permissions
- Inventory tools the agent can call; separate read-only tools from write tools.
- Implement an allowlist: only approved tools are callable in production.
- Enforce least privilege: scoped OAuth tokens, short-lived credentials, no raw API keys in prompts.
- Add schema validation (JSON Schema/OpenAPI) for every tool request and response.
- Add hard budgets: max tool calls per run, max runtime, and idempotency keys for write actions.

3) Observability & Audit
- Capture end-to-end traces per run: model calls, tool calls, retrieval steps, and policy decisions.
- Log prompt/version identifiers (prompt hash, model version, routing policy version).
- Store tool inputs/outputs with redaction; confirm PII is not retained in plaintext logs.
- Implement replay for incidents: reconstruct a run with the same inputs and versions.
- Set data retention (e.g., 7–30 days for traces; longer only for audit requirements).

4) Evaluation Gates
- Build an offline eval set from real historical cases (start with 200; grow to 1,000+).
- Define pass/fail metrics: success rate, policy compliance rate, and escalation rate.
- Add regression checks to CI/CD: block deploy if core metrics degrade beyond thresholds.
- Run canary testing (e.g., 5% traffic) and compare against baseline before ramp.

5) Reliability Engineering
- Fail closed by default: if parsing fails, tool errors occur, or confidence is low—escalate.
- Add timeouts and retries with backoff; ensure retries are safe (idempotent writes).
- Implement rate limits per tenant and per workflow to prevent runaway loops.
- Track p95 latency and set an explicit target (e.g., <8 seconds for a customer-facing action).

6) Cost & Unit Economics
- Measure cost per successful completion (CPSC) per workflow.
- Implement model routing: cheap model for triage/extraction; premium model for complex reasoning.
- Add caching for retrieval results and deterministic sub-steps.
- Set budget alerts (daily/weekly) and define an automatic “degrade mode” (e.g., switch to cheaper model).

7) Security & Compliance
- Run a threat model focused on tool abuse, data exfiltration, and prompt injection.
- Require approvals for high-risk actions (e.g., payouts, permission changes, production deploys).
- Perform quarterly access reviews for tools and service accounts used by agents.
- Prepare an audit packet: architecture diagram, logging/redaction approach, and incident response steps.

If you can check “Done” on at least 80% of these items for one workflow, you’re ready to scale to the next workflow with the same platform patterns.
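
The controls in sections 2 and 5 (allowlist, schema validation, hard budgets, idempotency keys, fail closed) can be enforced by one gate that every tool request passes through before execution. The Python below is an added example, not part of the original checklist; the tool names, the budget of 3 calls, and the ToolGate and Escalate names are assumptions, and the type check is a dependency-free stand-in for JSON Schema validation.

```python
import hashlib
import json

# Hypothetical tool registry: name -> (is_write, required argument types).
# A stand-in for JSON Schema/OpenAPI validation, kept dependency-free.
ALLOWLIST = {
    "lookup_order": (False, {"order_id": str}),
    "issue_refund": (True, {"order_id": str, "amount_usd": float}),
}
MAX_TOOL_CALLS = 3          # hard budget per run (constructed value)
REFUND_LIMIT_USD = 500.0    # "must-never-happen" threshold from the checklist

class Escalate(Exception):
    """Raised to fail closed and hand the run to a human."""

class ToolGate:
    def __init__(self, run_id):
        self.run_id = run_id
        self.calls = 0
        self.seen_keys = set()   # idempotency keys of writes already approved

    def authorize(self, raw_request):
        """Return (tool, args, idempotency_key) or raise Escalate."""
        self.calls += 1
        if self.calls > MAX_TOOL_CALLS:
            raise Escalate("tool-call budget exceeded")
        try:
            req = json.loads(raw_request)
            tool, args = req["tool"], req["args"]
        except (ValueError, KeyError, TypeError):
            raise Escalate("unparseable tool request")
        if tool not in ALLOWLIST:
            raise Escalate(f"tool not on allowlist: {tool!r}")
        is_write, schema = ALLOWLIST[tool]
        if not isinstance(args, dict) or set(args) != set(schema):
            raise Escalate("arguments do not match schema")
        for name, typ in schema.items():
            if type(args[name]) is not typ:   # exact type: bool is not a number
                raise Escalate(f"wrong type for {name}")
        if tool == "issue_refund" and not 0 < args["amount_usd"] <= REFUND_LIMIT_USD:
            raise Escalate("refund outside permitted range")
        key = None
        if is_write:   # same run + tool + args always yields the same key
            canon = json.dumps([self.run_id, tool, args], sort_keys=True)
            key = hashlib.sha256(canon.encode()).hexdigest()
            if key in self.seen_keys:
                raise Escalate("duplicate write suppressed")
            self.seen_keys.add(key)
        return tool, args, key
```

The returned key is meant to be passed to the downstream write API as its idempotency key, so a retry after a timeout cannot apply the same write twice.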