AgentOps Launch Checklist (2026)

Use this checklist to launch an AI agent that can take real actions (tickets, refunds, CRM changes, deployments) without creating reliability, security, or cost incidents.

1) Define the “agent run” contract
- What starts a run (user request, webhook, cron)?
- What ends a run (success, human handoff, safe failure)?
- Required outputs: action type, parameters, reason, and confidence/uncertainty flag.
- Hard budgets per run: max tool calls, max tokens, max wall time.
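
One way to make the contract enforceable is to put the required outputs and the hard budgets in code that every run passes through. The sketch below is a minimal illustration, not a reference implementation: the names RunResult, RunBudget, RunMeter, and BudgetExceeded, and the default limits, are all assumptions made for this example.

```python
import time
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a run crosses one of its hard limits."""


@dataclass
class RunResult:
    # The required outputs from the contract (field names are hypothetical).
    action_type: str
    parameters: dict
    reason: str
    uncertain: bool


@dataclass
class RunBudget:
    # Illustrative limits only; pick values from your own cost and latency data.
    max_tool_calls: int = 20
    max_tokens: int = 50_000
    max_wall_seconds: float = 120.0


class RunMeter:
    """Tracks usage for one run and raises as soon as a budget is exceeded."""

    def __init__(self, budget, clock=time.monotonic):
        self.budget = budget
        self.clock = clock
        self.started = clock()
        self.tool_calls = 0
        self.tokens = 0

    def charge(self, tool_calls=0, tokens=0):
        self.tool_calls += tool_calls
        self.tokens += tokens
        self.check()

    def check(self):
        if self.tool_calls > self.budget.max_tool_calls:
            raise BudgetExceeded("tool calls")
        if self.tokens > self.budget.max_tokens:
            raise BudgetExceeded("tokens")
        if self.clock() - self.started > self.budget.max_wall_seconds:
            raise BudgetExceeded("wall time")
```

Call charge() after every model or tool step, and treat BudgetExceeded as a safe failure or human handoff rather than retrying.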

2) Build a scenario bank (your real edge cases)
- Collect 200–1,000 historical items (tickets, emails, tasks).
- Label outcomes: correct action, correct handoff, incorrect action, policy violation.
- Include red-team cases: prompt injection, data exfiltration attempts, policy bypass.
- Version the dataset and keep it immutable per release.
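
A simple way to keep a release's dataset immutable is to record a content hash and fail the eval run if it changes. The sketch below assumes each scenario is a JSON-serializable dict with hypothetical "id" and "label" keys; the label names mirror the outcomes listed above but are otherwise an illustration.

```python
import hashlib
import json

# Hypothetical label vocabulary, mirroring the outcome categories above.
LABELS = {"correct_action", "correct_handoff", "incorrect_action", "policy_violation"}


def dataset_fingerprint(scenarios):
    """Return a SHA-256 hex digest that is stable across scenario ordering."""
    for scenario in scenarios:
        if scenario["label"] not in LABELS:
            raise ValueError(f"unknown label: {scenario['label']!r}")
    # Sort by id and serialize with sorted keys so the bytes are canonical.
    ordered = sorted(scenarios, key=lambda s: s["id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Store the digest alongside the release tag; a mismatch at eval time means someone edited the bank after it was frozen.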

3) Implement guardrails that don’t depend on “the model behaving”
- Least-privilege credentials per tool action (avoid broad app access).
- Allowlist tools + parameters; block free-form execution (e.g., raw SQL, unrestricted email sending).
- Add approval gates for high-impact actions (e.g., refunds > $200, account closure).
- Separate “draft” vs “execute” modes.
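
The guardrails above can be sketched as a single authorization check that runs outside the model, before any tool is invoked. Everything here is an assumption for illustration: the tool names, the parameter sets, the authorize function, and its return values. The $200 threshold is the example figure from the list above.

```python
# Hypothetical allowlist: tool name -> exact set of permitted parameter names.
ALLOWED_TOOLS = {
    "issue_refund": {"order_id", "amount_usd"},
    "update_crm_field": {"record_id", "field", "value"},
}

# Example threshold taken from the checklist; tune to your own policy.
REFUND_APPROVAL_THRESHOLD_USD = 200


def authorize(tool, params, mode="draft", approved=False):
    """Decide what may happen with a proposed tool call.

    Returns one of: "blocked", "draft_only", "needs_approval", "execute".
    """
    if tool not in ALLOWED_TOOLS:
        return "blocked"  # unknown or free-form tools never run
    if set(params) != ALLOWED_TOOLS[tool]:
        return "blocked"  # missing or unexpected parameters
    if mode != "execute":
        return "draft_only"  # draft mode never performs side effects
    if (
        tool == "issue_refund"
        and params["amount_usd"] > REFUND_APPROVAL_THRESHOLD_USD
        and not approved
    ):
        return "needs_approval"
    return "execute"
```

Defaulting mode to "draft" means a caller has to opt in to side effects explicitly, which is the safer failure direction.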

4) Create an evaluation harness (release-blocking)
- Offline evals: completion rate, policy compliance rate, hallucination rate, tool misuse rate.
- Cost evals: P50 and P95 cost per run; alert on cost regressions.
- Latency evals: P50 and P95 time-to-complete; break down by step.
- Define go/no-go thresholds (example: 0 policy violations; <0.5% hard failures).
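
A release gate over offline eval results might look like the sketch below. It assumes each result is a dict with hypothetical keys "policy_violation", "hard_failure", and "cost_usd", and it uses a nearest-rank percentile, which is one of several common percentile definitions. The default thresholds are the example values from the list above.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def release_gate(results, max_policy_violations=0, max_hard_failure_rate=0.005):
    """Summarize eval results and return a go/no-go decision."""
    total = len(results)
    violations = sum(1 for r in results if r["policy_violation"])
    hard_failures = sum(1 for r in results if r["hard_failure"])
    costs = [r["cost_usd"] for r in results]
    failure_rate = hard_failures / total
    return {
        "policy_violations": violations,
        "hard_failure_rate": failure_rate,
        "cost_p50": percentile(costs, 50),
        "cost_p95": percentile(costs, 95),
        "go": violations <= max_policy_violations
        and failure_rate < max_hard_failure_rate,
    }
```

Wire the "go" field into CI so a failing gate blocks the release instead of producing a report someone has to remember to read.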

5) Observability and audit
- Trace every run: inputs, retrieval hits, tool calls, outputs, timestamps, costs.
- Log every action with: actor=agent, permission scope, policy version, approval record.
- Redact/avoid restricted fields in logs (PII/PHI) and define retention windows.
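
As a rough sketch of the action log, the function below builds one JSON audit record per action and redacts restricted fields before anything is written. The field names, the RESTRICTED_FIELDS set, and the audit_record signature are assumptions for this example; key-name matching alone will not catch PII embedded in free text.

```python
import json
from datetime import datetime, timezone

# Hypothetical set of keys that must never appear in logs in clear text.
RESTRICTED_FIELDS = {"email", "ssn", "diagnosis"}


def redact(value):
    """Recursively replace values stored under restricted keys."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key in RESTRICTED_FIELDS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def audit_record(run_id, action, params, permission_scope, policy_version,
                 approval_id=None):
    """Return one audit log line as a JSON string."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "actor": "agent",
        "action": action,
        "params": redact(params),
        "permission_scope": permission_scope,
        "policy_version": policy_version,
        "approval_id": approval_id,  # None when no approval was required
    }
    return json.dumps(record, sort_keys=True)
```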

6) Rollout plan
- Shadow mode: run the agent on live traffic without executing actions; compare its proposed actions with what humans actually did.
- Canary: 1–5% live execution with fast rollback.
- Ramp: 25% → 50% → 100%, advancing only after 1–2 weeks of meeting SLOs.
- Incident playbook: on-call owner, rollback switch, postmortem template.
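
Shadow, canary, and ramp can share one routing function if you bucket runs deterministically. The sketch below hashes a stable run key into 100 buckets; the function name, the "shadow"/"execute" return values, and the kill_switch parameter are assumptions for illustration. In practice the percentage and the switch would come from a config store you can change without a deploy.

```python
import hashlib


def rollout_mode(run_key, live_percent, kill_switch=False):
    """Return "execute" for the live slice of traffic, "shadow" otherwise.

    run_key should be stable per customer or ticket so that the same entity
    stays in the same bucket as the percentage ramps up.
    """
    if kill_switch:
        return "shadow"  # rollback: stop executing, keep observing
    digest = hashlib.sha256(run_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # 0..99
    return "execute" if bucket < live_percent else "shadow"
```

Because the bucket depends only on the key, raising live_percent from 5 to 25 adds new keys to the live slice without removing any that were already in it.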

7) Continuous improvement loop
- Weekly: review top failure clusters from traces; add them to the scenario bank.
- Monthly: re-run red-team suite; update allowlists and approval rules.
- Quarterly: revisit model routing and budgets; renegotiate vendor pricing if needed.
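
For the weekly review, even a plain count of failed runs grouped by step and error type gives a usable first cut at "failure clusters". The sketch below assumes each trace is a dict with hypothetical "status", "failed_step", and "error_type" keys; real clustering may need richer signals than these two fields.

```python
from collections import Counter


def top_failure_clusters(traces, k=3):
    """Return the k most common (failed_step, error_type) pairs with counts."""
    counts = Counter(
        (trace["failed_step"], trace["error_type"])
        for trace in traces
        if trace["status"] == "failed"
    )
    return counts.most_common(k)
```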

If you can’t answer “what happened in this bad run?” within 2 minutes using a trace, you’re not ready to scale the agent.