AGENTIC OPS LAUNCH CHECKLIST (2026)

Goal: Ship an AI agent that takes actions in real systems with predictable unit economics, measurable reliability, and auditable controls.

1) SCOPE & SUCCESS METRICS
- Define ONE primary workflow (e.g., “password reset end-to-end”, “invoice status requests”).
- Success metric: % of tasks completed without human action (set an initial target such as 60–75%).
- Safety metric: % of uncertain tasks that escalate correctly (target 100% for high-risk cases).
- Quality metric: escalation packet value, measured as minutes saved for the human per escalation (target 3–5 minutes).
- Cost metric: cost per successful outcome (set a target like <$0.50 per L1 ticket or <$1 per intake).
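
The four metrics above can be computed from per-task records. The following is a minimal sketch, assuming a hypothetical TaskRecord shape; the field names are illustrative and should be mapped to whatever your ticketing system actually logs.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    # All fields are hypothetical; map them to what your system records.
    completed_without_human: bool
    high_risk: bool
    uncertain: bool
    escalated: bool
    minutes_saved: float  # human time saved by the escalation packet, if escalated
    cost_usd: float


def summarize(records):
    """Return the four launch metrics for a list of TaskRecord."""
    total = len(records)
    successes = sum(r.completed_without_human for r in records)
    must_escalate = [r for r in records if r.high_risk and r.uncertain]
    escalations = [r for r in records if r.escalated]
    total_cost = sum(r.cost_usd for r in records)
    return {
        "success_rate": successes / total if total else 0.0,
        # Vacuously 1.0 when no high-risk uncertain task occurred.
        "high_risk_escalation_rate": (
            sum(r.escalated for r in must_escalate) / len(must_escalate)
            if must_escalate else 1.0
        ),
        "mean_minutes_saved": (
            sum(r.minutes_saved for r in escalations) / len(escalations)
            if escalations else 0.0
        ),
        # Cost of ALL runs divided by successes, so failed runs are not free.
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }
```

Dividing total spend by successful outcomes, rather than by all runs, keeps failed and escalated runs from flattering the cost figure.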

2) TOOLING & PERMISSIONS (LEAST PRIVILEGE)
- Create a dedicated agent identity (service account/service principal), not shared credentials.
- Start read-only: allow “get/list” tools before any “create/update/delete.”
- For write tools: require idempotency keys and deterministic validation.
- Add explicit approval gates for money movement, external messages, access changes, and deletions.
- Maintain an allowlist of tools and data domains the agent can access.
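
One way to enforce the allowlist, the read-only default, approval gates, and idempotency keys is a single gateway that every tool call passes through. The sketch below is illustrative only: the tool names, the ToolGateway class, and the in-memory result store are assumptions, and a production version would persist keys in durable storage.

```python
import hashlib
import json

# Hypothetical tool names for a password-reset workflow.
READ_TOOLS = {"get_user", "list_tickets"}
WRITE_TOOLS = {"reset_password"}
NEEDS_APPROVAL = {"reset_password"}  # an access change, so it is gated


class ToolGateway:
    """Single choke point between the agent and real systems."""

    def __init__(self, tools, allow_writes=False):
        self._tools = tools              # tool name -> callable
        self._allow_writes = allow_writes
        self._completed = {}             # idempotency key -> stored result

    @staticmethod
    def idempotency_key(run_id, tool, args):
        # Same run, tool, and arguments always produce the same key.
        payload = json.dumps([run_id, tool, args], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, run_id, tool, args, approved=False):
        if tool not in READ_TOOLS | WRITE_TOOLS:
            raise PermissionError(f"tool not on allowlist: {tool}")
        if tool in READ_TOOLS:
            return self._tools[tool](**args)
        if not self._allow_writes:
            raise PermissionError(f"writes disabled: {tool}")
        if tool in NEEDS_APPROVAL and not approved:
            raise PermissionError(f"approval required: {tool}")
        key = self.idempotency_key(run_id, tool, args)
        if key not in self._completed:   # a retry replays the stored result
            self._completed[key] = self._tools[tool](**args)
        return self._completed[key]
```

Starting with allow_writes=False gives the read-only phase for free; enabling writes later is a one-line, reviewable change.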

3) DATA GOVERNANCE
- Redact or mask PII by default in logs and traces.
- Define retention: e.g., 30 days for traces containing user text unless compliance requires longer.
- Restrict retrieval: treat retrieved and other untrusted content strictly as data, so it cannot alter system instructions.
- Document data residency needs (EU/US), and confirm vendors meet them.
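
Redaction by default is easiest to guarantee when the logging path has no unredacted variant. A minimal sketch follows; the two regular expressions are illustrative assumptions, not a complete PII detector, and real deployments typically need broader coverage (names, addresses, locale-specific identifiers).

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d{9,16}\b")  # phone, account, or card-like digit runs


def redact(text):
    """Mask emails and long digit runs before text reaches logs or traces."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


def log_record(run_id, user_text):
    # Redaction is the only path; there is no raw variant to call by mistake.
    return {"run_id": run_id, "user_text": redact(user_text)}
```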

4) RELIABILITY ENGINEERING
- Enforce hard budgets: max tool calls, max tokens, max wall-clock time; on breach, stop + escalate.
- Build retry logic for tool calls with backoff; add compensation logic for partial failures.
- Implement typed schemas for all tool inputs/outputs; reject malformed or schema-violating JSON deterministically.
- Create a runbook: known failure modes, on-call actions, rollback procedure.
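
Hard budgets and retries interact: each retry should count against the budget, otherwise a flaky tool can quietly exhaust it. The sketch below assumes hypothetical RunBudget, BudgetExceeded, and TransientToolError names; it omits jitter and compensation logic for brevity.

```python
import time


class BudgetExceeded(Exception):
    """Raised when a run must stop and escalate."""


class TransientToolError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, 5xx)."""


class RunBudget:
    def __init__(self, max_tool_calls, max_tokens, max_seconds, clock=time.monotonic):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.tool_calls = 0
        self.tokens = 0

    def charge(self, tool_calls=0, tokens=0):
        """Record usage and raise as soon as any limit is breached."""
        self.tool_calls += tool_calls
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("max tool calls")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("max tokens")
        if self._clock() - self._started > self.max_seconds:
            raise BudgetExceeded("max wall-clock time")


def call_with_retry(fn, budget, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry transient failures with exponential backoff; every attempt is charged."""
    for attempt in range(attempts):
        budget.charge(tool_calls=1)
        try:
            return fn()
        except TransientToolError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The agent loop would catch BudgetExceeded at the top level and hand the run to a human with the trace attached, which is the "stop + escalate" behavior described above.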

5) EVALS & TESTING
- Collect 200–500 anonymized real cases (“golden traces”).
- Include adversarial cases (prompt injection, conflicting instructions, ambiguous requests).
- Run eval suite in CI on every prompt/tool change; define pass/fail thresholds.
- Add canary releases: ship to 1–5% traffic first; monitor regressions.
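
A pass/fail gate for the eval suite can be a small function that CI calls after the golden traces have been replayed. This is a sketch under assumptions: the category names and threshold values are hypothetical, and results are assumed to arrive as (category, passed) pairs from whatever eval runner is in use.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (category, passed). Returns rate per category plus 'overall'."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for category, passed in results:
        for key in (category, "overall"):
            counts[key][0] += int(passed)
            counts[key][1] += 1
    return {key: p / t for key, (p, t) in counts.items()}


def gate(results, thresholds):
    """Return (ok, rates, failures). A category with no cases counts as failing."""
    rates = pass_rates(results)
    failures = [
        key for key, minimum in thresholds.items()
        if rates.get(key, 0.0) < minimum
    ]
    return (not failures), rates, failures
```

In CI, the job would exit nonzero when ok is False. For example, thresholds of {"overall": 0.90, "adversarial": 1.0} block a release if any adversarial case regresses, even when the overall rate still looks healthy.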

6) OBSERVABILITY
- Trace every agent run: prompt versions, tool calls, tool latency, retries, final decision.
- Dashboards: success rate, escalation rate, mean tool calls, mean tokens, cost per outcome.
- Alerting: sudden spikes in retries, tool failures, or cost per outcome; any detected loop.
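
A trace record that carries the fields listed above also makes loop detection straightforward. The following is a minimal sketch; the RunTrace and ToolCall shapes are assumptions, and "loop" is defined here simply as the same tool being called with identical arguments more than a set number of times.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict
    latency_ms: float
    retries: int = 0


@dataclass
class RunTrace:
    run_id: str
    prompt_version: str
    tool_calls: list = field(default_factory=list)
    tokens: int = 0
    cost_usd: float = 0.0
    final_decision: str = ""

    def total_retries(self):
        return sum(c.retries for c in self.tool_calls)

    def looks_like_loop(self, max_repeats=3):
        """True if any identical (tool, args) call appears more than max_repeats times."""
        counts = Counter(
            (c.name, json.dumps(c.args, sort_keys=True)) for c in self.tool_calls
        )
        return any(n > max_repeats for n in counts.values())
```

Dashboards can aggregate tokens, cost_usd, and len(tool_calls) across traces, while looks_like_loop feeds the alert directly.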

7) ROLLOUT PLAN
- Shadow mode first: agent proposes actions, human executes; measure accuracy + time saved.
- Limited autonomy next: enable low-risk writes behind approvals.
- Expand scope one tool/permission at a time; rerun evals and repeat the security review before each expansion.
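
The rollout stages can be encoded as an explicit policy so that the current autonomy level is a reviewable value, not an implicit property of the prompt. This sketch is hypothetical: the Stage enum, the risk labels, and the assumption that reads are allowed at every stage are all illustrative choices.

```python
from enum import Enum


class Stage(Enum):
    SHADOW = "shadow"
    LIMITED = "limited"


def disposition(stage, is_write, risk):
    """Return 'execute', 'require_approval', or 'propose_only' for one action."""
    if not is_write:
        return "execute"            # assumption: reads are allowed at every stage
    if stage is Stage.LIMITED and risk == "low":
        return "require_approval"   # low-risk writes run only after sign-off
    return "propose_only"           # human executes; agent output is a suggestion
```

Adding a later stage then means adding one enum member and one branch, which keeps each scope expansion small enough to review on its own.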

8) POST-LAUNCH OPERATIONS
- Weekly review: top failure categories, new edge cases, cost regressions.
- Versioning: prompts, tool schemas, and policies must be versioned and replayable.
- Quarterly audit: permissions review, access logs, retention compliance, incident postmortems.
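
For versions to be replayable, each run needs to record exactly which prompt, tool schemas, and policy it used. One simple approach is a content hash over all three, sketched below; the function name and the 12-character truncation are assumptions, and the inputs are assumed to be JSON-serializable.

```python
import hashlib
import json


def config_version(prompt, tool_schemas, policy):
    """Derive a stable short version ID from the full agent configuration."""
    blob = json.dumps(
        {"prompt": prompt, "tool_schemas": tool_schemas, "policy": policy},
        sort_keys=True,  # key order must not change the version
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]
```

Storing this ID on every trace lets a reviewer retrieve the exact configuration behind any past run, and any change to a prompt, schema, or policy produces a new ID automatically.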

If you cannot answer “What did the agent do, with what permissions, using which tool calls, at what cost, and why?” you are not ready for broad autonomy.