ICMD AgentOps Launch Checklist (90-Day Plan)

Purpose
Use this checklist to ship one production-grade agent workflow in ~90 days. The goal is not a clever demo—it’s a controlled system with measurable reliability, bounded cost, and auditable behavior.

1) Workflow Selection (Week 1)
- Pick ONE workflow with clear success criteria (e.g., “resolve ticket,” “draft PR,” “match invoice”).
- Define success metrics: correctness %, policy compliance %, time-to-complete, and human escalation rate.
- Set a maximum acceptable failure cost (e.g., “any refund > $250 requires approval”).
- Collect 500–5,000 historical examples; label outcomes and edge cases.

2) Platform Foundations (Weeks 2–4)
- Tracing: log model, prompt version, tool calls, retrieval sources, latency, and tokens per run.
- Outcome logging: record real-world effects (ticket closed, refund issued, PR merged) and reversals.
- Prompt/policy versioning: store artifacts in Git; require reviews for changes.
- Tool gateway: enforce allowlists, JSON schema validation, and rate limits for every tool call.

3) Guardrails & Security (Weeks 3–6)
- Access control: per-user auth context for retrieval; avoid shared “superuser” tokens.
- Data protection: redact PII/PHI in prompts and logs; run DLP checks on outputs.
- Policy gates as code: deterministic checks before any irreversible action.
- Kill switch: per-workflow ability to disable auto-actions instantly.

4) Evaluation & Release (Weeks 5–10)
- Offline eval: replay historical tasks nightly; track regressions by prompt/model version.
- Canary: run 1–5% of traffic with strict budgets and enhanced logging.
- Adversarial tests: prompt-injection attempts, tool misuse attempts, cross-tenant access probes.
- Budget caps: set max tool calls, max steps, and max $ spend per task.

5) Human-in-the-Loop Graduation (Weeks 9–12)
- Start in “suggestion mode” (human executes); measure time saved and correction rate.
- Promote low-risk subsets to “auto mode”; keep approvals for high-risk actions.
- Define escalation UX: the agent must hand off with a concise trace and citations.

6) Ongoing Operations (Post-launch)
- Weekly review of top failure modes and cost hotspots; prioritize fixes by $ impact.
- Drift monitoring: alert on changes in escalation rate, policy flags, and cost/task (+15% threshold).
- Incident process: severity levels, postmortems, and rollback playbooks for agent regressions.

Deliverable Definition (Done Means)
- You can replay any run, explain any action, cap spend per task, and halt automation within 15 minutes.