ICMD AgentOps Launch Checklist (90-Day Plan) Purpose Use this checklist to ship one production-grade agent workflow in ~90 days. The goal is not a clever demo—it’s a controlled system with measurable reliability, bounded cost, and auditable behavior. 1) Workflow Selection (Week 1) - Pick ONE workflow with clear success criteria (e.g., “resolve ticket,” “draft PR,” “match invoice”). - Define success metrics: correctness %, policy compliance %, time-to-complete, and human escalation rate. - Set a maximum acceptable failure cost (e.g., “any refund > $250 requires approval”). - Collect 500–5,000 historical examples; label outcomes and edge cases. 2) Platform Foundations (Weeks 2–4) - Tracing: log model, prompt version, tool calls, retrieval sources, latency, and tokens per run. - Outcome logging: record real-world effects (ticket closed, refund issued, PR merged) and reversals. - Prompt/policy versioning: store artifacts in Git; require reviews for changes. - Tool gateway: enforce allowlists, JSON schema validation, and rate limits for every tool call. 3) Guardrails & Security (Weeks 3–6) - Access control: per-user auth context for retrieval; avoid shared “superuser” tokens. - Data protection: redact PII/PHI in prompts and logs; run DLP checks on outputs. - Policy gates as code: deterministic checks before any irreversible action. - Kill switch: per-workflow ability to disable auto-actions instantly. 4) Evaluation & Release (Weeks 5–10) - Offline eval: replay historical tasks nightly; track regressions by prompt/model version. - Canary: run 1–5% of traffic with strict budgets and enhanced logging. - Adversarial tests: prompt-injection attempts, tool misuse attempts, cross-tenant access probes. - Budget caps: set max tool calls, max steps, and max $ spend per task. 5) Human-in-the-Loop Graduation (Weeks 9–12) - Start in “suggestion mode” (human executes); measure time saved and correction rate. - Promote low-risk subsets to “auto mode”; keep approvals for high-risk actions. - Define escalation UX: the agent must hand off with a concise trace and citations. 6) Ongoing Operations (Post-launch) - Weekly review of top failure modes and cost hotspots; prioritize fixes by $ impact. - Drift monitoring: alert on changes in escalation rate, policy flags, and cost/task (+15% threshold). - Incident process: severity levels, postmortems, and rollback playbooks for agent regressions. Deliverable Definition (Done Means) - You can replay any run, explain any action, cap spend per task, and halt automation within 15 minutes.