ICMD AgentOps 90-Day Launch Pack

Use this to ship one production-grade agent (internal or customer-facing) without creating a security or reliability debt bomb.

1) Scope & ROI (Days 0–15)
- Pick ONE workflow with clear volume and outcome (e.g., “resolve password reset tickets,” “summarize CI failures,” “draft renewal quotes”).
- Define success metrics: target % autonomy, accuracy threshold, max latency (p95), and max $ cost per task.
- Write a “permission contract” for the agent: allowed tools, allowed data classes, forbidden actions, and escalation rules.
- Establish a baseline: current human handle time, error rate, and monthly volume. Convert these to a monthly $ impact (volume × handle time × loaded labor cost, plus the cost of errors).
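
To make the baseline-to-dollars step concrete, here is a minimal sketch of the arithmetic. All figures (4,000 tasks per month, 6 minutes per task, $50 loaded hourly cost, 50% autonomy, $0.25 agent cost per task) are invented for illustration, and the names `Baseline` and `monthly_net_savings` are hypothetical. Error costs are left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    monthly_volume: int        # tasks per month
    handle_minutes: float      # average human handle time per task
    loaded_hourly_cost: float  # assumed fully loaded $ per human hour


def monthly_net_savings(b: Baseline, autonomy_rate: float,
                        agent_cost_per_task: float) -> float:
    """Human cost avoided on automated tasks minus the agent's run cost."""
    automated_tasks = b.monthly_volume * autonomy_rate
    human_cost_per_task = b.handle_minutes / 60 * b.loaded_hourly_cost
    return automated_tasks * (human_cost_per_task - agent_cost_per_task)


# Illustrative numbers only; replace with your measured baseline.
baseline = Baseline(monthly_volume=4000, handle_minutes=6, loaded_hourly_cost=50.0)
savings = monthly_net_savings(baseline, autonomy_rate=0.5, agent_cost_per_task=0.25)
print(f"Estimated net savings: ${savings:,.2f} per month")
```

With these inputs, 2,000 tasks are automated at a human cost of $5.00 each versus $0.25 for the agent, so the estimate is $9,500 per month.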

2) Architecture (Days 16–35)
- Orchestrator: choose a state-machine or workflow style (graph or steps) and log every step.
- Tools: split read vs write tools; add schemas; enforce idempotency keys on side-effectful calls.
- Retrieval: implement provenance (source, timestamp, ACL); prefer small authoritative snippets over full-doc dumps.
- Budgets: hard caps on tool calls, wall time, and cost per run.
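
One way the budget caps and idempotency keys could fit together is sketched below. This is an assumption-laden illustration, not a prescribed design: `RunBudget`, `idempotency_key`, and `execute_write` are hypothetical names, the dedupe store is an in-memory dict standing in for a durable one, and the write itself is faked with a string.

```python
import hashlib
import json
import time


class BudgetExceeded(RuntimeError):
    """Raised when a run would exceed one of its hard caps."""


class RunBudget:
    """Hard caps for a single agent run; charge() is called before each tool call."""

    def __init__(self, max_tool_calls: int, max_seconds: float, max_cost_usd: float):
        self.max_tool_calls = max_tool_calls
        self.max_seconds = max_seconds
        self.max_cost_usd = max_cost_usd
        self.tool_calls = 0
        self.cost_usd = 0.0
        self.started = time.monotonic()

    def charge(self, cost_usd: float) -> None:
        if self.tool_calls + 1 > self.max_tool_calls:
            raise BudgetExceeded("tool-call cap reached")
        if self.cost_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost cap reached")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-time cap reached")
        self.tool_calls += 1
        self.cost_usd += cost_usd


def idempotency_key(run_id: str, tool: str, args: dict) -> str:
    """Same run + tool + arguments always yields the same key."""
    payload = json.dumps({"run": run_id, "tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


_completed = {}  # key -> result; in-memory stand-in for a durable dedupe store


def execute_write(budget: RunBudget, run_id: str, tool: str, args: dict,
                  cost_usd: float = 0.01) -> str:
    key = idempotency_key(run_id, tool, args)
    if key in _completed:          # a retry: return the stored result, no side effect
        return _completed[key]
    budget.charge(cost_usd)        # raises before the side effect if over budget
    result = f"{tool} applied"     # placeholder for the real side-effectful call
    _completed[key] = result
    return result


budget = RunBudget(max_tool_calls=5, max_seconds=60, max_cost_usd=1.00)
execute_write(budget, "run-1", "reset_password", {"user": "u1"})
execute_write(budget, "run-1", "reset_password", {"user": "u1"})  # deduplicated retry
print(budget.tool_calls)
```

The retry is served from the dedupe store, so it neither repeats the side effect nor consumes budget.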

3) Security & Governance (Days 16–60, in parallel with Architecture and Evaluation)
- Least privilege: per-tool scopes; rotate credentials; never embed overprivileged API keys in prompts.
- Policy gate: evaluate each planned tool call (resource allowlist, user auth, data class, thresholds).
- Prompt injection posture: treat retrieved text as untrusted; never follow instructions embedded in retrieved sources; red-team with poisoned docs.
- Data handling: classify logged data (PII/PCI/secrets/internal); redact outputs before they reach external channels.
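
The policy gate can be sketched as a pure function that inspects each planned tool call before it runs. The sketch below is one possible shape under stated assumptions: the `PlannedCall` and `ToolPolicy` types, the `issue_refund` tool, and the $100 approval threshold are all hypothetical, and a real gate would load its policies from the permission contract rather than a literal dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedCall:
    tool: str
    resource: str
    data_class: str            # e.g. "internal", "pii", "pci", "secrets"
    user_authenticated: bool
    amount_usd: float = 0.0    # only meaningful for tools that move money


@dataclass(frozen=True)
class ToolPolicy:
    allowed_resources: frozenset
    allowed_data_classes: frozenset
    approval_threshold_usd: float


def evaluate(call: PlannedCall, policies: dict) -> tuple:
    """Return (decision, reason); decision is 'allow', 'deny', or 'escalate'."""
    policy = policies.get(call.tool)
    if policy is None:
        return ("deny", "tool not in permission contract")
    if not call.user_authenticated:
        return ("deny", "user not authenticated")
    if call.resource not in policy.allowed_resources:
        return ("deny", "resource not on allowlist")
    if call.data_class not in policy.allowed_data_classes:
        return ("deny", "data class not permitted")
    if call.amount_usd > policy.approval_threshold_usd:
        return ("escalate", "amount above approval threshold")
    return ("allow", "ok")


# Hypothetical policy table for a single write tool.
POLICIES = {
    "issue_refund": ToolPolicy(
        allowed_resources=frozenset({"orders"}),
        allowed_data_classes=frozenset({"internal"}),
        approval_threshold_usd=100.0,
    )
}

small = PlannedCall("issue_refund", "orders", "internal", True, 20.0)
large = PlannedCall("issue_refund", "orders", "internal", True, 500.0)
print(evaluate(small, POLICIES))
print(evaluate(large, POLICIES))
```

Defaulting to deny for any tool that is missing from the policy table keeps the gate aligned with least privilege: a newly added tool does nothing until someone writes its policy.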

4) Evaluation (Days 36–60)
- Build an eval suite from real cases (100–500): include edge cases and failure examples.
- Define pass/fail criteria per case: correctness, citations/provenance, policy compliance, and budget compliance.
- Run evals in CI: compare prompt/model/retrieval changes; require approvals for regressions.
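
As a rough illustration of per-case pass/fail criteria and a CI regression check, the sketch below scores each case on the four criteria listed above and flags cases that passed on the baseline but fail on the candidate. `CaseResult`, `pass_rate`, and `regressions` are hypothetical names, and how each boolean gets graded (exact match, rubric, human label) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    correct: bool
    cited: bool          # citations/provenance present and valid
    policy_ok: bool
    within_budget: bool

    @property
    def passed(self) -> bool:
        # A case passes only if every criterion passes.
        return self.correct and self.cited and self.policy_ok and self.within_budget


def pass_rate(results: list) -> float:
    if not results:
        raise ValueError("empty eval suite")
    return sum(r.passed for r in results) / len(results)


def regressions(baseline: list, candidate: list) -> list:
    """Case IDs that passed on the baseline but fail on the candidate."""
    passed_before = {r.case_id for r in baseline if r.passed}
    return sorted(r.case_id for r in candidate
                  if r.case_id in passed_before and not r.passed)


baseline_run = [
    CaseResult("case-001", True, True, True, True),
    CaseResult("case-002", True, True, True, True),
    CaseResult("case-003", False, True, True, True),
]
candidate_run = [
    CaseResult("case-001", True, True, True, True),
    CaseResult("case-002", True, False, True, True),   # lost its citation
    CaseResult("case-003", True, True, True, True),    # newly fixed
]

print(pass_rate(candidate_run))
print(regressions(baseline_run, candidate_run))
```

In this toy data the overall pass rate is unchanged between runs, yet `case-002` regressed. That is why the CI job should block on the regression list and not only on the aggregate rate.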

5) Pilot & Human-in-the-Loop (Days 61–75)
- Roll out in stages: read-only, then propose-only, then execute with approvals.
- Implement reviewer UI: show plan, retrieved sources, tool calls, and a one-click “escalate” path.
- Track: autonomy rate, reviewer override rate, top failure categories, and time saved.
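
The pilot metrics can be computed from a simple per-task record. The sketch below assumes particular definitions that you may want to change: a task counts as autonomous if it was neither escalated nor overridden, the override rate is taken over all pilot tasks, and time saved is only credited to autonomous tasks. `PilotTask` and `summarize` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PilotTask:
    escalated: bool
    overridden: bool
    minutes_saved: float = 0.0
    failure_category: Optional[str] = None


def summarize(tasks: list, top_n: int = 3) -> dict:
    if not tasks:
        raise ValueError("no pilot tasks recorded")
    total = len(tasks)
    autonomous = [t for t in tasks if not t.escalated and not t.overridden]
    failures = Counter(t.failure_category for t in tasks if t.failure_category)
    return {
        "autonomy_rate": len(autonomous) / total,
        "override_rate": sum(t.overridden for t in tasks) / total,
        "top_failures": failures.most_common(top_n),
        "minutes_saved": sum(t.minutes_saved for t in autonomous),
    }


pilot = [
    PilotTask(escalated=False, overridden=False, minutes_saved=5.0),
    PilotTask(escalated=False, overridden=False, minutes_saved=5.0),
    PilotTask(escalated=False, overridden=True, failure_category="bad retrieval"),
    PilotTask(escalated=True, overridden=False, failure_category="bad retrieval"),
]
report = summarize(pilot)
print(report)
```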

6) Production Ops (Days 76–90)
- SLOs: define p95 latency, error rate, and policy-violation rate. Set alert thresholds.
- Runbooks: “tool down,” “rate limited,” “bad retrieval,” “policy block spike,” “cost spike.”
- Incident process: categorize by layer (tool/state/context/policy/reasoning) and write postmortems.
- Rollback: ensure actions can be reversed or corrected; store action logs for audits.
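
A minimal SLO check over one reporting window might look like the sketch below. It uses the nearest-rank method for p95, which is one of several common percentile definitions. The function names, the SLO keys, and the thresholds are assumptions for illustration.

```python
def p95(values: list) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


def breached_slos(latencies_ms: list, errors: int, policy_violations: int,
                  slo: dict) -> list:
    """Return the names of SLOs breached in this window (one latency sample per run)."""
    runs = len(latencies_ms)
    observed = {
        "p95_latency_ms": p95(latencies_ms),
        "error_rate": errors / runs,
        "policy_violation_rate": policy_violations / runs,
    }
    return [name for name, value in observed.items() if value > slo[name]]


# Hypothetical thresholds; set these from the success metrics defined in scoping.
SLO = {"p95_latency_ms": 90, "error_rate": 0.05, "policy_violation_rate": 0.0}

window = list(range(1, 101))   # 100 runs with latencies of 1..100 ms
print(p95(window))
print(breached_slos(window, errors=2, policy_violations=1, slo=SLO))
```

Here the error rate (2%) is inside its threshold, while p95 latency (95 ms against a 90 ms target) and the policy-violation rate (1% against a zero-tolerance target) would both alert.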

Go/No-Go Gate
- Go if: eval pass rate meets threshold on low-risk cases, no high-severity policy violations in pilot, and costs/latency are inside budgets.
- No-Go if: you can’t reproduce failures, can’t explain action provenance, or can’t enforce least privilege at the tool layer.
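
The gate can be encoded as a small check that returns the list of blockers, so a No-Go always comes with its reasons. This is a sketch under assumptions: `LaunchEvidence` and `go_no_go` are hypothetical names, and the thresholds are whatever you set during scoping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchEvidence:
    eval_pass_rate: float            # measured on low-risk cases
    pass_threshold: float
    high_severity_violations: int    # observed during the pilot
    p95_latency_ms: float
    max_p95_latency_ms: float
    cost_per_task_usd: float
    max_cost_per_task_usd: float
    failures_reproducible: bool
    provenance_explainable: bool
    least_privilege_enforced: bool


def go_no_go(e: LaunchEvidence) -> tuple:
    """Return (go, blockers); go is True only when no blocker is found."""
    blockers = []
    if e.eval_pass_rate < e.pass_threshold:
        blockers.append("eval pass rate below threshold")
    if e.high_severity_violations > 0:
        blockers.append("high-severity policy violation in pilot")
    if e.p95_latency_ms > e.max_p95_latency_ms:
        blockers.append("p95 latency over budget")
    if e.cost_per_task_usd > e.max_cost_per_task_usd:
        blockers.append("cost per task over budget")
    if not e.failures_reproducible:
        blockers.append("failures cannot be reproduced")
    if not e.provenance_explainable:
        blockers.append("action provenance cannot be explained")
    if not e.least_privilege_enforced:
        blockers.append("least privilege not enforced at the tool layer")
    return (not blockers, blockers)


evidence = LaunchEvidence(
    eval_pass_rate=0.93, pass_threshold=0.90,
    high_severity_violations=0,
    p95_latency_ms=4200, max_p95_latency_ms=5000,
    cost_per_task_usd=0.31, max_cost_per_task_usd=0.25,
    failures_reproducible=True,
    provenance_explainable=True,
    least_privilege_enforced=True,
)
print(go_no_go(evidence))
```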

Operating cadence after launch
- Weekly: review top 10 failures + cost drivers.
- Monthly: refresh eval suite with new real cases; rotate credentials; re-run red-team prompts.
- Quarterly: expand scope by one tool or one workflow stage; re-assess SLOs and autonomy targets.