ICMD 90-DAY AGENT LAUNCH PACK (2026)

Use this as a working doc for your first production agent. Goal: bounded autonomy with measurable reliability.

1) WORKFLOW SELECTION (Day 1–3)
- Pick ONE workflow with high volume and clear outcomes (e.g., “refund requests under $200,” “reset MFA,” “triage top 15 support macros”).
- Define the atomic unit of work (a “task”).
- Define hard constraints: max $ amount, allowed systems, allowed customer segments, and “must escalate if…” conditions.
- Write a one-paragraph user story and a one-paragraph failure story (what goes wrong and who gets hurt).

2) METRICS + SLOS (Day 3–7)
Define targets before you build:
- Task Success Rate (TSR): % tasks completed correctly end-to-end.
- Severe Error Rate (SER): % tasks with policy violation, wrong account/action, or irreversible harm.
- Escalation Rate (ER): % tasks routed to humans.
- Rework Rate (RR): % tasks that humans must redo.
- Cost per Task (CPT): model + infra + tool calls.
- Latency: P50/P95 wall-clock time.
Suggested initial targets: TSR 70–85%, SER <0.5%, CPT <$0.25, P95 <20s (adjust by domain).

3) TOOLING + PERMISSIONS (Week 2–4)
- Build a typed tool SDK: strict schemas, input validation, and documented side effects.
- Add idempotency keys for any write action.
- Add dry-run mode for every write-capable tool.
- Create least-privilege credentials per tenant; never reuse global admin tokens.
- Implement approval gates for sensitive actions (refunds, deletions, payroll, external emails).

4) EVALUATION HARNESS (Week 4–6)
- Assemble 200–1,000 real cases (anonymized) that represent production distribution.
- Build offline regression tests that run on every prompt/model/tool change.
- Track: TSR, SER, CPT, and “reason for failure” taxonomy.
- Include adversarial cases: missing fields, ambiguous instructions, conflicting policies, angry customer tone.
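
The write-tool controls from section 3 (strict schema, input validation, idempotency key, dry-run, approval gate) combined in one tool, as an added example that is not part of the original pack. The refund tool, its field names, the $50 approval threshold and the in-memory stores are assumptions for illustration; the $200 hard limit follows the pack's "refund requests under $200" example.

```python
# Added example: hypothetical refund tool. All names and limits are assumptions.
import hashlib, json
from dataclasses import dataclass

HARD_LIMIT_CENTS = 20_000      # assumed hard constraint ("refunds under $200")
APPROVAL_OVER_CENTS = 5_000    # assumed approval-gate threshold

@dataclass(frozen=True)
class RefundRequest:
    tenant_id: str
    order_id: str
    amount_cents: int
    dry_run: bool = True       # safe default: no side effects unless requested

    def validate(self):
        if not (isinstance(self.tenant_id, str) and self.tenant_id
                and isinstance(self.order_id, str) and self.order_id):
            raise ValueError("tenant_id and order_id must be non-empty strings")
        if type(self.amount_cents) is not int or self.amount_cents <= 0:
            raise ValueError("amount_cents must be a positive integer")
        if self.amount_cents >= HARD_LIMIT_CENTS:
            raise ValueError("amount at or above hard limit: escalate to a human")

    def idempotency_key(self):
        # Same tenant/order/amount always maps to the same key, so a retry
        # after a timeout cannot issue a second refund.
        raw = json.dumps([self.tenant_id, self.order_id, self.amount_cents])
        return hashlib.sha256(raw.encode()).hexdigest()

class RefundTool:
    def __init__(self):
        self._completed = {}   # in-memory stand-in for a durable idempotency store
        self.ledger = []       # stand-in for the payment system side effect

    def execute(self, req, approved_by=None):
        req.validate()
        key = req.idempotency_key()
        if key in self._completed:
            return {**self._completed[key], "replayed": True}
        if req.amount_cents > APPROVAL_OVER_CENTS and approved_by is None:
            return {"status": "needs_approval", "key": key}
        if req.dry_run:
            return {"status": "dry_run", "key": key}
        self.ledger.append((req.tenant_id, req.order_id, req.amount_cents))
        result = {"status": "refunded", "key": key, "approved_by": approved_by}
        self._completed[key] = result
        return result
```

Only a completed write is recorded against the key, so dry runs and pending approvals can be repeated freely while a retried refund returns the stored result instead of paying twice.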

5) OBSERVABILITY + AUDIT (Week 5–8)
- Log per run: run_id, model, prompts (or hashes), retrieved docs, tool calls, tool outputs, latency, cost, confidence.
- Store an immutable audit trail suitable for security review.
- Add kill switch (global + per-tenant) and rollback strategy.
- Add anomaly alerts: SER spike, tool error spike, unusually high cost/task, or repeated retries.

6) LAUNCH PLAN (Week 7–12)
- Phase 1: supervised autonomy (agent proposes; human approves). Target approval rate >60%.
- Phase 2: auto-exec low-risk actions only (tagging, routing, drafting, field suggestions).
- Phase 3: expand autonomy with explicit policy updates and updated eval suites.
- Run canaries: 1% → 5% → 25% traffic with rollback criteria.

7) ENTERPRISE GOVERNANCE PACK (Prepare by Week 10–12)
- Security one-pager: data flow, retention, encryption, training policy, subprocessors.
- Controls checklist: SSO/SAML, SCIM, RBAC, audit logs, key management, incident response.
- Compliance roadmap: SOC 2 Type I → Type II timeline; HIPAA BAA if needed.
- Customer-facing “Agent Policy”: what it can do, what it will never do, and how it escalates.

If you can’t quantify success/failure, you can’t sell autonomy. Treat this like payments or SRE: define blast radius, instrument everything, and earn trust step by step.