ICMD 90-DAY AGENT LAUNCH PACK (2026) Use this as a working doc for your first production agent. Goal: bounded autonomy with measurable reliability. 1) WORKFLOW SELECTION (Day 1–3) - Pick ONE workflow with high volume and clear outcomes (e.g., “refund requests under $200,” “reset MFA,” “triage top 15 support macros”). - Define the atomic unit of work (a “task”). - Define hard constraints: max $ amount, allowed systems, allowed customer segments, and “must escalate if…” conditions. - Write a one-paragraph user story and a one-paragraph failure story (what goes wrong and who gets hurt). 2) METRICS + SLOS (Day 3–7) Define targets before you build: - Task Success Rate (TSR): % tasks completed correctly end-to-end. - Severe Error Rate (SER): % tasks with policy violation, wrong account/action, or irreversible harm. - Escalation Rate (ER): % tasks routed to humans. - Rework Rate (RR): % tasks that humans must redo. - Cost per Task (CPT): model + infra + tool calls. - Latency: P50/P95 wall-clock time. Suggested initial targets: TSR 70–85%, SER <0.5%, CPT <$0.25, P95 <20s (adjust by domain). 3) TOOLING + PERMISSIONS (Week 2–4) - Build a typed tool SDK: strict schemas, input validation, and documented side effects. - Add idempotency keys for any write action. - Add dry-run mode for every write-capable tool. - Create least-privilege credentials per tenant; never reuse global admin tokens. - Implement approval gates for sensitive actions (refunds, deletions, payroll, external emails). 4) EVALUATION HARNESS (Week 4–6) - Assemble 200–1,000 real cases (anonymized) that represent production distribution. - Build offline regression tests that run on every prompt/model/tool change. - Track: TSR, SER, CPT, and “reason for failure” taxonomy. - Include adversarial cases: missing fields, ambiguous instructions, conflicting policies, angry customer tone. 5) OBSERVABILITY + AUDIT (Week 5–8) - Log per run: run_id, model, prompts (or hashes), retrieved docs, tool calls, tool outputs, latency, cost, confidence. - Store an immutable audit trail suitable for security review. - Add kill switch (global + per-tenant) and rollback strategy. - Add anomaly alerts: SER spike, tool error spike, unusually high cost/task, or repeated retries. 6) LAUNCH PLAN (Week 7–12) - Phase 1: supervised autonomy (agent proposes; human approves). Target approval rate >60%. - Phase 2: auto-exec low-risk actions only (tagging, routing, drafting, field suggestions). - Phase 3: expand autonomy with explicit policy updates and updated eval suites. - Run canaries: 1% → 5% → 25% traffic with rollback criteria. 7) ENTERPRISE GOVERNANCE PACK (Prepare by Week 10–12) - Security one-pager: data flow, retention, encryption, training policy, subprocessors. - Controls checklist: SSO/SAML, SCIM, RBAC, audit logs, key management, incident response. - Compliance roadmap: SOC 2 Type I → Type II timeline; HIPAA BAA if needed. - Customer-facing “Agent Policy”: what it can do, what it will never do, and how it escalates. If you can’t quantify success/failure, you can’t sell autonomy. Treat this like payments or SRE: define blast radius, instrument everything, and earn trust step by step.