AI EMPLOYEE LAUNCH KIT (90 DAYS)

GOAL
Ship a production-grade "AI employee" that owns a bounded outcome (not a feature), with measurable reliability, auditable controls, and sustainable unit economics.

PHASE 1 (DAYS 1–30): PICK THE WEDGE + SHIP SUPERVISED

1) Choose a single unit of work:
   - One-sentence outcome (e.g., "Resolve Tier-1 billing tickets under $100").
   - Define what "done" means (status change, record updated, customer notified).
   - Define a structured output schema (JSON) for decisions.

2) Baseline ROI before automation:
   - Current volume/week, median cycle time, backlog size.
   - Human cost per unit (loaded cost ÷ units).

3) Integrations:
   - Connect 1–2 systems of record (e.g., Zendesk + Stripe; NetSuite + email).
   - Implement least-privilege credentials from day one.

4) Supervised execution:
   - Build a review queue UI (approve/edit/reject).
   - Require approvals for all actions initially.

Metrics to track weekly:
   - Success rate (human-approved without edits)
   - Escalation rate (agent requests help)
   - Average tool calls per run
   - Cost per successful task (compute + infra)
   - Median and 95th-percentile latency

PHASE 2 (DAYS 31–60): CONTROLS + EVALS + COST ENGINEERING

5) Control plane:
   - Policy rules: allowed actions, thresholds, approvals.
   - Audit logs: every tool-call input/output + correlation IDs.
   - Environment separation: dev/stage/prod.

6) Evaluation harness:
   - Create a labeled dataset from reviewed runs (minimum 200–500 examples).
   - Run nightly replay tests; gate releases on success thresholds.
   - Red-team suite: prompt injection, data leakage, tool misuse.

7) Cost engineering:
   - "Cheap-first" routing (classifier/extractor before a large model).
   - Cap loops and retries.
   - Cache retrieval results and common responses.

Targets by day 60:
   - ≥85–90% success on the eval set
   - Clear per-task cost model
   - No high-risk actions without explicit approvals

PHASE 3 (DAYS 61–90): INCREASE AUTONOMY + EXPAND

8) Autonomy ladder:
   - Tier tasks by risk (low/medium/high).
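The structured output schema called for in item 1 can be sketched as a JSON Schema with a minimal validation check. The field names here (action, amount_usd, rationale, confidence) and the action vocabulary are illustrative assumptions for a billing-tickets wedge, not a prescribed format:

```python
# Illustrative decision schema for a Tier-1 billing agent. Field names and
# the action enum are assumptions for this sketch, not a required format.
DECISION_SCHEMA = {
    "type": "object",
    "required": ["action", "amount_usd", "rationale", "confidence"],
    "properties": {
        "action": {"enum": ["refund", "escalate", "close", "request_info"]},
        "amount_usd": {"type": "number", "minimum": 0},
        "rationale": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
}

def validate_decision(decision: dict) -> bool:
    """Minimal structural check against DECISION_SCHEMA (stdlib only)."""
    if not all(k in decision for k in DECISION_SCHEMA["required"]):
        return False
    if decision["action"] not in DECISION_SCHEMA["properties"]["action"]["enum"]:
        return False
    if not isinstance(decision["amount_usd"], (int, float)) or decision["amount_usd"] < 0:
        return False
    if not isinstance(decision["rationale"], str):
        return False
    c = decision["confidence"]
    return isinstance(c, (int, float)) and 0 <= c <= 1

ok = validate_decision({
    "action": "refund",
    "amount_usd": 42.50,
    "rationale": "Duplicate charge confirmed in Stripe.",
    "confidence": 0.93,
})
```

Rejecting any output that fails validation before it reaches the review queue keeps malformed decisions out of the approval flow.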
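The "cheap-first" routing in item 7 can be sketched as a two-stage dispatcher: a small classifier answers first, and the large model is only called when confidence is low. The stand-in model functions, labels, and the 0.8 threshold are assumptions for illustration:

```python
# Sketch of cheap-first routing: an inexpensive classifier handles the
# task when confident; otherwise fall back to an expensive model call.
# Both model functions below are stand-ins, not real APIs.
def cheap_classifier(task: str) -> tuple[str, float]:
    """Stand-in for a small model: returns (label, confidence)."""
    if "refund" in task.lower():
        return "billing_refund", 0.95
    return "unknown", 0.30

def large_model(task: str) -> str:
    """Stand-in for an expensive large-model call."""
    return "billing_other"

def route(task: str, threshold: float = 0.8) -> tuple[str, str]:
    """Return (label, which_model_answered)."""
    label, conf = cheap_classifier(task)
    if conf >= threshold:
        return label, "cheap"   # large-model call avoided
    return large_model(task), "large"
```

Tracking the cheap/large split per week feeds directly into the cost-per-successful-task metric above.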
   - Low-risk: auto-execute with post-hoc sampling.
   - Medium-risk: auto-execute with confidence thresholds + rollback.
   - High-risk: approvals required (e.g., refunds > $500, contract edits).

9) Incident readiness:
   - On-call rotation and runbooks.
   - Kill switch per customer.
   - Rollback and replay tooling.

10) Go-to-market assets:
   - ROI dashboard template for customers.
   - Security packet: SOC 2 status, data retention, DPA, architecture diagram.
   - Case study with before/after: cycle time, backlog reduction, cost per unit.

Targets by day 90:
   - ≥90–95% success in production for the wedge workflow
   - Gross-margin trajectory to ≥75% at target volume
   - One adjacent workflow pilot leveraging the same integrations

DECISION RULES
   - If the success rate drops >5% week-over-week: reduce autonomy and investigate.
   - If tool calls per task rise >10%: optimize prompts, routing, and caching.
   - If the escalation rate remains >20% after day 60: tighten scope or improve policies/UI.

DELIVERABLES CHECKLIST
   - Workflow spec + JSON schema
   - Review queue UX
   - Policy engine + permissions
   - Full audit logs + run replay
   - Nightly eval harness + red-team tests
   - Cost-per-successful-task dashboard
   - Autonomy ladder + kill switch
   - Customer-facing ROI + security packet
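The autonomy ladder in item 8 can be sketched as a policy function mapping each proposed action to a tier and an execution mode. The $500 refund threshold comes from the text; the other tiering rules and action names are illustrative assumptions:

```python
# Sketch of the autonomy ladder: classify a proposed action by risk tier,
# then pick the execution mode. Only the $500 refund threshold is from the
# playbook; the remaining rules are assumptions for illustration.
def risk_tier(action: str, amount_usd: float = 0.0) -> str:
    if action == "contract_edit" or (action == "refund" and amount_usd > 500):
        return "high"
    if action == "refund":
        return "medium"
    return "low"

def dispatch(action: str, amount_usd: float = 0.0) -> str:
    return {
        "low": "auto_execute_with_sampling",
        "medium": "auto_execute_with_rollback",
        "high": "require_approval",
    }[risk_tier(action, amount_usd)]
```

Keeping the tiering in one pure function makes the policy easy to audit and to replay against logged runs.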
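The decision rules above can be sketched as a weekly check that compares this week's metrics against last week's and emits the prescribed actions. The thresholds come from the DECISION RULES section; the metric-dictionary shape is an assumption:

```python
# Sketch of the decision rules as a weekly monitor. Thresholds (5% success
# drop, 10% tool-call rise, 20% escalation after day 60) are from the
# playbook; the dict keys are an assumed metrics format.
def weekly_actions(prev: dict, curr: dict, day: int) -> list[str]:
    actions = []
    if prev["success_rate"] - curr["success_rate"] > 0.05:
        actions.append("reduce_autonomy_and_investigate")
    if curr["tool_calls_per_task"] > prev["tool_calls_per_task"] * 1.10:
        actions.append("optimize_prompts_routing_caching")
    if day > 60 and curr["escalation_rate"] > 0.20:
        actions.append("tighten_scope_or_improve_policies")
    return actions

flagged = weekly_actions(
    {"success_rate": 0.91, "tool_calls_per_task": 5.0, "escalation_rate": 0.12},
    {"success_rate": 0.82, "tool_calls_per_task": 6.0, "escalation_rate": 0.25},
    day=70,
)
```

Running this against the weekly metrics dashboard turns the decision rules into an automated alert rather than a manual review step.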