AI EMPLOYEE LAUNCH KIT (90 DAYS)

GOAL
Ship a production-grade "AI employee" that owns a bounded outcome (not a feature), with measurable reliability, auditable controls, and sustainable unit economics.

PHASE 1 (DAYS 1–30): PICK THE WEDGE + SHIP SUPERVISED

1) Choose a single unit of work:
   - One-sentence outcome (e.g., "Resolve Tier-1 billing tickets under $100").
   - Define what "done" means (status change, record updated, customer notified).
   - Define a structured output schema (JSON) for decisions.

2) Baseline ROI before automation:
   - Current volume/week, median cycle time, backlog size.
   - Human cost per unit (loaded cost ÷ units).

3) Integrations:
   - Connect 1–2 systems of record (e.g., Zendesk + Stripe; NetSuite + email).
   - Implement least-privilege credentials from day one.

4) Supervised execution:
   - Build a review queue UI (approve/edit/reject).
   - Require approvals for all actions initially.

Metrics to track weekly:
   - Success rate (human-approved without edits)
   - Escalation rate (agent requests help)
   - Average tool calls per run
   - Cost per successful task (compute + infra)
   - Median and 95th-percentile latency

PHASE 2 (DAYS 31–60): CONTROLS + EVALS + COST ENGINEERING

5) Control plane:
   - Policy rules: allowed actions, thresholds, approvals.
   - Audit logs: every tool-call input/output + correlation IDs.
   - Environment separation: dev/stage/prod.

6) Evaluation harness:
   - Create a labeled dataset from reviewed runs (minimum 200–500 examples).
   - Run nightly replay tests; gate releases on success thresholds.
   - Red-team suite: prompt injection, data leakage, tool misuse.

7) Cost engineering:
   - "Cheap-first" routing (classifier/extractor before a large model).
   - Cap loops and retries.
   - Cache retrieval results and common responses.

Targets by day 60:
   - ≥85–90% success on the eval set
   - Clear per-task cost model
   - No high-risk actions without explicit approvals

PHASE 3 (DAYS 61–90): INCREASE AUTONOMY + EXPAND

8) Autonomy ladder:
   - Tier tasks by risk (low/medium/high).
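The structured output schema called for in item 1 can be sketched as a JSON Schema with a minimal validation check. The field names here (action, amount_usd, rationale, confidence) and the action vocabulary are illustrative assumptions for a billing-tickets wedge, not a prescribed format:

```python
# Illustrative decision schema for a Tier-1 billing agent. Field names and
# the action enum are assumptions for this sketch, not a required format.
DECISION_SCHEMA = {
    "type": "object",
    "required": ["action", "amount_usd", "rationale", "confidence"],
    "properties": {
        "action": {"enum": ["refund", "escalate", "close", "request_info"]},
        "amount_usd": {"type": "number", "minimum": 0},
        "rationale": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
}

def validate_decision(decision: dict) -> bool:
    """Minimal structural check against DECISION_SCHEMA (stdlib only)."""
    if not all(k in decision for k in DECISION_SCHEMA["required"]):
        return False
    if decision["action"] not in DECISION_SCHEMA["properties"]["action"]["enum"]:
        return False
    if not isinstance(decision["amount_usd"], (int, float)) or decision["amount_usd"] < 0:
        return False
    if not isinstance(decision["rationale"], str):
        return False
    c = decision["confidence"]
    return isinstance(c, (int, float)) and 0 <= c <= 1

ok = validate_decision({
    "action": "refund",
    "amount_usd": 42.50,
    "rationale": "Duplicate charge confirmed in Stripe.",
    "confidence": 0.93,
})
```

Rejecting any output that fails validation before it reaches the review queue keeps malformed decisions out of the approval flow.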
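The "cheap-first" routing in item 7 can be sketched as a two-stage dispatcher: a small classifier answers first, and the large model is only called when confidence is low. The stand-in model functions, labels, and the 0.8 threshold are assumptions for illustration:

```python
# Sketch of cheap-first routing: an inexpensive classifier handles the
# task when confident; otherwise fall back to an expensive model call.
# Both model functions below are stand-ins, not real APIs.
def cheap_classifier(task: str) -> tuple[str, float]:
    """Stand-in for a small model: returns (label, confidence)."""
    if "refund" in task.lower():
        return "billing_refund", 0.95
    return "unknown", 0.30

def large_model(task: str) -> str:
    """Stand-in for an expensive large-model call."""
    return "billing_other"

def route(task: str, threshold: float = 0.8) -> tuple[str, str]:
    """Return (label, which_model_answered)."""
    label, conf = cheap_classifier(task)
    if conf >= threshold:
        return label, "cheap"   # large-model call avoided
    return large_model(task), "large"
```

Tracking the cheap/large split per week feeds directly into the cost-per-successful-task metric above.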
   - Low-risk: auto-execute with post-hoc sampling.
   - Medium-risk: auto-execute with confidence thresholds + rollback.
   - High-risk: approvals required (e.g., refunds > $500, contract edits).

9) Incident readiness:
   - On-call rotation and runbooks.
   - Kill switch per customer.
   - Rollback and replay tooling.

10) Go-to-market assets:
   - ROI dashboard template for customers.
   - Security packet: SOC 2 status, data retention, DPA, architecture diagram.
   - Case study with before/after: cycle time, backlog reduction, cost per unit.

Targets by day 90:
   - ≥90–95% success in production for the wedge workflow
   - Gross-margin trajectory to ≥75% at target volume
   - One adjacent workflow pilot leveraging the same integrations

DECISION RULES
   - If the success rate drops >5% week-over-week: reduce autonomy and investigate.
   - If tool calls per task rise >10%: optimize prompts, routing, and caching.
   - If the escalation rate remains >20% after day 60: tighten scope or improve policies/UI.

DELIVERABLES CHECKLIST
   - Workflow spec + JSON schema
   - Review queue UX
   - Policy engine + permissions
   - Full audit logs + run replay
   - Nightly eval harness + red-team tests
   - Cost-per-successful-task dashboard
   - Autonomy ladder + kill switch
   - Customer-facing ROI + security packet
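The autonomy ladder in item 8 can be sketched as a policy function mapping each proposed action to a tier and an execution mode. The $500 refund threshold comes from the text; the other tiering rules and action names are illustrative assumptions:

```python
# Sketch of the autonomy ladder: classify a proposed action by risk tier,
# then pick the execution mode. Only the $500 refund threshold is from the
# playbook; the remaining rules are assumptions for illustration.
def risk_tier(action: str, amount_usd: float = 0.0) -> str:
    if action == "contract_edit" or (action == "refund" and amount_usd > 500):
        return "high"
    if action == "refund":
        return "medium"
    return "low"

def dispatch(action: str, amount_usd: float = 0.0) -> str:
    return {
        "low": "auto_execute_with_sampling",
        "medium": "auto_execute_with_rollback",
        "high": "require_approval",
    }[risk_tier(action, amount_usd)]
```

Keeping the tiering in one pure function makes the policy easy to audit and to replay against logged runs.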
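The decision rules above can be sketched as a weekly check that compares this week's metrics against last week's and emits the prescribed actions. The thresholds come from the DECISION RULES section; the metric-dictionary shape is an assumption:

```python
# Sketch of the decision rules as a weekly monitor. Thresholds (5% success
# drop, 10% tool-call rise, 20% escalation after day 60) are from the
# playbook; the dict keys are an assumed metrics format.
def weekly_actions(prev: dict, curr: dict, day: int) -> list[str]:
    actions = []
    if prev["success_rate"] - curr["success_rate"] > 0.05:
        actions.append("reduce_autonomy_and_investigate")
    if curr["tool_calls_per_task"] > prev["tool_calls_per_task"] * 1.10:
        actions.append("optimize_prompts_routing_caching")
    if day > 60 and curr["escalation_rate"] > 0.20:
        actions.append("tighten_scope_or_improve_policies")
    return actions

flagged = weekly_actions(
    {"success_rate": 0.91, "tool_calls_per_task": 5.0, "escalation_rate": 0.12},
    {"success_rate": 0.82, "tool_calls_per_task": 6.0, "escalation_rate": 0.25},
    day=70,
)
```

Running this against the weekly metrics dashboard turns the decision rules into an automated alert rather than a manual review step.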