AGENTIC WORKFLOW LAUNCH CHECKLIST (2026)

Goal: Launch one production agent workflow with measurable ROI and controlled risk in 30–60 days.

1) Workflow Selection (Week 0)
- Pick a workflow with: (a) ≥200 tasks/month, (b) clear start/end states, (c) low-to-moderate risk.
- Write a one-page “Definition of Done”: inputs, outputs, edge cases, and escalation rules.
- Define the KPI trio:
  - Business: cost per outcome (e.g., $/ticket resolved, $/qualified lead)
  - Quality: severity-weighted accuracy (define P0/P1/P2 errors)
  - Ops: tool success rate + latency (p95)

2) Data + Evals (Week 1)
- Build an initial eval set of 100–200 historical cases.
- Label each case with:
  - Expected action (approve/deny/escalate)
  - Required evidence (which fields/docs must be present)
  - Severity of wrong output (P0/P1/P2)
- Establish a regression rule: no model/prompt/tool change ships without running the eval suite.

3) Tooling + Safety (Weeks 1–2)
- Wrap every tool/API with:
  - Typed schemas (structured inputs/outputs)
  - Idempotency keys for write actions
  - Allowlisted parameters (no free-form deletes/updates)
  - Least-privilege credentials (no shared admin tokens)
- Add explicit human approval gates for high-severity actions (e.g., refunds above $X, production deploys).

4) Build + Shadow Mode (Weeks 2–3)
- Run the agent in shadow mode for 1–2 weeks.
- Track:
  - Containment rate (how often the agent could have completed without human)
  - Disagreement rate vs. humans
  - P0 error rate (must be near-zero before autonomy)
- Create a weekly review ritual: top failures, root causes, fixes, and new eval cases.

5) Graduated Autonomy (Weeks 4–6)
- Start with low-risk actions only (e.g., tagging, drafting, suggesting).
- Move to limited write actions with caps (e.g., refunds ≤$50, changes only in sandbox).
- Define rollback triggers (examples):
  - P0 error detected
  - Tool failure rate >2% daily
  - Cost per outcome increases week-over-week for 2 weeks

6) Monitoring + Governance (Ongoing)
- Log everything needed to replay incidents: prompts, retrieved docs, tool calls, outputs, reviewer decisions.
- Build dashboards:
  - Cost per successful outcome
  - Escalation rate + reasons
  - Tool call success rate
  - Drift indicators (changes in input mix, rising uncertainty)
- Assign owners:
  - Functional owner (Support/Sales/Finance) for KPI outcomes
  - Platform/Security owner for permissions, logging, and credential hygiene
  - Agent Ops owner for eval maintenance and incident reviews

Ship criteria (minimum):
- ≥200-case eval suite in CI
- 100% tool calls behind typed wrappers
- Human gate for all high-severity actions
- Ability to reproduce any incident within 24 hours
- Clear unit economics dashboard tied to business KPIs