AGENTIC WORKFLOW LAUNCH CHECKLIST (90 DAYS)

Goal: Ship one production-grade agentic workflow with measurable ROI, bounded risk, and auditable controls.

1) Scope (Week 1)
- Pick ONE workflow with high repetition, a clear KPI, and low blast radius.
- Write the KPI with baseline and target (examples: “reduce support handle time from 14 min to 10 min”, “cut security questionnaire turnaround from 5 days to 2 days”).
- Define what the agent is NOT allowed to do (e.g., send messages, modify billing, deploy code).

2) Data & Access (Weeks 1–2)
- Inventory the data sources the workflow needs (KB, CRM, ticketing, docs, code).
- Decide the retrieval approach: governed sources only; no “write-everything memory” in v1.
- Create service accounts for tools/APIs; enforce least privilege.
- Define retention rules for logs (e.g., 30 days by default; longer only if needed).

3) Output Contract (Weeks 2–3)
- Specify a strict output schema (fields, required sections, format).
- Require citations (doc IDs/URLs + quoted spans) for factual claims.
- Define acceptable failure: the agent can say “I don’t know” and escalate.
- Create a rubric (0–2 or 0–5 scale) for accuracy, completeness, tone, and safety.

4) Controls (Weeks 3–4)
- Tool allowlist + JSON schema validation for every tool call.
- Hard limits: max steps, max tool calls, timeout, max tokens.
- Cost budget per run in USD; block/abort when exceeded.
- Policy gates: secret/PII detection; block external output if triggered.
- Human approval required for any customer-facing output in v1.

5) Evaluation Harness (Weeks 4–6)
- Build an eval set of 50–200 real cases (start with past tickets/requests).
- Add “hard cases”: edge conditions, ambiguous prompts, known failure modes.
- Establish pass thresholds (example: >=90% schema compliance, >=80% accuracy score).
- Run evals in CI on every change: prompt, retrieval, tool schema, model.

6) Observability (Weeks 5–7)
- Log inputs, plan, tool calls + tool outputs, final output, and cost per run.
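The per-run logging above can be sketched as one append-only JSON Lines record per run. This is a minimal sketch, not a prescribed implementation; the field names and the `runs.jsonl` path are illustrative assumptions.

```python
import json
import time
import uuid


def log_run(inputs, plan, tool_calls, final_output, cost_usd, path="runs.jsonl"):
    """Append one structured, auditable record per agent run (JSON Lines)."""
    record = {
        "run_id": str(uuid.uuid4()),   # stable ID so reviewers can reference a run
        "timestamp": time.time(),
        "inputs": inputs,              # what the user/system asked for
        "plan": plan,                  # the agent's stated plan
        "tool_calls": tool_calls,      # each call: name, args, output
        "final_output": final_output,
        "cost_usd": cost_usd,          # feeds cost-per-run tracking at launch
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


# Example: one run with a single tool call.
rec = log_run(
    inputs={"ticket_id": "T-123", "question": "reset password"},
    plan="Search KB, then draft reply",
    tool_calls=[{"name": "kb_search", "args": {"q": "reset password"}, "output": "KB-42"}],
    final_output="Draft reply citing KB-42",
    cost_usd=0.012,
)
```

One record per run keeps weekly sampling simple: pick random lines from the file, plus every line a policy gate flagged.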
- Create a weekly sampling process (e.g., review 20 random runs plus all flagged runs).
- Define an SLO for the workflow (example: “85% of drafts accepted with minor edits”).

7) Launch (Weeks 7–10)
- Pilot with a small internal group or 5–10% of traffic.
- Keep a rollback switch (feature flag) and clear ownership (one DRI).
- Track KPI, acceptance rate, escalation rate, and cost per run daily.

8) Expand Autonomy (Weeks 10–13)
- Expand scope only after 2–3 consecutive weeks of meeting thresholds.
- Add new tools one at a time; update the eval set with their new failure modes.
- Consider limited autonomy for reversible actions (e.g., draft-only, propose-only).

Definition of Done (for v1)
- KPI improved vs. baseline.
- Audit logs exist for every run.
- Budgets/timeouts enforced.
- Evals run in CI with documented thresholds.
- Human approval gate for external actions.

Suggested weekly cadence
- 30 min “agent retro”: top failures, top wins, eval updates, policy updates, cost review.
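The hard limits from section 4 (tool allowlist, max steps, max tool calls, cost budget) can be sketched as a small run guard that the agent loop calls before each step and tool call. This is a sketch under assumptions: the class, tool names, and default limits are all illustrative, not a specific framework's API.

```python
class BudgetExceeded(Exception):
    """Raised when a run exceeds its step, tool-call, cost, or allowlist limits."""


class RunGuard:
    """Enforce per-run hard limits: max steps, max tool calls, USD cost budget."""

    def __init__(self, max_steps=20, max_tool_calls=10, budget_usd=0.50,
                 allowed_tools=("kb_search", "crm_lookup")):
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls
        self.budget_usd = budget_usd
        self.allowed_tools = set(allowed_tools)
        self.steps = 0
        self.tool_calls = 0
        self.cost_usd = 0.0

    def check_step(self):
        """Call once per agent loop iteration; abort the run past the limit."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")

    def check_tool_call(self, tool_name, call_cost_usd):
        """Call before executing a tool: enforce allowlist, call count, budget."""
        if tool_name not in self.allowed_tools:
            raise BudgetExceeded(f"tool {tool_name!r} not on allowlist")
        self.tool_calls += 1
        self.cost_usd += call_cost_usd
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool-call limit {self.max_tool_calls} exceeded")
        if self.cost_usd > self.budget_usd:
            raise BudgetExceeded(f"cost budget ${self.budget_usd} exceeded")


# Example: a guard that blocks the third tool call.
guard = RunGuard(max_tool_calls=2, budget_usd=0.10)
guard.check_tool_call("kb_search", 0.04)
guard.check_tool_call("kb_search", 0.04)
try:
    guard.check_tool_call("kb_search", 0.04)
    blocked = False
except BudgetExceeded:
    blocked = True
```

Keeping the guard as one object per run also gives the audit log its numbers: steps, tool calls, and cost are all in one place when the run ends.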