Delegation With Receipts — Agent Workflow Launch Checklist (30–60 days)

Purpose
Use this checklist to launch an agent workflow that delivers measurable throughput without creating silent quality risk. The goal is a repeatable operating standard: bounded delegation + enforceable proofs.

1) Choose the workflow (Day 1–3)
- Pick one workflow with clear inputs/outputs and an existing human baseline (e.g., Tier-1 support tagging, internal tool scaffolding, outbound email drafts).
- Confirm the blast radius: what’s the worst plausible failure (money movement, PII leakage, downtime, brand harm)?
- Assign a single accountable owner (DRI). One person, not a committee.

2) Define the “work packet” (Day 3–7)
- Scope boundary: what the agent may do, and what it must not do.
- Data boundary: allowed sources (docs, tickets, repos) and explicitly forbidden sources.
- Tool boundary: which tools it can call (repo read, PR create, ticket update, email draft) and which are blocked.
- Acceptance boundary: what must be true for an output to be accepted.

3) Require proof artifacts (Week 2)
- For code: tests passed, lint passed, security scan passed, linked ticket/PR description with rationale.
- For knowledge work: citations to sources, confidence flags, and a structured summary.
- For ops: full audit log of tool calls and policy checks.
Rule: no proof artifact = no ship.

4) Set metrics + thresholds (Week 2)
- Acceptance rate (target: start at 60–70%, improve to 80–90% depending on domain).
- Escaped defect rate / incident attribution (review explicitly each week).
- Time-to-trust proxy: reviewer time per output; % of outputs requiring deep rework.
- Cost per accepted output (tokens + tool calls + evals).

5) Budget and controls (Week 2–3)
- Set a monthly spend cap for the pilot (example: $5k–$15k).
- Rate-limit runs per day and max tokens per run.
- Create named agent identities with least privilege.
- Turn on logging for prompts, retrievals, and tool calls (redact secrets/PII).
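The "no proof artifact = no ship" rule above can be enforced mechanically as a gate check before anything merges or sends. A minimal sketch, assuming a hypothetical artifact schema (the artifact names and `AgentOutput` type are illustrative, not a standard):

```python
from dataclasses import dataclass, field

# Hypothetical proof artifacts for a code workflow; adapt the set per
# workflow type (knowledge work, ops) as defined in step 3.
REQUIRED_ARTIFACTS = {"tests_passed", "lint_passed", "security_scan_passed", "rationale"}

@dataclass
class AgentOutput:
    workflow: str
    # artifact name -> evidence reference (CI run, log link, citation list)
    artifacts: dict = field(default_factory=dict)

def can_ship(output: AgentOutput) -> tuple[bool, list[str]]:
    """Enforce the rule: no proof artifact = no ship."""
    missing = sorted(REQUIRED_ARTIFACTS - set(output.artifacts))
    return (not missing, missing)

# An output with only two of four required artifacts is blocked:
ok, missing = can_ship(AgentOutput("pr-review", {"tests_passed": "ci-run", "lint_passed": "ci-run"}))
# ok is False; missing names the absent artifacts
```

The point of the gate is that the evidence reference travels with the output, so reviewers and the monthly audit check the same record.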
6) Launch behind a gate (Week 3–4)
- Require 100% human approval initially.
- Standardize the review rubric (what reviewers check every time).
- Maintain a rollback switch (feature flag) to disable the workflow instantly.

7) Build evals from real failures (Week 4–8)
- Categorize failures weekly (missing edge case, wrong source, policy violation, formatting, unsafe action).
- Add each failure as an eval case; run the regression suite before changes.
- Set drift triggers: if acceptance drops or defects rise above threshold, auto-disable and investigate.

8) Scale deliberately (Week 6–10)
- Relax gates only after sustained performance (e.g., 85% acceptance for 14 days).
- Expand to adjacent workflows within the same risk tier.
- Document decision rights: who can change prompts/tools/policies; who approves model changes.

Exit criteria for “production-ready”
- Named DRI, documented work packet, enforceable permissions.
- Proof artifacts attached to every output.
- Metrics dashboard live (acceptance, defects, cost per accepted output).
- Rollback triggers defined and tested.
- Monthly audit cadence scheduled (security + workflow owner).
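The drift triggers in step 7 can be sketched as a simple threshold check wired to the rollback switch from step 6. The thresholds below are example values (the checklist itself leaves them domain-dependent), and the function names are illustrative:

```python
# Illustrative drift trigger: auto-disable the workflow when acceptance
# falls below its floor or escaped defects rise above their ceiling.
# Threshold values here are assumptions; set them per domain and risk tier.

ACCEPTANCE_FLOOR = 0.80   # e.g., sustained 80% acceptance required
DEFECT_CEILING = 0.05     # e.g., >5% escaped defects halts the workflow

def should_disable(accepted: int, reviewed: int, escaped_defects: int) -> bool:
    """Return True when the workflow should be auto-disabled and investigated."""
    if reviewed == 0:
        return False  # nothing reviewed yet; no signal either way
    acceptance_rate = accepted / reviewed
    defect_rate = escaped_defects / reviewed
    return acceptance_rate < ACCEPTANCE_FLOOR or defect_rate > DEFECT_CEILING

# 70 of 100 accepted (70% < 80% floor) -> disable and investigate
assert should_disable(accepted=70, reviewed=100, escaped_defects=2)
```

In practice this check runs on a rolling window (e.g., the last 7 or 14 days of outputs) so a single bad day does not flap the kill switch, and the disable action flips the same feature flag used for manual rollback.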