Delegation With Receipts — Agent Workflow Launch Checklist (30–60 days)

Purpose
Use this checklist to launch an agent workflow that delivers measurable throughput without creating silent quality risk. The goal is a repeatable operating standard: bounded delegation + enforceable proofs.

1) Choose the workflow (Day 1–3)
- Pick one workflow with clear inputs/outputs and an existing human baseline (e.g., Tier-1 support tagging, internal tool scaffolding, outbound email drafts).
- Confirm the blast radius: what’s the worst plausible failure (money movement, PII leakage, downtime, brand harm)?
- Assign a single accountable owner (DRI). One person, not a committee.

2) Define the “work packet” (Day 3–7)
- Scope boundary: what the agent may do, and what it must not do.
- Data boundary: allowed sources (docs, tickets, repos) and explicitly forbidden sources.
- Tool boundary: which tools it can call (repo read, PR create, ticket update, email draft) and which are blocked.
- Acceptance boundary: what must be true for an output to be accepted.

3) Require proof artifacts (Week 2)
- For code: tests passed, lint passed, security scan passed, linked ticket/PR description with rationale.
- For knowledge work: citations to sources, confidence flags, and a structured summary.
- For ops: full audit log of tool calls and policy checks.
Rule: no proof artifact = no ship.

4) Set metrics + thresholds (Week 2)
- Acceptance rate (target: start at 60–70%, improve to 80–90% depending on domain).
- Escaped defect rate / incident attribution (review explicitly each week).
- Time-to-trust proxy: reviewer time per output; % of outputs requiring deep rework.
- Cost per accepted output (tokens + tool calls + evals).

5) Budget and controls (Week 2–3)
- Set a monthly spend cap for the pilot (example: $5k–$15k).
- Rate-limit runs per day and max tokens per run.
- Create named agent identities with least privilege.
- Turn on logging for prompts, retrievals, and tool calls (redact secrets/PII).
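The "no proof artifact = no ship" rule above can be enforced mechanically as a gate check before anything merges or sends. A minimal sketch, assuming a hypothetical artifact schema (the artifact names and `AgentOutput` type are illustrative, not a standard):

```python
from dataclasses import dataclass, field

# Hypothetical proof artifacts for a code workflow; adapt the set per
# workflow type (knowledge work, ops) as defined in step 3.
REQUIRED_ARTIFACTS = {"tests_passed", "lint_passed", "security_scan_passed", "rationale"}

@dataclass
class AgentOutput:
    workflow: str
    # artifact name -> evidence reference (CI run, log link, citation list)
    artifacts: dict = field(default_factory=dict)

def can_ship(output: AgentOutput) -> tuple[bool, list[str]]:
    """Enforce the rule: no proof artifact = no ship."""
    missing = sorted(REQUIRED_ARTIFACTS - set(output.artifacts))
    return (not missing, missing)

# An output with only two of four required artifacts is blocked:
ok, missing = can_ship(AgentOutput("pr-review", {"tests_passed": "ci-run", "lint_passed": "ci-run"}))
# ok is False; missing names the absent artifacts
```

The point of the gate is that the evidence reference travels with the output, so reviewers and the monthly audit check the same record.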
6) Launch behind a gate (Week 3–4)
- Require 100% human approval initially.
- Standardize the review rubric (what reviewers check every time).
- Maintain a rollback switch (feature flag) to disable the workflow instantly.

7) Build evals from real failures (Week 4–8)
- Categorize failures weekly (missing edge case, wrong source, policy violation, formatting, unsafe action).
- Add each failure as an eval case; run the regression suite before changes.
- Set drift triggers: if acceptance drops or defects rise above threshold, auto-disable and investigate.

8) Scale deliberately (Week 6–10)
- Relax gates only after sustained performance (e.g., 85% acceptance for 14 days).
- Expand to adjacent workflows within the same risk tier.
- Document decision rights: who can change prompts/tools/policies; who approves model changes.

Exit criteria for “production-ready”
- Named DRI, documented work packet, enforceable permissions.
- Proof artifacts attached to every output.
- Metrics dashboard live (acceptance, defects, cost per accepted output).
- Rollback triggers defined and tested.
- Monthly audit cadence scheduled (security + workflow owner).
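The drift triggers in step 7 can be sketched as a simple threshold check wired to the rollback switch from step 6. The thresholds below are example values (the checklist itself leaves them domain-dependent), and the function names are illustrative:

```python
# Illustrative drift trigger: auto-disable the workflow when acceptance
# falls below its floor or escaped defects rise above their ceiling.
# Threshold values here are assumptions; set them per domain and risk tier.

ACCEPTANCE_FLOOR = 0.80   # e.g., sustained 80% acceptance required
DEFECT_CEILING = 0.05     # e.g., >5% escaped defects halts the workflow

def should_disable(accepted: int, reviewed: int, escaped_defects: int) -> bool:
    """Return True when the workflow should be auto-disabled and investigated."""
    if reviewed == 0:
        return False  # nothing reviewed yet; no signal either way
    acceptance_rate = accepted / reviewed
    defect_rate = escaped_defects / reviewed
    return acceptance_rate < ACCEPTANCE_FLOOR or defect_rate > DEFECT_CEILING

# 70 of 100 accepted (70% < 80% floor) -> disable and investigate
assert should_disable(accepted=70, reviewed=100, escaped_defects=2)
```

In practice this check runs on a rolling window (e.g., the last 7 or 14 days of outputs) so a single bad day does not flap the kill switch, and the disable action flips the same feature flag used for manual rollback.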