30-Day Production Agent Checklist (Autonomy With Guardrails) Use this checklist to ship one workflow agent to production in ~30 days without creating an un-auditable, unbounded risk surface. 1) Scope & Success Criteria (Days 1–3) - Choose ONE workflow with a clear start/end (e.g., ticket triage + draft response; invoice match + exception routing). - Define a single primary metric and baseline it from last 30 days (examples: cost per ticket, first-response time, % tickets deflected, % invoices auto-matched). - Define hard boundaries: what the agent must never do (e.g., delete data, change permissions, send bulk external emails). - Define fallback behavior: when uncertain, escalate to a human queue. 2) Data & Integrations (Days 3–7) - List systems of record (Zendesk/Salesforce/Stripe/NetSuite/GitHub) and identify required read vs write operations. - Start with read-only tokens; implement write only after logging + review. - Create a de-identified dataset of 200–500 real tasks for evaluation. - Confirm data retention and logging policy (what you store, for how long, and where). 3) Agent Design (Days 7–14) - Specify tool schemas as APIs: required fields, types, allowed enums, and error codes. - Add a policy gate in code before any tool execution (amount thresholds, recipient limits, domain allowlists, etc.). - Implement deterministic validators (JSON schema validation, required fields, formatting rules). - Add caching and rate-limit handling; set timeouts and max retries per task. 4) Evaluation & Release Plan (Days 14–21) - Build offline evals: routing accuracy, extraction correctness, policy-trigger correctness, and output quality. - Implement trace capture: prompts, retrieved context, tool inputs/outputs, and final decision. - Run replay tests when changing prompts or model versions. - Plan staged rollout: internal dogfood → 5% customers → 25% → 100%. 5) Production Controls (Days 21–30) - Launch in “draft mode” first (human approve/send). Track acceptance rate and rejection reasons. - Define escalation triggers: low confidence, missing required data, policy violation, >2 retries, tool errors. - Create dashboards: cost per task, latency (p50/p95), tool-call count, escalation rate, error rate. - Set kill switches: disable specific tools instantly; disable all write actions; fall back to read-only. 6) Post-Launch Iteration (Ongoing) - Weekly: review top failure modes and update policies/validators (prefer code changes over prompt hacks). - Monthly: expand to the next adjacent workflow ONLY after metrics improve and costs are stable. - Quarterly: security review of tool scopes, secrets handling, and audit logs; refresh eval datasets with new real cases. Outcome Goal By day 30, you should be able to state (with evidence): - The agent’s cost per completed task - The agent’s success rate and escalation rate - The exact controls that prevent unsafe actions - The measurable impact on the chosen business metric