30-Day Production Agent Checklist (Autonomy With Guardrails)

Use this checklist to ship one workflow agent to production in ~30 days without creating an un-auditable, unbounded risk surface.

1) Scope & Success Criteria (Days 1–3)
- Choose ONE workflow with a clear start/end (e.g., ticket triage + draft response; invoice match + exception routing).
- Define a single primary metric and baseline it from last 30 days (examples: cost per ticket, first-response time, % tickets deflected, % invoices auto-matched).
- Define hard boundaries: what the agent must never do (e.g., delete data, change permissions, send bulk external emails).
- Define fallback behavior: when uncertain, escalate to a human queue.

2) Data & Integrations (Days 3–7)
- List systems of record (Zendesk/Salesforce/Stripe/NetSuite/GitHub) and identify required read vs write operations.
- Start with read-only tokens; implement write only after logging + review.
- Create a de-identified dataset of 200–500 real tasks for evaluation.
- Confirm data retention and logging policy (what you store, for how long, and where).

3) Agent Design (Days 7–14)
- Specify tool schemas as APIs: required fields, types, allowed enums, and error codes.
- Add a policy gate in code before any tool execution (amount thresholds, recipient limits, domain allowlists, etc.).
- Implement deterministic validators (JSON schema validation, required fields, formatting rules).
- Add caching and rate-limit handling; set timeouts and max retries per task.

4) Evaluation & Release Plan (Days 14–21)
- Build offline evals: routing accuracy, extraction correctness, policy-trigger correctness, and output quality.
- Implement trace capture: prompts, retrieved context, tool inputs/outputs, and final decision.
- Run replay tests when changing prompts or model versions.
- Plan staged rollout: internal dogfood → 5% customers → 25% → 100%.

5) Production Controls (Days 21–30)
- Launch in “draft mode” first (human approve/send). Track acceptance rate and rejection reasons.
- Define escalation triggers: low confidence, missing required data, policy violation, >2 retries, tool errors.
- Create dashboards: cost per task, latency (p50/p95), tool-call count, escalation rate, error rate.
- Set kill switches: disable specific tools instantly; disable all write actions; fall back to read-only.

6) Post-Launch Iteration (Ongoing)
- Weekly: review top failure modes and update policies/validators (prefer code changes over prompt hacks).
- Monthly: expand to the next adjacent workflow ONLY after metrics improve and costs are stable.
- Quarterly: security review of tool scopes, secrets handling, and audit logs; refresh eval datasets with new real cases.

Outcome Goal
By day 30, you should be able to state (with evidence):
- The agent’s cost per completed task
- The agent’s success rate and escalation rate
- The exact controls that prevent unsafe actions
- The measurable impact on the chosen business metric