Agent Control Plane Launch Checklist (30-Day Template) Use this to ship ONE agent workflow into production with safety, cost controls, and debuggability. The goal is not a perfect platform; it’s a repeatable pattern. 1) Choose the reference workflow (Day 0) - Define the workflow in one sentence (e.g., “triage inbound support tickets and set priority + routing”). - Quantify ROI: target minutes saved/run and expected volume per day. - Define success criteria: what fields are written, what tools are called, what “done” means. 2) Identity & permissions (Days 1–7) - Create a dedicated non-human identity (NHI) for the agent per environment (dev/stage/prod). - Separate read tools from write tools; default to read-only. - Use short-lived credentials (cloud role sessions) and automated rotation. - Document who can change permissions (owners + review process). 3) Tool gateway & policy (Days 8–15) - Route 100% of tool calls through a gateway service. - Enforce allowlists: agent can only call enumerated tools. - Validate tool arguments with schemas; reject unknown fields. - Add rate limits (per run and per minute) and max-step caps. - Add approval policy for high-risk actions (bulk writes; payouts; prod config changes). 4) Observability & audit (Days 1–15) - Assign run_id and request_id for every run and tool call. - Log: model used, prompt version, retrieved docs IDs, tool inputs/outputs, policy decisions. - Redact PII in traces by default; encrypt sensitive payloads. - Set retention rules (e.g., 30 days raw traces, 180 days metadata) aligned with compliance. 5) Cost controls (Days 16–22) - Set per-run budget in USD and enforce hard stops. - Implement model routing: cheaper model for extraction/classification; stronger model only for planning/verification. - Cache retrieval results and batch tool calls where possible. - Create dashboards: cost/run, p95 latency, tool calls/run, failure rate. 6) Evaluation & releases (Days 23–30) - Build a golden dataset (50–200 examples) that matches production distribution. - Write eval checks tied to outcomes (correct updates, correct routing, required citations). - Add CI gating: prompt/model changes must pass eval thresholds. - Deploy with canaries: 5% traffic for 24 hours before full rollout. 7) Incident readiness (Ongoing) - Define rollback plan (prompt version, model version, tool permissions). - Implement circuit breakers for degraded upstream APIs. - Add idempotency keys to write tools to avoid duplicate side effects. - Set on-call playbook: how to pause agents, revoke credentials, and export audit logs. Exit criteria (ship it) - You can replay a failed run end-to-end with the same trace. - You can prove which permissions the agent had at the time of action. - You can cap cost/run and stop runaway looping. - You have a regression suite and a release gate for changes.