Agent Control Plane Launch Checklist (30-Day Template)

Use this to ship ONE agent workflow into production with safety, cost controls, and debuggability. The goal is not a perfect platform; it’s a repeatable pattern.

1) Choose the reference workflow (Day 0)
- Define the workflow in one sentence (e.g., “triage inbound support tickets and set priority + routing”).
- Quantify ROI: target minutes saved/run and expected volume per day.
- Define success criteria: what fields are written, what tools are called, what “done” means.

2) Identity & permissions (Days 1–7)
- Create a dedicated non-human identity (NHI) for the agent per environment (dev/stage/prod).
- Separate read tools from write tools; default to read-only.
- Use short-lived credentials (cloud role sessions) and automated rotation.
- Document who can change permissions (owners + review process).

3) Tool gateway & policy (Days 8–15)
- Route 100% of tool calls through a gateway service.
- Enforce allowlists: agent can only call enumerated tools.
- Validate tool arguments with schemas; reject unknown fields.
- Add rate limits (per run and per minute) and max-step caps.
- Add approval policy for high-risk actions (bulk writes; payouts; prod config changes).

4) Observability & audit (Days 1–15)
- Assign run_id and request_id for every run and tool call.
- Log: model used, prompt version, retrieved docs IDs, tool inputs/outputs, policy decisions.
- Redact PII in traces by default; encrypt sensitive payloads.
- Set retention rules (e.g., 30 days raw traces, 180 days metadata) aligned with compliance.

5) Cost controls (Days 16–22)
- Set per-run budget in USD and enforce hard stops.
- Implement model routing: cheaper model for extraction/classification; stronger model only for planning/verification.
- Cache retrieval results and batch tool calls where possible.
- Create dashboards: cost/run, p95 latency, tool calls/run, failure rate.

6) Evaluation & releases (Days 23–30)
- Build a golden dataset (50–200 examples) that matches production distribution.
- Write eval checks tied to outcomes (correct updates, correct routing, required citations).
- Add CI gating: prompt/model changes must pass eval thresholds.
- Deploy with canaries: 5% traffic for 24 hours before full rollout.

7) Incident readiness (Ongoing)
- Define rollback plan (prompt version, model version, tool permissions).
- Implement circuit breakers for degraded upstream APIs.
- Add idempotency keys to write tools to avoid duplicate side effects.
- Set on-call playbook: how to pause agents, revoke credentials, and export audit logs.

Exit criteria (ship it)
- You can replay a failed run end-to-end with the same trace.
- You can prove which permissions the agent had at the time of action.
- You can cap cost/run and stop runaway looping.
- You have a regression suite and a release gate for changes.