AI CONTROL PLANE STARTER KIT (2026) — 30-DAY CHECKLIST

Goal: In 30 days, move from ad-hoc model calls to a governable AI control plane with routing, observability, budgets, evaluation, and basic security policy.

WEEK 1 — STANDARDIZE ACCESS + LOGGING
1) Pick a single entry point for model calls (gateway service or vendor gateway). Success metric: 90%+ of new AI features use the entry point.
2) Define a request envelope: use-case name, prompt/tool version, user/org ID, environment, and required safety flags.
3) Implement tracing fields: tokens in/out, latency, model name, tool calls (name + arguments), retrieval doc IDs, and final outcome status.
4) Create a “cost per task” dashboard (even if rough): dollars per ticket, per PR, per lead, etc.

WEEK 2 — ROUTING + BUDGETS (THE FINOPS BASELINE)
5) Define 3 model tiers (cheap / mid / premium). Map 5–10 workflows to tiers.
6) Add fallbacks: if a provider fails or times out, route to a backup.
7) Set hard limits per workflow: max tokens, max tool calls, max retries, and max wall-clock time.
8) Implement graceful degradation: under budget pressure or load, reduce context size, switch to a cheaper model, or return “needs human review.”

WEEK 3 — EVALUATION IN CI (STOP REGRESSIONS)
9) Build a golden set: start with 200 real tasks sampled from production logs (redact PII). Add edge cases.
10) Choose scoring: structured checks (JSON/schema), plus an LLM judge only where unavoidable. Calibrate judges with spot checks.
11) Add CI gating: fail PRs if the pass rate drops beyond a threshold (e.g., -2 to -5 percentage points, depending on volatility).
12) Add a regression triage loop: every week, review the top 20 failures and convert them into new tests.

WEEK 4 — POLICY + PERMISSIONS (MAKE IT AUDITABLE)
13) Define agent permission boundaries: read vs. write vs. execute. Separate “draft” from “send/commit/charge.”
14) Enforce retrieval allowlists (which indexes each workflow may use) and chunk caps.
15) Add redaction: strip or mask PII in logs; ensure retention rules are documented.
16) Produce an audit trace per run: inputs, retrieved sources, tool calls, outputs, timestamps, and identity.

OPERATING RHYTHM (ONGOING)
- Monthly AI spend report: by team and workflow; include top cost drivers and top failure modes.
- Incident taxonomy: runaway loops, tool misuse, policy blocks, retrieval failures, model/provider outages.
- Change control: prompts/tools/policies are versioned artifacts; every change has an eval run and rollback plan.

DONE DEFINITION
You can answer, for any agent run in production: (1) what it did (trace), (2) what it cost (tokens and dollars), and (3) why it made its decision (retrieval + tool rationale).
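APPENDIX — SKETCHES (ILLUSTRATIVE ONLY)

The Week 1 request envelope (item 2) and tracing fields (item 3) can be sketched as two plain records. This is a minimal sketch, not a standard schema: every field name, and the per-1K-token pricing in cost_usd, is an assumption you should adapt to your own gateway.

```python
from dataclasses import dataclass, field

@dataclass
class RequestEnvelope:
    # Item 2: who is calling, with which versioned artifact, under what policy.
    use_case: str                 # e.g. "ticket-triage"
    prompt_version: str           # versioned prompt/tool artifact
    user_id: str
    org_id: str
    environment: str              # "dev" | "staging" | "prod"
    safety_flags: list[str] = field(default_factory=list)

@dataclass
class TraceRecord:
    # Item 3: one record per model call, emitted to your logging pipeline.
    envelope: RequestEnvelope
    model: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    tool_calls: list[dict] = field(default_factory=list)       # {"name": ..., "arguments": ...}
    retrieval_doc_ids: list[str] = field(default_factory=list)
    outcome: str = "ok"           # "ok" | "error" | "needs_human_review"

    def cost_usd(self, in_price_per_1k: float, out_price_per_1k: float) -> float:
        # Feeds the item-4 "cost per task" dashboard; prices are dollars per 1K tokens.
        return (self.tokens_in / 1000) * in_price_per_1k \
             + (self.tokens_out / 1000) * out_price_per_1k
```

Because every call carries the same envelope, the item-4 dashboard becomes a group-by over use_case rather than a per-team log-scraping exercise.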
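The Week 2 rules (items 5–8) compose into one small routing function. The tier names, model names, limit values, and the TimeoutError-only failure handling below are assumptions for the sketch; a real gateway would also catch provider errors and enforce the wall-clock and tool-call limits inside call_model.

```python
TIERS = {  # item 5: three tiers; models after the first are backups (item 6)
    "cheap":   ["small-model-a"],
    "mid":     ["mid-model-a", "mid-model-b"],
    "premium": ["big-model-a", "mid-model-a"],
}

LIMITS = {  # item 7: hard per-workflow limits, passed through to the caller
    "ticket-triage": {"max_tokens": 2000, "max_tool_calls": 3,
                      "max_retries": 2, "max_seconds": 30},
}

def route(workflow: str, tier: str, call_model, under_pressure: bool = False):
    """Try each model in the tier in order; degrade instead of crashing (item 8)."""
    if under_pressure and tier == "premium":
        tier = "mid"  # graceful degradation: drop a tier under budget/load pressure
    limits = LIMITS[workflow]
    for model in TIERS[tier]:
        for _attempt in range(limits["max_retries"] + 1):
            try:
                return call_model(model, limits)
            except TimeoutError:
                continue  # retry this model, then fall through to the backup
    return {"outcome": "needs_human_review"}  # all models and retries exhausted
```

Note that the "needs human review" return value is an outcome, not an exception: the workflow keeps its contract with the caller even when every model is down.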
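The CI gate in item 11 reduces to one comparison against the golden-set baseline. The 3-point default below is just one value inside the suggested 2–5 point range; pick it per workflow based on how noisy your eval scores are.

```python
def ci_gate(baseline_pass_rate: float, candidate_pass_rate: float,
            max_drop_pts: float = 3.0) -> bool:
    """Item 11: allow the PR to merge only if the golden-set pass rate did not
    drop by more than max_drop_pts percentage points versus the baseline.

    Pass rates are fractions in [0, 1]; the threshold is in percentage points.
    """
    drop_pts = (baseline_pass_rate - candidate_pass_rate) * 100
    return drop_pts <= max_drop_pts
```

For example, going from a 92% to a 90% pass rate is a 2-point drop and merges under the default; a 6-point drop blocks the PR and feeds the item-12 triage loop.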