AI CONTROL PLANE STARTER KIT (2026) — 30-DAY CHECKLIST

Goal: In 30 days, move from ad-hoc model calls to a governable AI control plane with routing, observability, budgets, evaluation, and basic security policy.

WEEK 1 — STANDARDIZE ACCESS + LOGGING
1) Pick a single entry point for model calls (gateway service or vendor gateway). Success metric: 90%+ of new AI features use the entry point.
2) Define a request envelope: use-case name, prompt/tool version, user/org ID, environment, and required safety flags.
3) Implement tracing fields: tokens in/out, latency, model name, tool calls (name + arguments), retrieval doc IDs, and final outcome status.
4) Create a “cost per task” dashboard (even if rough): dollars per ticket, per PR, per lead, etc.

WEEK 2 — ROUTING + BUDGETS (THE FINOPS BASELINE)
5) Define 3 model tiers (cheap / mid / premium). Map 5–10 workflows to tiers.
6) Add fallbacks: if a provider fails or times out, route to a backup.
7) Set hard limits per workflow: max tokens, max tool calls, max retries, and max wall-clock time.
8) Implement graceful degradation: under budget pressure or load, reduce context size, switch to a cheaper model, or return “needs human review.”

WEEK 3 — EVALUATION IN CI (STOP REGRESSIONS)
9) Build a golden set: start with 200 real tasks sampled from production logs (redact PII). Add edge cases.
10) Choose scoring: structured checks (JSON/schema), plus an LLM judge only where unavoidable. Calibrate judges with spot checks.
11) Add CI gating: fail PRs if the pass rate drops beyond a threshold (e.g., -2 to -5 percentage points, depending on volatility).
12) Add a regression triage loop: every week, review the top 20 failures and convert them into new tests.

WEEK 4 — POLICY + PERMISSIONS (MAKE IT AUDITABLE)
13) Define agent permission boundaries: read vs. write vs. execute. Separate “draft” from “send/commit/charge.”
14) Enforce retrieval allowlists (which indexes each workflow may use) and chunk caps.
15) Add redaction: strip or mask PII in logs; ensure retention rules are documented.
16) Produce an audit trace per run: inputs, retrieved sources, tool calls, outputs, timestamps, and identity.

OPERATING RHYTHM (ONGOING)
- Monthly AI spend report: by team and workflow; include top cost drivers and top failure modes.
- Incident taxonomy: runaway loops, tool misuse, policy blocks, retrieval failures, model/provider outages.
- Change control: prompts/tools/policies are versioned artifacts; every change has an eval run and rollback plan.

DONE DEFINITION
You can answer, for any agent run in production: (1) what it did (trace), (2) what it cost (tokens and dollars), and (3) why it made its decision (retrieval + tool rationale).
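APPENDIX — SKETCHES (ILLUSTRATIVE ONLY)

The Week 1 request envelope (item 2) and tracing fields (item 3) can be sketched as two plain records. This is a minimal sketch, not a standard schema: every field name, and the per-1K-token pricing in cost_usd, is an assumption you should adapt to your own gateway.

```python
from dataclasses import dataclass, field

@dataclass
class RequestEnvelope:
    # Item 2: who is calling, with which versioned artifact, under what policy.
    use_case: str                 # e.g. "ticket-triage"
    prompt_version: str           # versioned prompt/tool artifact
    user_id: str
    org_id: str
    environment: str              # "dev" | "staging" | "prod"
    safety_flags: list[str] = field(default_factory=list)

@dataclass
class TraceRecord:
    # Item 3: one record per model call, emitted to your logging pipeline.
    envelope: RequestEnvelope
    model: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    tool_calls: list[dict] = field(default_factory=list)       # {"name": ..., "arguments": ...}
    retrieval_doc_ids: list[str] = field(default_factory=list)
    outcome: str = "ok"           # "ok" | "error" | "needs_human_review"

    def cost_usd(self, in_price_per_1k: float, out_price_per_1k: float) -> float:
        # Feeds the item-4 "cost per task" dashboard; prices are dollars per 1K tokens.
        return (self.tokens_in / 1000) * in_price_per_1k \
             + (self.tokens_out / 1000) * out_price_per_1k
```

Because every call carries the same envelope, the item-4 dashboard becomes a group-by over use_case rather than a per-team log-scraping exercise.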
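The Week 2 rules (items 5–8) compose into one small routing function. The tier names, model names, limit values, and the TimeoutError-only failure handling below are assumptions for the sketch; a real gateway would also catch provider errors and enforce the wall-clock and tool-call limits inside call_model.

```python
TIERS = {  # item 5: three tiers; models after the first are backups (item 6)
    "cheap":   ["small-model-a"],
    "mid":     ["mid-model-a", "mid-model-b"],
    "premium": ["big-model-a", "mid-model-a"],
}

LIMITS = {  # item 7: hard per-workflow limits, passed through to the caller
    "ticket-triage": {"max_tokens": 2000, "max_tool_calls": 3,
                      "max_retries": 2, "max_seconds": 30},
}

def route(workflow: str, tier: str, call_model, under_pressure: bool = False):
    """Try each model in the tier in order; degrade instead of crashing (item 8)."""
    if under_pressure and tier == "premium":
        tier = "mid"  # graceful degradation: drop a tier under budget/load pressure
    limits = LIMITS[workflow]
    for model in TIERS[tier]:
        for _attempt in range(limits["max_retries"] + 1):
            try:
                return call_model(model, limits)
            except TimeoutError:
                continue  # retry this model, then fall through to the backup
    return {"outcome": "needs_human_review"}  # all models and retries exhausted
```

Note that the "needs human review" return value is an outcome, not an exception: the workflow keeps its contract with the caller even when every model is down.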
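The CI gate in item 11 reduces to one comparison against the golden-set baseline. The 3-point default below is just one value inside the suggested 2–5 point range; pick it per workflow based on how noisy your eval scores are.

```python
def ci_gate(baseline_pass_rate: float, candidate_pass_rate: float,
            max_drop_pts: float = 3.0) -> bool:
    """Item 11: allow the PR to merge only if the golden-set pass rate did not
    drop by more than max_drop_pts percentage points versus the baseline.

    Pass rates are fractions in [0, 1]; the threshold is in percentage points.
    """
    drop_pts = (baseline_pass_rate - candidate_pass_rate) * 100
    return drop_pts <= max_drop_pts
```

For example, going from a 92% to a 90% pass rate is a 2-point drop and merges under the default; a 6-point drop blocks the PR and feeds the item-12 triage loop.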