MODEL-CONTROL PLANE BUILD SHEET (30-DAY STARTER)

Goal: Put ONE production workflow behind a model-control plane with routing, eval gates, policy enforcement, tracing, and budget controls. Don’t boil the ocean.

1) Pick the workflow (Day 1)
- Choose a high-traffic or high-risk path (support reply draft, invoice extraction, internal Q&A, etc.).
- Write the “definition of done” in one paragraph: what output is acceptable, what must never happen.

2) Create the request contract (Days 1–3)
- Define a request schema: user intent, tenant/workspace, data classification (public/internal/PII), allowed tools.
- Define output requirements: JSON schema if possible, citations required or not, max length, forbidden content.

3) Put a gateway in front (Days 3–7)
- Centralize all model calls behind a single internal endpoint.
- Implement timeouts, retries, and explicit fallbacks (model A → model B → “degraded mode”).
- Tag every request with: tenant, feature name, version, and cost center.

4) Add tracing and audit logs (Days 5–10)
- Log: prompt template version, system instructions, retrieval query, retrieved doc IDs, tool calls + args, final output.
- Tie logs to a user action ID so you can replay incidents.
- Decide retention and redaction rules (especially for PII/PHI).

5) Build evals that can fail a release (Days 7–18)
- Create a small “golden set” of real inputs (sanitized) that represent core cases + edge cases.
- Add property checks: “no restricted tool calls,” “no secrets,” “must include citations,” “must be valid JSON,” etc.
- Run evals on every prompt/tool change. If it fails, it doesn’t ship.

6) Implement policy enforcement (Days 10–22)
- Tool permissioning: least privilege by tenant and by workflow.
- Validate tool inputs like an API: strict schemas, allowlists for URLs/domains, length limits.
- Separate instruction from data: treat retrieved text as untrusted.

7) Put budgets and throttles in place (Days 14–26)
- Set per-tenant rate limits and soft budgets.
- Design graceful degradation: smaller model, shorter context, cached answer, or “human review required.”
- Add alerts keyed to feature/tenant, not just cloud billing.

8) Run an incident drill (Days 20–30)
- Simulate: provider outage, rate limiting, model behavior drift, prompt injection attempt.
- Verify: detection (alerts), containment (fallback/degrade), diagnosis (traces), and recovery steps.

Deliverables at Day 30:
- One workflow fully behind the gateway.
- CI eval gate with a golden dataset.
- Tracing/audit logs with replay capability.
- Enforced tool permissions + input validation.
- Tenant budgets + graceful degradation behavior.

If you only do one thing: make prompt/tool changes unshippable without evals and traces. That’s the line between “AI demo” and “AI product.”