AI FEATURE CONTROL PLANE CHECKLIST (2026)

Use this checklist before launching (and whenever you change) an AI feature.

1) DEFINE THE JOB + SUCCESS
- User job statement: “When ___, I want ___, so I can ___.”
- Success metric (behavioral, not vibes): e.g., % accepted suggestions, % tasks completed, reduction in time-to-first-draft.
- Failure metric: escalation rate, undo rate within 30 seconds, complaint rate per 1,000 sessions.
- SLA targets: p95 latency (seconds), uptime (%), max error rate (%).

2) AUDITABILITY (REPLAYABILITY)
- Log a session ID for every interaction.
- Store model metadata: provider, model name, version/date, temperature/top_p.
- Store retrieval metadata: source system, document IDs, timestamps, chunk IDs, scores.
- Store tool-call metadata: tool name, inputs, outputs, permission checks, approvals.
- Build “replay” tooling: reproduce the same run with the same context in under 10 minutes.
- Set retention policy (e.g., 30 days) and redaction rules for sensitive fields.

3) QUALITY EVALS AS CI
- Create a golden set (50–500 examples) including edge cases and “must-not” behaviors.
- Define pass/fail rubrics: correctness, citation presence, refusal appropriateness, policy adherence.
- Run evals on every change: prompt edits, retrieval source changes, model upgrades.
- Track regressions by category (verbosity, hallucinations, refusal rate, formatting breaks).

4) POLICY + PERMISSIONS
- Role-based access: which roles can read/write which systems.
- Human-in-the-loop rules: require approval for email send, billing changes, record deletions.
- Rate limits per session and per user to prevent runaway agent loops.
- Denylist retrieval sources (HR private, legal privileged, etc.).

5) COST GOVERNANCE
- Define unit economics: cost per successful task (not per token).
- Add routing: small model for easy tasks; premium model only on hard cases.
- Add caching where safe (embeddings, retrieval results, deterministic summaries).
- Set spend alerts: per-customer daily budget, anomaly detection for spikes.

6) ROLLOUT + RELIABILITY
- Feature flag + staged rollout: internal → 1–5% → 25% → 100%.
- Compare cohorts for 7+ days; roll back if helpfulness drops or cost spikes.
- Design graceful degradation: fallback model, partial results, clear errors.
- Provide a changelog and (for enterprise) model version pinning or change notifications.

7) PRICING + PACKAGING
- Pick a value-aligned unit: AI-resolved ticket, document processed, task run.
- Include baseline entitlements; meter heavy usage with clear overage rules.
- Provide “usage receipts” to customers (tasks, credits, model tier used).
- Ensure margins: caps, throttles, and policy constraints that prevent unlimited premium usage.

If you can’t (a) replay a bad output, (b) measure helpfulness weekly, and (c) explain spend per task, you’re not shipping a product—you’re shipping a demo.