AI FEATURE CONTROL PLANE CHECKLIST (2026)

Use this checklist before launching (and whenever you change) an AI feature.

1) DEFINE THE JOB + SUCCESS
- User job statement: “When ___, I want ___, so I can ___.”
- Success metric (behavioral, not vibes): e.g., % accepted suggestions, % tasks completed, reduction in time-to-first-draft.
- Failure metric: escalation rate, undo rate within 30 seconds, complaint rate per 1,000 sessions.
- SLA targets: p95 latency (seconds), uptime (%), max error rate (%).

2) AUDITABILITY (REPLAYABILITY)
- Log a session ID for every interaction.
- Store model metadata: provider, model name, version/date, temperature/top_p.
- Store retrieval metadata: source system, document IDs, timestamps, chunk IDs, scores.
- Store tool-call metadata: tool name, inputs, outputs, permission checks, approvals.
- Build “replay” tooling: reproduce the same run with the same context in under 10 minutes.
- Set retention policy (e.g., 30 days) and redaction rules for sensitive fields.

3) QUALITY EVALS AS CI
- Create a golden set (50–500 examples) including edge cases and “must-not” behaviors.
- Define pass/fail rubrics: correctness, citation presence, refusal appropriateness, policy adherence.
- Run evals on every change: prompt edits, retrieval source changes, model upgrades.
- Track regressions by category (verbosity, hallucinations, refusal rate, formatting breaks).

4) POLICY + PERMISSIONS
- Role-based access: which roles can read/write which systems.
- Human-in-the-loop rules: require approval for email send, billing changes, record deletions.
- Rate limits per session and per user to prevent runaway agent loops.
- Denylist retrieval sources (HR private, legal privileged, etc.).

5) COST GOVERNANCE
- Define unit economics: cost per successful task (not per token).
- Add routing: small model for easy tasks; premium model only on hard cases.
- Add caching where safe (embeddings, retrieval results, deterministic summaries).
- Set spend alerts: per-customer daily budget, anomaly detection for spikes.

6) ROLLOUT + RELIABILITY
- Feature flag + staged rollout: internal → 1–5% → 25% → 100%.
- Compare cohorts for 7+ days; roll back if helpfulness drops or cost spikes.
- Design graceful degradation: fallback model, partial results, clear errors.
- Provide a changelog and (for enterprise) model version pinning or change notifications.

7) PRICING + PACKAGING
- Pick a value-aligned unit: AI-resolved ticket, document processed, task run.
- Include baseline entitlements; meter heavy usage with clear overage rules.
- Provide “usage receipts” to customers (tasks, credits, model tier used).
- Ensure margins: caps, throttles, and policy constraints that prevent unlimited premium usage.

If you can’t (a) replay a bad output, (b) measure helpfulness weekly, and (c) explain spend per task, you’re not shipping a product; you’re shipping a demo.
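
The "cost per successful task" metric from section 5 and the "roll back if helpfulness drops or cost spikes" rule from section 6 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `TaskRun` schema, the function names, and the default thresholds (a 2-point success drop, a 25% cost increase) are all hypothetical and should be replaced with your own logging fields and tolerances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    """One logged AI interaction (hypothetical schema, not a standard)."""
    session_id: str   # lets you find and replay the run later
    model: str        # provider/model/version string captured at call time
    cost_usd: float   # total spend for the run, all model calls included
    succeeded: bool   # behavioral success, e.g. the suggestion was accepted


def success_rate(runs: list[TaskRun]) -> float:
    """Fraction of runs that succeeded; 0.0 for an empty cohort."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r.succeeded) / len(runs)


def cost_per_successful_task(runs: list[TaskRun]) -> float:
    """Total spend divided by successful runs; failed runs still count as cost."""
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return float("inf")  # spend with nothing to show for it
    return sum(r.cost_usd for r in runs) / successes


def should_roll_back(
    control: list[TaskRun],
    treatment: list[TaskRun],
    max_success_drop: float = 0.02,   # assumed tolerance, in absolute points
    max_cost_increase: float = 0.25,  # assumed tolerance, as a fraction
) -> bool:
    """Return True if the treatment cohort is less helpful or costs too much more."""
    drop = success_rate(control) - success_rate(treatment)
    base_cost = cost_per_successful_task(control)
    new_cost = cost_per_successful_task(treatment)
    cost_spiked = new_cost > base_cost * (1 + max_cost_increase)
    return drop > max_success_drop or cost_spiked
```

Dividing by successes rather than by total runs is the point of the metric: a cheaper model that fails more often can still raise the cost of each task that actually gets done. In practice the two cohorts would come from the staged-rollout groups, compared over the 7+ day window the checklist calls for.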