Compound AI Production Readiness Checklist (2026)

Use this checklist to move from an "LLM feature" to a production-grade compound AI system.

1) Define the workflow and KPI
- Pick a single workflow (e.g., support resolution, sales quoting, IT triage).
- Define one primary KPI (resolution rate, cycle time, acceptance rate) and one safety KPI (wrong-action rate, escalation rate, policy violations).

2) Map the compound graph
- Write out the steps explicitly: route → retrieve → plan → execute tools → verify → respond → commit side effects.
- For each step, write down the failure mode and its fallback: retry, degrade to a smaller model, template response, or human handoff (see the fallback sketch below).

3) Choose model tiers and routing rules
- Assign a cheap, fast model for classification and extraction.
- Assign a mid-tier model for drafting.
- Define escalation conditions for routing to a frontier model: low confidence, complex reasoning, high-value users (see the routing sketch below).
- Set latency budgets per step (e.g., classifier <200 ms, retrieval <300 ms, generation <2.5 s).

4) Build permissioned retrieval
- Index only approved sources.
- Enforce document ACLs at query time (see the retrieval sketch below).
- Use hybrid search (keyword + vector) plus reranking.
- Define freshness SLAs (e.g., policy changes indexed within 24 hours).

5) Constrain tools like APIs, not prompts
- Prefer narrow tool endpoints (e.g., "issue_refund_under_50") over generic ones like "update_record."
- Validate all tool parameters server-side (see the tool sketch below).
- Use scoped, short-lived credentials.

6) Add verification before side effects
- Add deterministic schema checks (JSON Schema, regex, type validation).
- Add a verifier model for citation validity, PII leakage, and policy compliance.
- Commit side effects only after verification passes (see the verification sketch below).

7) Implement observability and audit logs
- Assign a correlation ID to every run (see the logging sketch below).
- Store prompts and outputs with redaction; log tool calls and their results.
- Dashboard: completion rate, escalation rate, tool error rate, cost per task, p95 latency.

8) Stand up an evaluation harness
- Create a golden set from real interactions (start with 100–300 examples).
- Label common failure modes (hallucination, wrong action, missing citation, policy breach).
- Run evals on every model, prompt, or config change; gate deployments on KPI thresholds (see the gating sketch below).

9) Control costs deliberately
- Add semantic caching for frequent queries (see the caching sketch below).
- Reduce context size with selective retrieval (3–8 chunks) and deduplication.
- Monitor cost per successful outcome, not cost per request.

10) Run security drills and operational reviews
- Run prompt-injection tests against your retrieval and tool layers.
- Set human-in-the-loop policies for sensitive actions.
- Hold a monthly review of incidents, near-misses, and model upgrades.

If you can check off items 1–7, you can ship safely to early customers. If you can also check off items 8–10, you can scale to enterprise expectations.
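
A minimal sketch of the per-step fallback ladder from item 2: retry once, degrade to a smaller model, then hand off to a human. The model call is a deliberately flaky stub, and the model names, retry count, and `HUMAN_HANDOFF` sentinel are illustrative assumptions:

```python
# Fallback ladder sketch. call_model is a stub standing in for a real
# model API; "mid-tier" and "small-tier" are hypothetical model names.
import random

def call_model(model: str, prompt: str) -> str:
    # Stub: pretend the mid-tier model times out half the time.
    if model == "mid-tier" and random.random() < 0.5:
        raise TimeoutError
    return f"[{model}] draft for: {prompt}"

def draft_with_fallbacks(prompt: str) -> str:
    # Try mid-tier twice (one retry), then degrade to the small model.
    for model in ("mid-tier", "mid-tier", "small-tier"):
        try:
            return call_model(model, prompt)
        except TimeoutError:
            continue
    return "HUMAN_HANDOFF"  # final fallback: template response or human queue

print(draft_with_fallbacks("summarize the customer's issue"))
```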
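
The routing rules in item 3 can be expressed as a small, testable function. This sketch assumes three tiers and confidence thresholds of 0.8 and 0.6; the model names, thresholds, and latency budgets are placeholders, not recommendations:

```python
# Tiered routing sketch: cheap model for classification/extraction,
# mid-tier for drafting, frontier only on defined escalation conditions.
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    latency_budget_ms: int

CHEAP = Route("small-classifier", 200)       # classification / extraction
MID = Route("mid-tier-drafter", 2500)        # drafting
FRONTIER = Route("frontier-reasoner", 8000)  # escalation only

def route(task_type: str, confidence: float, account_tier: str) -> Route:
    """Pick a model tier; escalate on low confidence or high-value users."""
    if task_type in {"classify", "extract"}:
        # Even cheap-tier work escalates when the classifier is unsure.
        return CHEAP if confidence >= 0.8 else MID
    if confidence < 0.6 or account_tier == "enterprise":
        return FRONTIER
    return MID

print(route("draft_reply", confidence=0.45, account_tier="standard"))
```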
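
For item 4, the key property is that ACLs are applied before scoring, so an unauthorized document can never reach the prompt. In this sketch the corpus, the two-dimensional "embeddings," and the 50/50 keyword-vector blend are toy stand-ins for a real search stack:

```python
# Permissioned hybrid retrieval sketch. DOCS, the scoring functions, and
# the group-based ACL model are simplified assumptions for illustration.
from math import sqrt

DOCS = [
    {"id": "d1", "text": "refund policy updated 2026", "acl": {"support"}, "vec": [0.9, 0.1]},
    {"id": "d2", "text": "executive compensation memo", "acl": {"hr"}, "vec": [0.2, 0.8]},
]

def keyword_score(query: str, text: str) -> float:
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def search(query: str, query_vec, user_groups: set, k: int = 3):
    # Enforce ACLs at query time: a document the user cannot read is
    # never scored, so it can never leak into the context window.
    visible = [d for d in DOCS if d["acl"] & user_groups]
    scored = [
        (0.5 * keyword_score(query, d["text"]) + 0.5 * cosine(query_vec, d["vec"]), d)
        for d in visible
    ]
    return [d for _, d in sorted(scored, key=lambda s: -s[0])[:k]]

print([d["id"] for d in search("refund policy", [0.9, 0.1], {"support"})])
```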
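
Item 5's point is that the tool itself enforces its own limits, server-side, where the model cannot negotiate. A sketch of a narrow refund endpoint, with a hypothetical order-ID convention and a stubbed payment call:

```python
# Narrow-tool sketch: the $50 cap and parameter checks live in the tool,
# not in the prompt. The "ord_" prefix and payment call are assumptions.
def issue_refund_under_50(order_id: str, amount: float) -> dict:
    """Narrow tool endpoint: scope is enforced here, not by instructions."""
    if not isinstance(order_id, str) or not order_id.startswith("ord_"):
        return {"ok": False, "error": "invalid order_id"}
    if not (0 < amount < 50):
        # The model cannot talk its way past this server-side check.
        return {"ok": False, "error": "amount outside tool scope"}
    # ... call the payment provider with a scoped, short-lived credential ...
    return {"ok": True, "refunded": amount}

print(issue_refund_under_50("ord_123", 25.00))
print(issue_refund_under_50("ord_123", 250.00))  # rejected server-side
```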
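
A sketch of item 6's deterministic gate: parse, type-check, and screen the model output before any side effect runs. The required fields, the SSN-style regex, and the escalation behavior are illustrative assumptions; a verifier-model pass would sit after these cheap checks:

```python
# Verification-before-side-effects sketch (Python 3.10+ for `dict | None`).
import json
import re

def verify(raw_output: str) -> dict | None:
    """Return the parsed output if it passes every check, else None."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    # Schema checks: required fields with the right types.
    if not isinstance(data.get("action"), str):
        return None
    if not isinstance(data.get("citations"), list) or not data["citations"]:
        return None
    # Cheap PII screen (SSN-like pattern) before any verifier-model call.
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", raw_output):
        return None
    return data

def commit(side_effect, raw_output: str):
    data = verify(raw_output)
    if data is None:
        raise ValueError("verification failed; escalate to a human")
    side_effect(data)  # side effect commits only after verification passes

commit(print, '{"action": "reply", "citations": ["doc-42"]}')
```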
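
For item 7, one correlation ID per run ties prompts, tool calls, and results together, and redaction happens before anything is persisted. The email regex and the print-based log sink below are simplified assumptions:

```python
# Audit-logging sketch: run-scoped correlation IDs plus redaction.
import json
import re
import time
import uuid

def redact(text: str) -> str:
    # Strip email-like strings before any record is stored.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED_EMAIL]", text)

class RunLogger:
    def __init__(self):
        self.run_id = str(uuid.uuid4())  # one correlation ID per run

    def log(self, event: str, **fields):
        record = {"run_id": self.run_id, "ts": time.time(), "event": event}
        record.update({k: redact(str(v)) for k, v in fields.items()})
        print(json.dumps(record))  # stand-in for a real log sink

log = RunLogger()
log.log("prompt", text="Reset password for jane@example.com")
log.log("tool_call", name="issue_refund_under_50", args={"amount": 25})
```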
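
Item 8's deployment gate reduces to: run the golden set, compute the primary and safety KPIs, and block the deploy unless both pass. The thresholds, the golden-set format, and `run_pipeline` below are assumptions to be swapped for your own harness:

```python
# Eval-gating sketch: both KPIs must clear their thresholds to deploy.
def run_pipeline(example: dict) -> dict:
    # Stand-in for the real compound pipeline under test.
    return {"resolved": example["expected_resolved"], "violation": False}

def gate(golden_set: list[dict], min_resolution: float = 0.85,
         max_violation_rate: float = 0.01) -> bool:
    results = [run_pipeline(ex) for ex in golden_set]
    resolution = sum(r["resolved"] for r in results) / len(results)
    violations = sum(r["violation"] for r in results) / len(results)
    print(f"resolution={resolution:.1%} violations={violations:.1%}")
    # Block the deployment unless the primary and safety KPI both pass.
    return resolution >= min_resolution and violations <= max_violation_rate

golden = [{"expected_resolved": True}] * 9 + [{"expected_resolved": False}]
print("deploy allowed:", gate(golden))
```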
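
Finally, a sketch of item 9's semantic cache: look up each query by embedding similarity and skip the model call on a hit. The two-dimensional "embedding" here is a toy stand-in (a real system would use a proper embedding model), and the 0.98 threshold is an arbitrary assumption:

```python
# Semantic-cache sketch (Python 3.10+). embed() is a toy placeholder.
from math import sqrt

def embed(text: str) -> list[float]:
    # Toy 2-dim "embedding": vowel fraction and scaled mean word length.
    vowels = sum(c in "aeiou" for c in text.lower()) / max(len(text), 1)
    words = text.split()
    avg_len = sum(len(w) for w in words) / max(len(words), 1)
    return [vowels, avg_len / 10]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str) -> str | None:
        qv = embed(query)
        for vec, answer in self.entries:
            if cosine(qv, vec) >= self.threshold:
                return answer  # cache hit: skip the model call entirely
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how do I reset my password", "Use the reset link on the login page.")
print(cache.get("how do i reset my password?"))  # near-duplicate hits the cache
```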