Compound AI Production Readiness Checklist (2026)

Use this checklist to move from an "LLM feature" to a production-grade compound AI system.

1) Define the workflow and KPI
- Pick a single workflow (e.g., support resolution, sales quoting, IT triage).
- Define one primary KPI (resolution rate, cycle time, acceptance rate) and one safety KPI (wrong-action rate, escalation rate, policy violations).

2) Map the compound graph
- Write out the steps explicitly: route → retrieve → plan → execute tools → verify → respond → commit side effects.
- For each step, write down the failure mode and its fallback: retry, degrade to a smaller model, template response, or human handoff (see the fallback sketch below).

3) Choose model tiers and routing rules
- Assign a cheap, fast model for classification and extraction.
- Assign a mid-tier model for drafting.
- Define escalation conditions for routing to a frontier model: low confidence, complex reasoning, high-value users (see the routing sketch below).
- Set latency budgets per step (e.g., classifier <200 ms, retrieval <300 ms, generation <2.5 s).

4) Build permissioned retrieval
- Index only approved sources.
- Enforce document ACLs at query time (see the retrieval sketch below).
- Use hybrid search (keyword + vector) plus reranking.
- Define freshness SLAs (e.g., policy changes indexed within 24 hours).

5) Constrain tools like APIs, not prompts
- Prefer narrow tool endpoints (e.g., "issue_refund_under_50") over generic ones like "update_record."
- Validate all tool parameters server-side (see the tool sketch below).
- Use scoped, short-lived credentials.

6) Add verification before side effects
- Add deterministic schema checks (JSON Schema, regex, type validation).
- Add a verifier model for citation validity, PII leakage, and policy compliance.
- Commit side effects only after verification passes (see the verification sketch below).

7) Implement observability and audit logs
- Assign a correlation ID to every run (see the logging sketch below).
- Store prompts and outputs with redaction; log tool calls and their results.
- Dashboard: completion rate, escalation rate, tool error rate, cost per task, p95 latency.

8) Stand up an evaluation harness
- Create a golden set from real interactions (start with 100–300 examples).
- Label common failure modes (hallucination, wrong action, missing citation, policy breach).
- Run evals on every model, prompt, or config change; gate deployments on KPI thresholds (see the gating sketch below).

9) Control costs deliberately
- Add semantic caching for frequent queries (see the caching sketch below).
- Reduce context size with selective retrieval (3–8 chunks) and deduplication.
- Monitor cost per successful outcome, not cost per request.

10) Run security drills and operational reviews
- Run prompt-injection tests against your retrieval and tool layers.
- Set human-in-the-loop policies for sensitive actions.
- Hold a monthly review of incidents, near-misses, and model upgrades.

If you can check off items 1–7, you can ship safely to early customers. If you can also check off items 8–10, you can scale to enterprise expectations.
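
A minimal sketch of the per-step fallback ladder from item 2: retry once, degrade to a smaller model, then hand off to a human. The model call is a deliberately flaky stub, and the model names, retry count, and `HUMAN_HANDOFF` sentinel are illustrative assumptions:

```python
# Fallback ladder sketch. call_model is a stub standing in for a real
# model API; "mid-tier" and "small-tier" are hypothetical model names.
import random

def call_model(model: str, prompt: str) -> str:
    # Stub: pretend the mid-tier model times out half the time.
    if model == "mid-tier" and random.random() < 0.5:
        raise TimeoutError
    return f"[{model}] draft for: {prompt}"

def draft_with_fallbacks(prompt: str) -> str:
    # Try mid-tier twice (one retry), then degrade to the small model.
    for model in ("mid-tier", "mid-tier", "small-tier"):
        try:
            return call_model(model, prompt)
        except TimeoutError:
            continue
    return "HUMAN_HANDOFF"  # final fallback: template response or human queue

print(draft_with_fallbacks("summarize the customer's issue"))
```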
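
The routing rules in item 3 can be expressed as a small, testable function. This sketch assumes three tiers and confidence thresholds of 0.8 and 0.6; the model names, thresholds, and latency budgets are placeholders, not recommendations:

```python
# Tiered routing sketch: cheap model for classification/extraction,
# mid-tier for drafting, frontier only on defined escalation conditions.
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    latency_budget_ms: int

CHEAP = Route("small-classifier", 200)       # classification / extraction
MID = Route("mid-tier-drafter", 2500)        # drafting
FRONTIER = Route("frontier-reasoner", 8000)  # escalation only

def route(task_type: str, confidence: float, account_tier: str) -> Route:
    """Pick a model tier; escalate on low confidence or high-value users."""
    if task_type in {"classify", "extract"}:
        # Even cheap-tier work escalates when the classifier is unsure.
        return CHEAP if confidence >= 0.8 else MID
    if confidence < 0.6 or account_tier == "enterprise":
        return FRONTIER
    return MID

print(route("draft_reply", confidence=0.45, account_tier="standard"))
```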
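
For item 4, the key property is that ACLs are applied before scoring, so an unauthorized document can never reach the prompt. In this sketch the corpus, the two-dimensional "embeddings," and the 50/50 keyword-vector blend are toy stand-ins for a real search stack:

```python
# Permissioned hybrid retrieval sketch. DOCS, the scoring functions, and
# the group-based ACL model are simplified assumptions for illustration.
from math import sqrt

DOCS = [
    {"id": "d1", "text": "refund policy updated 2026", "acl": {"support"}, "vec": [0.9, 0.1]},
    {"id": "d2", "text": "executive compensation memo", "acl": {"hr"}, "vec": [0.2, 0.8]},
]

def keyword_score(query: str, text: str) -> float:
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def search(query: str, query_vec, user_groups: set, k: int = 3):
    # Enforce ACLs at query time: a document the user cannot read is
    # never scored, so it can never leak into the context window.
    visible = [d for d in DOCS if d["acl"] & user_groups]
    scored = [
        (0.5 * keyword_score(query, d["text"]) + 0.5 * cosine(query_vec, d["vec"]), d)
        for d in visible
    ]
    return [d for _, d in sorted(scored, key=lambda s: -s[0])[:k]]

print([d["id"] for d in search("refund policy", [0.9, 0.1], {"support"})])
```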
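
Item 5's point is that the tool itself enforces its own limits, server-side, where the model cannot negotiate. A sketch of a narrow refund endpoint, with a hypothetical order-ID convention and a stubbed payment call:

```python
# Narrow-tool sketch: the $50 cap and parameter checks live in the tool,
# not in the prompt. The "ord_" prefix and payment call are assumptions.
def issue_refund_under_50(order_id: str, amount: float) -> dict:
    """Narrow tool endpoint: scope is enforced here, not by instructions."""
    if not isinstance(order_id, str) or not order_id.startswith("ord_"):
        return {"ok": False, "error": "invalid order_id"}
    if not (0 < amount < 50):
        # The model cannot talk its way past this server-side check.
        return {"ok": False, "error": "amount outside tool scope"}
    # ... call the payment provider with a scoped, short-lived credential ...
    return {"ok": True, "refunded": amount}

print(issue_refund_under_50("ord_123", 25.00))
print(issue_refund_under_50("ord_123", 250.00))  # rejected server-side
```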
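
A sketch of item 6's deterministic gate: parse, type-check, and screen the model output before any side effect runs. The required fields, the SSN-style regex, and the escalation behavior are illustrative assumptions; a verifier-model pass would sit after these cheap checks:

```python
# Verification-before-side-effects sketch (Python 3.10+ for `dict | None`).
import json
import re

def verify(raw_output: str) -> dict | None:
    """Return the parsed output if it passes every check, else None."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    # Schema checks: required fields with the right types.
    if not isinstance(data.get("action"), str):
        return None
    if not isinstance(data.get("citations"), list) or not data["citations"]:
        return None
    # Cheap PII screen (SSN-like pattern) before any verifier-model call.
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", raw_output):
        return None
    return data

def commit(side_effect, raw_output: str):
    data = verify(raw_output)
    if data is None:
        raise ValueError("verification failed; escalate to a human")
    side_effect(data)  # side effect commits only after verification passes

commit(print, '{"action": "reply", "citations": ["doc-42"]}')
```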
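
For item 7, one correlation ID per run ties prompts, tool calls, and results together, and redaction happens before anything is persisted. The email regex and the print-based log sink below are simplified assumptions:

```python
# Audit-logging sketch: run-scoped correlation IDs plus redaction.
import json
import re
import time
import uuid

def redact(text: str) -> str:
    # Strip email-like strings before any record is stored.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED_EMAIL]", text)

class RunLogger:
    def __init__(self):
        self.run_id = str(uuid.uuid4())  # one correlation ID per run

    def log(self, event: str, **fields):
        record = {"run_id": self.run_id, "ts": time.time(), "event": event}
        record.update({k: redact(str(v)) for k, v in fields.items()})
        print(json.dumps(record))  # stand-in for a real log sink

log = RunLogger()
log.log("prompt", text="Reset password for jane@example.com")
log.log("tool_call", name="issue_refund_under_50", args={"amount": 25})
```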
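
Item 8's deployment gate reduces to: run the golden set, compute the primary and safety KPIs, and block the deploy unless both pass. The thresholds, the golden-set format, and `run_pipeline` below are assumptions to be swapped for your own harness:

```python
# Eval-gating sketch: both KPIs must clear their thresholds to deploy.
def run_pipeline(example: dict) -> dict:
    # Stand-in for the real compound pipeline under test.
    return {"resolved": example["expected_resolved"], "violation": False}

def gate(golden_set: list[dict], min_resolution: float = 0.85,
         max_violation_rate: float = 0.01) -> bool:
    results = [run_pipeline(ex) for ex in golden_set]
    resolution = sum(r["resolved"] for r in results) / len(results)
    violations = sum(r["violation"] for r in results) / len(results)
    print(f"resolution={resolution:.1%} violations={violations:.1%}")
    # Block the deployment unless the primary and safety KPI both pass.
    return resolution >= min_resolution and violations <= max_violation_rate

golden = [{"expected_resolved": True}] * 9 + [{"expected_resolved": False}]
print("deploy allowed:", gate(golden))
```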
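
Finally, a sketch of item 9's semantic cache: look up each query by embedding similarity and skip the model call on a hit. The two-dimensional "embedding" here is a toy stand-in (a real system would use a proper embedding model), and the 0.98 threshold is an arbitrary assumption:

```python
# Semantic-cache sketch (Python 3.10+). embed() is a toy placeholder.
from math import sqrt

def embed(text: str) -> list[float]:
    # Toy 2-dim "embedding": vowel fraction and scaled mean word length.
    vowels = sum(c in "aeiou" for c in text.lower()) / max(len(text), 1)
    words = text.split()
    avg_len = sum(len(w) for w in words) / max(len(words), 1)
    return [vowels, avg_len / 10]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str) -> str | None:
        qv = embed(query)
        for vec, answer in self.entries:
            if cosine(qv, vec) >= self.threshold:
                return answer  # cache hit: skip the model call entirely
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how do I reset my password", "Use the reset link on the login page.")
print(cache.get("how do i reset my password?"))  # near-duplicate hits the cache
```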