COMPOUND AI SHIPPING CHECKLIST (2026)

Use this as a pre-launch and ongoing operations checklist for production AI systems (RAG, agents, copilots).

1) WORKFLOW & RISK DEFINITION
- Name the workflow (e.g., “support resolution”, “refund handling”, “PR generation”).
- Define success as a business metric: containment rate, AHT reduction, PR acceptance rate, time-to-resolution.
- Classify risk tier: LOW (drafting), MED (user-facing advice), HIGH (money movement, auth, legal commitments).
- Define allowed actions per tier. For HIGH risk, require a verifier plus structured outputs.

2) ROUTING & BUDGETS (COST + LATENCY)
- Implement an intent classifier/router (can be a small model) with explicit fallback rules.
- Set budgets: max tool calls, max tokens, max wall-clock time, and max retries.
- Track CPST (cost per successful task). Establish a target (e.g., <$0.01 for common actions).
- Add caching for stable outputs (policy summaries, common troubleshooting steps) with TTL and invalidation rules.

3) PRIVATE DATA PLANE (GOVERNED CONTEXT)
- Inventory data sources (docs, CRM, tickets, code, BI). Assign freshness SLAs per source.
- Implement permissioned retrieval: filter by identity and document-level/row-level ACLs BEFORE the model sees context.
- Log every retrieval: request_id, document IDs, chunk IDs, timestamps, and user identity.
- Set retention and deletion policies: how long prompts, outputs, and embeddings are stored.

4) TOOLING SAFETY (AGENTS)
- Create a tool allowlist per intent. Deny by default.
- Require structured outputs for actions (JSON Schema) and validate before execution.
- Add a verifier step for HIGH-risk actions (a second model or deterministic checks).
- Implement “safe failure states”: an explicit “cannot complete” response plus an escalation path.

5) EVALS & RELEASE ENGINEERING
- Build a golden set (50–500 examples) covering normal and edge cases.
- Add adversarial cases: prompt injection, data exfiltration attempts, permission bypass attempts.
- Define pass/fail gates: groundedness, citation correctness, refusal correctness, tool safety.
- Use shadow traffic (1–5%) before ramping; define rollback criteria (e.g., +2% escalation rate, +20% latency).

6) OBSERVABILITY & ON-CALL
- Instrument traces across retrieval, model calls, tool calls, and final outputs.
- Monitor P50/P95 latency, tool error rate, misroute rate, refusal rate, and cost per outcome.
- Establish an on-call playbook: how to disable a tool, flip a routing rule, or revert a prompt.

7) SECURITY & COMPLIANCE READINESS
- Document the data flow: where data transits, where it is stored, and who can access logs.
- Provide customer-facing controls: data residency (if applicable), retention settings, audit export.
- Run regular reviews with Security and Legal for HIGH-risk workflows.

If you can’t answer “What did the system retrieve, which tools did it call, and why did it act?”, you’re not ready for production.
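Several of the items above are concrete enough to sketch in code. The budget bullet in section 2 (max tool calls, max tokens, max wall-clock time) might look like the following; the class name, limits, and `charge` method are illustrative assumptions, not a prescribed implementation.

```python
import time
from dataclasses import dataclass, field

# Hypothetical per-request budget from section 2; the default numbers are illustrative.
@dataclass
class Budget:
    max_tool_calls: int = 5
    max_tokens: int = 8000
    max_seconds: float = 30.0
    started_at: float = field(default_factory=time.monotonic)
    tool_calls: int = 0
    tokens: int = 0

    def charge(self, tool_calls: int = 0, tokens: int = 0) -> bool:
        """Record usage; return False once any budget is exhausted,
        at which point the caller must stop and enter a safe failure state."""
        self.tool_calls += tool_calls
        self.tokens += tokens
        return (self.tool_calls <= self.max_tool_calls
                and self.tokens <= self.max_tokens
                and time.monotonic() - self.started_at <= self.max_seconds)

b = Budget(max_tool_calls=2)
b.charge(tool_calls=1, tokens=300)   # within budget -> True
b.charge(tool_calls=2, tokens=300)   # third tool call exceeds the cap -> False
```

The agent loop calls `charge` before each step, so a runaway chain of tool calls trips the cap instead of burning cost.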
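The permissioned-retrieval and retrieval-logging bullets in section 3 can be sketched as follows; the corpus, role names, and log fields are hypothetical, and ranking by the query is elided to keep the permission check visible.

```python
import time
import uuid

# Hypothetical in-memory corpus; each chunk carries a document-level ACL (set of roles).
CHUNKS = [
    {"doc_id": "policy-7", "chunk_id": "policy-7#2", "acl": {"support", "admin"},
     "text": "Refund window is 30 days."},
    {"doc_id": "fin-1", "chunk_id": "fin-1#0", "acl": {"finance"},
     "text": "Q3 revenue draft."},
]

RETRIEVAL_LOG = []

def retrieve(query: str, user_id: str, user_roles: set):
    """Filter by ACL BEFORE the model sees context, and log every retrieval."""
    visible = [c for c in CHUNKS if c["acl"] & user_roles]  # permission check first
    # (real relevance ranking by `query` elided; this sketch returns all permitted chunks)
    RETRIEVAL_LOG.append({
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "doc_ids": [c["doc_id"] for c in visible],
        "chunk_ids": [c["chunk_id"] for c in visible],
        "ts": time.time(),
    })
    return visible

ctx = retrieve("refund window", user_id="agent-42", user_roles={"support"})
```

Because filtering happens before context assembly, a prompt-injected request cannot surface documents the caller was never entitled to see, and the log answers “what did the system retrieve?” after the fact.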
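The “validate before execution” and deny-by-default allowlist items in section 4 can be sketched as follows; the intents, tool names, and field specs are hypothetical, and a production system would use a full JSON Schema validator (e.g., the `jsonschema` library) rather than this minimal type check.

```python
import json

# Hypothetical per-intent allowlist: deny by default -- any tool not listed is rejected.
TOOL_ALLOWLIST = {
    "refund_handling": {"lookup_order", "issue_refund"},
    "support_resolution": {"lookup_order", "search_kb"},
}

# Minimal required-field/type spec per tool (a stand-in for a full JSON Schema).
TOOL_SCHEMAS = {
    "issue_refund": {"order_id": str, "amount_cents": int},
    "lookup_order": {"order_id": str},
    "search_kb": {"query": str},
}

def validate_action(intent: str, raw_output: str):
    """Parse a model's structured output and validate it before any side effect.
    Returns (action, None) on success or (None, reason) as a safe failure state."""
    try:
        action = json.loads(raw_output)
    except json.JSONDecodeError:
        return None, "unparseable output"  # never execute what you could not parse
    tool = action.get("tool")
    if tool not in TOOL_ALLOWLIST.get(intent, set()):
        return None, f"tool {tool!r} not allowed for intent {intent!r}"
    args = action.get("args", {})
    for name, ftype in TOOL_SCHEMAS[tool].items():
        if not isinstance(args.get(name), ftype):
            return None, f"bad or missing field {name!r}"
    return action, None

action, err = validate_action(
    "refund_handling",
    '{"tool": "issue_refund", "args": {"order_id": "A1", "amount_cents": 500}}',
)
```

A rejection here should route to the escalation path from section 4 rather than being retried blindly; for HIGH-risk tools, the verifier step runs after this check and before execution.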