Deterministic AI Wrapper Checklist (Production-Ready)

Use this checklist to turn an LLM integration into a system you can operate.

1) Define the blast radius
- List every possible external side effect: emails, tickets, CRM writes, database updates, refunds, permission changes, code pushes.
- For each side effect, define “allowed,” “allowed with approval,” and “never.” Write it down.
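
One way to "write it down" is as data the wrapper can enforce. The sketch below is only an illustration: the side-effect names, the `Policy` enum, and the `decide` helper are assumptions, not part of any particular framework. Unknown side effects default to "never".

```python
from enum import Enum


class Policy(Enum):
    ALLOWED = "allowed"
    NEEDS_APPROVAL = "allowed_with_approval"
    NEVER = "never"


# Hypothetical side-effect names; replace with your own inventory.
SIDE_EFFECT_POLICY = {
    "send_email": Policy.ALLOWED,
    "create_ticket": Policy.ALLOWED,
    "crm_write": Policy.NEEDS_APPROVAL,
    "issue_refund": Policy.NEEDS_APPROVAL,
    "change_permissions": Policy.NEVER,
    "push_code": Policy.NEVER,
}


def decide(side_effect: str, approved: bool = False) -> bool:
    """Return True only if the side effect may run right now."""
    # Anything not listed is treated as NEVER (deny by default).
    policy = SIDE_EFFECT_POLICY.get(side_effect, Policy.NEVER)
    if policy is Policy.ALLOWED:
        return True
    if policy is Policy.NEEDS_APPROVAL:
        return approved
    return False
```

Keeping the table in one place means a reviewer can audit the blast radius without reading prompt text.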

2) Put a contract between the model and the system
- Require structured output (schema-based JSON or tool calls).
- Validate every field the model produces before it reaches the system: enums, max lengths, required fields, and unknown-field rejection.
- Reject on schema failure and route to a safe fallback (human review or no-op).
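
A minimal sketch of such a contract, using only the standard library. The field names, allowed values, and the `validate`/`handle` helpers are hypothetical; a schema library would do the same job with less code.

```python
import json
from typing import Optional

# Hypothetical contract for one endpoint.
ALLOWED_ACTIONS = {"create_ticket", "send_email", "no_op"}
ALLOWED_PRIORITIES = {"low", "medium", "high"}
REQUIRED_FIELDS = {"action", "summary"}
OPTIONAL_FIELDS = {"priority"}
MAX_SUMMARY_LEN = 200


def validate(raw: str) -> Optional[dict]:
    """Return a normalized dict, or None if the output breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    keys = set(data)
    if not REQUIRED_FIELDS <= keys:
        return None  # missing required field
    if keys - REQUIRED_FIELDS - OPTIONAL_FIELDS:
        return None  # unknown field: reject rather than ignore
    action, summary = data["action"], data["summary"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    if not isinstance(summary, str) or not 0 < len(summary) <= MAX_SUMMARY_LEN:
        return None
    priority = data.get("priority", "low")
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
        return None
    return {"action": action, "summary": summary, "priority": priority}


def handle(raw: str) -> dict:
    """Route schema failures to a safe no-op instead of guessing."""
    parsed = validate(raw)
    if parsed is None:
        return {"action": "no_op", "reason": "schema_failure"}
    return parsed
```

The fallback here is a no-op; routing to human review would replace the body of the `parsed is None` branch.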

3) Build deterministic executors for tools
- Each tool wrapper enforces: authentication, tenant boundaries, least privilege, rate limits.
- Add idempotency keys so retries don’t duplicate side effects.
- Make tools return typed results and error codes (not prose).
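
The sketch below shows one possible shape for a tool wrapper with a tenant check, an idempotency key, and typed results. `RefundTool`, `ToolResult`, and `ErrorCode` are illustrative names, the in-memory dict stands in for a durable store, and authentication and rate limiting are left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class ErrorCode(Enum):
    FORBIDDEN_TENANT = "forbidden_tenant"
    INVALID_AMOUNT = "invalid_amount"


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    error: Optional[ErrorCode] = None
    refund_id: Optional[str] = None


class RefundTool:
    """Hypothetical refund executor; the side effect is simulated."""

    def __init__(self, max_amount_cents: int) -> None:
        self.max_amount_cents = max_amount_cents
        self._results: Dict[str, ToolResult] = {}  # assumption: durable in production
        self.executions = 0  # counts real side effects

    def execute(self, *, idempotency_key: str, caller_tenant: str,
                order_tenant: str, amount_cents: int) -> ToolResult:
        # A retry with the same key returns the stored result, no new side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        if caller_tenant != order_tenant:
            result = ToolResult(False, ErrorCode.FORBIDDEN_TENANT)
        elif not 0 < amount_cents <= self.max_amount_cents:
            result = ToolResult(False, ErrorCode.INVALID_AMOUNT)
        else:
            self.executions += 1
            result = ToolResult(True, refund_id=f"refund-{self.executions}")
        # This sketch also caches failures, so a retry sees the same error code.
        self._results[idempotency_key] = result
        return result
```

Because the result is a dataclass with an enum error code, the caller can branch on it without parsing prose.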

4) Treat retrieval (RAG) like a dependency
- Log ingestion versions, embedding model/version, and chunking strategy.
- Log retrieval traces: query, filters, top-k doc IDs, and scores.
- Require citations and store the cited spans; for high-stakes endpoints, block answers that make claims the retrieved sources do not support.
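
As a rough sketch, a retrieval trace can be a plain record, and a first-pass citation check can confirm that each quoted span appears verbatim in a retrieved document. The names below are assumptions, and verbatim matching is only a partial groundedness check: it catches fabricated citations, not unsupported claims made without one.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RetrievalTrace:
    query: str
    filters: Dict[str, str]
    doc_ids: List[str]       # top-k document IDs, in rank order
    scores: List[float]      # similarity scores aligned with doc_ids
    embedding_model: str     # model/version used to embed the query
    index_version: str       # ingestion version of the index


@dataclass
class Citation:
    doc_id: str
    quoted_span: str


def citations_grounded(citations: List[Citation], retrieved: Dict[str, str]) -> bool:
    """True only if there is at least one citation and every quoted span
    appears verbatim in the retrieved document it points to."""
    if not citations:
        return False
    return all(
        c.doc_id in retrieved and c.quoted_span in retrieved[c.doc_id]
        for c in citations
    )
```

A high-stakes endpoint could refuse or escalate whenever `citations_grounded` returns False.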

5) Evals are a shipping gate, not a dashboard
- Maintain a small, representative test set of real tasks and edge cases.
- Run evals on every change: prompt, model, retrieval params, embedding model, tool wrappers.
- Track regressions by category: schema validity, tool misuse attempts, groundedness, refusal correctness.
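
A gate of this kind can be quite small. The following sketch assumes eval results arrive as `(category, passed)` pairs and that a stored baseline of per-category pass rates exists; both the data shape and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (category, passed) pairs into a pass rate per category."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.0) -> List[str]:
    """Return categories where the candidate falls below the baseline.

    A category missing from the candidate counts as a pass rate of 0.0.
    """
    return sorted(
        c for c, base in baseline.items()
        if candidate.get(c, 0.0) < base - tolerance
    )
```

In CI, a non-empty list from `regressions` would fail the build, which is what makes the evals a gate.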

6) Observability and replay
- Log: user/tenant, prompt (or prompt template + variables), retrieved docs, tool calls, model identifier, and final action.
- Make it replayable: given the logged inputs, you should be able to re-run a failed request and reproduce the decision the system made.
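
One possible shape for a replayable record is sketched below. The field names and the `decide_action` function are hypothetical; the idea is that the record stores the model's tool calls, so the deterministic part of the wrapper can be re-run without calling the model again.

```python
import hashlib
import json
from typing import Callable

# Fields treated as inputs to the wrapper's decision (assumed names).
INPUT_FIELDS = ("tenant", "prompt_template", "variables",
                "retrieved_doc_ids", "model_id", "tool_calls")


def input_fingerprint(record: dict) -> str:
    """Stable hash of the logged inputs, independent of key order."""
    inputs = {k: record[k] for k in INPUT_FIELDS}
    blob = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def replay(record: dict, decide: Callable[[dict], str]) -> bool:
    """Re-run the deterministic decision and compare with what was logged."""
    return decide(record) == record["final_action"]


def decide_action(record: dict) -> str:
    """Hypothetical wrapper logic: act on the first proposed tool call."""
    calls = record["tool_calls"]
    return calls[0]["name"] if calls else "no_op"
```

If `replay` returns False for a logged record, the wrapper's behavior has changed since the incident, which is itself worth knowing.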

7) Fallbacks and escalation
- Define when the system refuses, asks clarifying questions, or escalates to a human.
- Add circuit breakers: disable specific tools or endpoints quickly without redeploying.
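
A circuit breaker that works without redeploying needs its state to live outside the code. In the sketch below, `read_disabled` is an assumed callable that stands in for a flag service, config store, or database row; `KillSwitch` and `call_tool` are illustrative names.

```python
from typing import Any, Callable, Set


class KillSwitch:
    """Checks an external flag source on every call, so a flag flip
    takes effect without a redeploy."""

    def __init__(self, read_disabled: Callable[[], Set[str]]) -> None:
        self._read_disabled = read_disabled

    def is_enabled(self, tool_name: str) -> bool:
        return tool_name not in self._read_disabled()


def call_tool(switch: KillSwitch, tool_name: str,
              fn: Callable[..., Any], *args: Any) -> dict:
    """Run the tool only if it is enabled; otherwise return an error code."""
    if not switch.is_enabled(tool_name):
        return {"ok": False, "error_code": "tool_disabled"}
    return {"ok": True, "value": fn(*args)}
```

Reading the flag on every call trades a little latency for immediacy; a short cache would be a reasonable variation.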

8) Model/provider changes
- Pin model versions where the provider allows it (a dated or versioned identifier rather than a floating alias).
- Assume drift anyway: schedule periodic eval runs against production-like traffic samples.
- Design so model swaps don’t change tool contracts.
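
To keep tool contracts stable across model swaps, one option is a thin adapter per provider that normalizes raw output into a single internal type. The two raw response shapes below are invented for illustration and do not describe any real provider's API.

```python
import json
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ToolCall:
    """The internal contract; tool executors only ever see this type."""
    name: str
    arguments: dict


class ModelAdapter(Protocol):
    model_id: str

    def to_tool_call(self, raw: dict) -> ToolCall: ...


class ProviderAAdapter:
    model_id = "provider-a-model-v1"  # hypothetical pinned identifier

    def to_tool_call(self, raw: dict) -> ToolCall:
        # Assumed shape: {"tool": str, "args": dict}
        return ToolCall(raw["tool"], dict(raw["args"]))


class ProviderBAdapter:
    model_id = "provider-b-model-v2"  # hypothetical pinned identifier

    def to_tool_call(self, raw: dict) -> ToolCall:
        # Assumed shape: {"function": {"name": str, "arguments": "<json string>"}}
        fn = raw["function"]
        return ToolCall(fn["name"], json.loads(fn["arguments"]))
```

Swapping providers then means writing a new adapter and re-running the evals, while validators and executors stay untouched.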

If you can’t answer “what did the model see, what did it call, and why did the system allow it,” you don’t have an AI product; you have an incident waiting for a calendar invite.