Production Inference Readiness Checklist (Token P&L + Reliability)

Use this to pressure-test one LLM workflow before it becomes a cost and reliability problem.

1) Define the workflow contract
- What is the user goal and the system’s “done” condition?
- What outputs must be machine-consumable (JSON, fields, enums) vs human-readable?
- What actions are reversible vs irreversible?

2) Build a Token P&L (per successful outcome)
- Log input tokens, retrieved context tokens, output tokens, and retries.
- Separate “happy path” from tail cases (very long inputs, many tool calls).
- Identify the top 2 token drivers (long prompts, repeated context, verbose outputs).
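
The per-outcome accounting above can be sketched as follows. This is a minimal illustration, not a billing implementation: the Attempt record, the function names, and the per-1,000-token prices are all assumptions made for the example, and it models a retry as one more attempt whose tokens are still paid for.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_IN_PER_1K = 0.50
PRICE_OUT_PER_1K = 1.50


@dataclass
class Attempt:
    """One model call. A retry is simply another Attempt for the same request."""
    input_tokens: int     # prompt plus user input
    context_tokens: int   # retrieved context added to the prompt
    output_tokens: int
    succeeded: bool       # True if this attempt produced the accepted outcome


def attempt_cost(a: Attempt) -> float:
    """Cost of a single call; retrieved context is billed as input."""
    billed_in = a.input_tokens + a.context_tokens
    return billed_in / 1000 * PRICE_IN_PER_1K + a.output_tokens / 1000 * PRICE_OUT_PER_1K


def cost_per_success(attempts: list[Attempt]) -> float:
    """Total spend, failed attempts included, divided by successful outcomes."""
    successes = sum(a.succeeded for a in attempts)
    total = sum(attempt_cost(a) for a in attempts)
    return total / successes if successes else float("inf")


def top_token_drivers(attempts: list[Attempt], n: int = 2) -> list[str]:
    """Rank token categories by total volume, largest first."""
    totals = {
        "input": sum(a.input_tokens for a in attempts),
        "context": sum(a.context_tokens for a in attempts),
        "output": sum(a.output_tokens for a in attempts),
    }
    return sorted(totals, key=totals.get, reverse=True)[:n]
```

Dividing by successful outcomes rather than by calls is the point of the exercise: a workflow that retries often can look cheap per call and still be expensive per result.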

3) Put hard limits in writing
- Max input size and what happens when users exceed it (summarize, chunk, reject).
- Max output tokens per call; set the default low and raise it only for workflows that need longer outputs.
- Max agent steps/tool calls per request.
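
One way to make these limits executable rather than aspirational is to keep them in a single config object and check them in code. The sketch below assumes hypothetical names (Limits, check_input, run_agent) and placeholder numbers; the real values depend on your workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # All values are illustrative placeholders, not recommendations.
    max_input_tokens: int = 8000
    max_output_tokens: int = 512
    max_agent_steps: int = 6
    oversize_policy: str = "reject"  # one of "summarize", "chunk", "reject"


class LimitExceeded(Exception):
    """Raised when a request runs past a hard limit."""


def check_input(token_count: int, limits: Limits) -> str:
    """Return "ok", or the configured action for oversized input."""
    if token_count <= limits.max_input_tokens:
        return "ok"
    return limits.oversize_policy


def run_agent(step_fn, limits: Limits) -> int:
    """Call step_fn(step) until it returns True; stop at the step budget.

    Returns the number of steps used, or raises LimitExceeded.
    """
    for step in range(1, limits.max_agent_steps + 1):
        if step_fn(step):
            return step
    raise LimitExceeded(f"agent did not finish within {limits.max_agent_steps} steps")
```

Raising an explicit error when the step budget runs out keeps a runaway loop from turning into a silent cost spike.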

4) Routing plan (don’t start with a single model)
- Define at least two tiers: a cheaper default model and a premium escalation model.
- Decide the escalation trigger: uncertainty signal, validation failure, or user-visible risk.
- Add a “no-model” fast path for cases you can solve deterministically.
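
The tiering described above might look like the sketch below. Every callable passed in (fast_path, cheap_model, premium_model, is_valid) is a stand-in you would supply yourself, and validation failure is used as the escalation trigger purely for illustration.

```python
def route(request, fast_path, cheap_model, premium_model, is_valid, high_risk=False):
    """Return (tier, answer) for one request.

    fast_path(request) returns an answer or None (deterministic, no model call).
    cheap_model / premium_model take the request and return an answer.
    is_valid(answer) is the check whose failure triggers escalation.
    """
    # 1. No-model fast path: skip inference entirely when a rule can answer.
    answer = fast_path(request)
    if answer is not None:
        return "no-model", answer

    # 2. Cheap default tier, unless the request is flagged as user-visible risk.
    if not high_risk:
        answer = cheap_model(request)
        if is_valid(answer):
            return "cheap", answer

    # 3. Premium escalation tier.
    return "premium", premium_model(request)
```

Returning the tier alongside the answer makes it easy to log how often each path is taken, which feeds directly back into the Token P&L.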

5) Retrieval discipline (if you use RAG)
- Track retrieval hit rate and “citation coverage” (did the answer cite retrieved sources?).
- Prevent prompt stuffing: cap context tokens and pass only the top-k ranked chunks instead of everything retrieved.
- Add a fallback when retrieval is empty or low confidence.
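
A rough sketch of context capping and one possible definition of citation coverage follows. The tuple layout, the default thresholds, and the choice to count an answer as covered when it cites at least one retrieved source are all assumptions for the example.

```python
def select_context(chunks, k=4, max_tokens=1500, min_score=0.3):
    """Pick at most k chunks, best score first, within a token budget.

    chunks: list of (score, token_count, text) tuples.
    Returns a list of texts; an empty list tells the caller to use the
    fallback path (retrieval was empty or low confidence).
    """
    confident = [c for c in chunks if c[0] >= min_score]
    ranked = sorted(confident, key=lambda c: c[0], reverse=True)
    picked, used = [], 0
    for score, tokens, text in ranked[:k]:
        if used + tokens > max_tokens:
            continue  # does not fit; a smaller, lower-ranked chunk still might
        picked.append(text)
        used += tokens
    return picked


def citation_coverage(answers):
    """Fraction of answers that cite at least one retrieved source.

    answers: list of (cited_ids, retrieved_ids) pairs of sets.
    """
    if not answers:
        return 0.0
    covered = sum(1 for cited, retrieved in answers if cited & retrieved)
    return covered / len(answers)
```

An empty result from select_context is a signal, not an error: the caller can decline to answer, ask a clarifying question, or answer without sources and say so.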

6) Output control and validation
- Require structured outputs for system actions (schemas validated server-side).
- Measure schema violation rate and handle violations with a bounded retry policy (a fixed retry count, then a fallback or rejection).
- Don’t let free-form text directly trigger irreversible actions.
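
Server-side validation with a retry cap could be sketched like this, using only the standard library. The action enum, the amount_cents field, and the generate callable are hypothetical; a real system would more likely validate against a full schema.

```python
import json

ALLOWED_ACTIONS = {"refund", "escalate", "none"}  # hypothetical enum


def parse_action(raw: str) -> dict:
    """Return a validated action dict, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("action") not in ALLOWED_ACTIONS:
        raise ValueError("unknown action")
    amount = data.get("amount_cents")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool) or amount < 0:
        raise ValueError("amount_cents must be a non-negative integer")
    return {"action": data["action"], "amount_cents": amount}


def call_with_validation(generate, max_retries=1):
    """Call generate() until its output validates or retries run out.

    Returns (action_or_None, violation_count). None means no action is taken.
    """
    violations = 0
    for _ in range(max_retries + 1):
        try:
            return parse_action(generate()), violations
        except ValueError:
            violations += 1
    return None, violations
```

The violation count returned here is what you would aggregate into the schema violation rate. When the result is None, nothing downstream runs, so free-form text never reaches an irreversible action.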

7) Reliability engineering
- Server-side timeouts, circuit breakers, and fallbacks (including a degraded UX).
- Rate limits per tenant/user and backpressure behavior.
- Monitor p50/p95 latency, timeout rate, and queue depth.
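
As an illustration of the circuit breaker and fallback items above, here is a deliberately small sketch. The class, its thresholds, and the call_with_fallback helper are assumptions for the example; a production breaker would also need thread safety and per-dependency state.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()


def call_with_fallback(breaker, primary, fallback):
    """Use primary() while the circuit allows it; otherwise serve the fallback."""
    if not breaker.allow():
        return fallback()
    try:
        result = primary()
    except Exception:
        breaker.record(False)
        return fallback()
    breaker.record(True)
    return result
```

The fallback is where the degraded UX lives: a cached answer, a simpler deterministic response, or an honest "try again shortly" message.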

8) Governance and data handling
- Decide what to store: prompts, model outputs, tool arguments; default to redaction.
- Set retention and access controls; audit who can view traces.
- Ensure tenant isolation and prevent cross-tenant data mixing in retrieval.
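
Redaction before storage and tenant filtering in retrieval can be sketched as below. The two regular expressions are illustrative only and are nowhere near complete PII detection, and the tenant_id field on each chunk is an assumed data layout.

```python
import re

# Illustrative patterns only; real redaction needs a broader, reviewed set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{9,}\b")  # account-number-like digit runs


def redact(text: str) -> str:
    """Mask emails and long digit runs before a trace is stored."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def tenant_chunks(chunks, tenant_id):
    """Keep only chunks owned by the requesting tenant.

    chunks: list of dicts with at least a "tenant_id" key (assumed layout).
    """
    return [c for c in chunks if c["tenant_id"] == tenant_id]
```

Filtering in application code is a last line of defense; where the retrieval store supports it, applying the tenant filter inside the query itself is the safer default.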

9) Evals and change control
- Maintain a small regression set of real requests (redacted) and expected outputs.
- Run evals on prompt changes, tool changes, and model/provider changes.
- Document a rollback path for model versions and prompt versions.
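
A regression gate along these lines might be as simple as the following sketch. The case format, exact-match comparison, and 0.95 pass-rate threshold are assumptions; many workflows need a looser comparison than string equality.

```python
def run_regression(cases, system, min_pass_rate=0.95):
    """Run every case through `system` and decide whether a change may ship.

    cases: list of dicts with "id", "input", and "expected" keys.
    system: callable mapping an input to an output (the workflow under test).
    """
    failures = [c["id"] for c in cases if system(c["input"]) != c["expected"]]
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return {
        "pass_rate": pass_rate,
        "failures": failures,            # ids to inspect before rollback or fix
        "ship": pass_rate >= min_pass_rate,
    }
```

Running the same function against the previous prompt or model version gives a baseline, which is what makes a rollback decision defensible.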

10) Ship a runbook
- Top failure modes and what to do (provider outage, latency spike, tool failures).
- Escalation contacts and dashboards.
- Clear toggles: disable agent mode, force cheap model, reduce context, enable stricter caps.
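
The toggles in the last bullet are easiest to trust when they are plain data that every request consults. The sketch below uses hypothetical names and placeholder limits to show the shape.

```python
from dataclasses import dataclass


@dataclass
class Toggles:
    # Operator-controlled switches; defaults describe normal operation.
    agent_mode_enabled: bool = True
    force_cheap_model: bool = False
    context_token_cap: int = 4000
    strict_caps: bool = False


def effective_config(requested_model: str, toggles: Toggles) -> dict:
    """Resolve the settings one request actually runs with."""
    if toggles.agent_mode_enabled:
        max_steps = 3 if toggles.strict_caps else 6   # placeholder budgets
    else:
        max_steps = 0                                  # single call, no tool loop
    return {
        "model": "cheap" if toggles.force_cheap_model else requested_model,
        "max_agent_steps": max_steps,
        "context_token_cap": toggles.context_token_cap,
        "max_output_tokens": 256 if toggles.strict_caps else 512,
    }
```

During an incident, an operator changes the Toggles values and every subsequent request picks up the safer configuration without a deploy.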

If you can’t fill this out in one sitting, the system isn’t ready to scale usage.