Production Inference Readiness Checklist (Token P&L + Reliability)

Use this to pressure-test one LLM workflow before it becomes a cost and reliability problem.

1) Define the workflow contract
- What is the user goal and the system’s “done” condition?
- What outputs must be machine-consumable (JSON, fields, enums) vs human-readable?
- What actions are reversible vs irreversible?

2) Build a Token P&L (per successful outcome)
- Log input tokens, retrieved context tokens, output tokens, and retries.
- Separate “happy path” from tail cases (very long inputs, many tool calls).
- Identify the top 2 token drivers (long prompts, repeated context, verbose outputs).

3) Put hard limits in writing
- Max input size and what happens when users exceed it (summarize, chunk, reject).
- Max output tokens per call; default to smaller outputs.
- Max agent steps/tool calls per request.

4) Routing plan (don’t start with a single model)
- Define at least two tiers: a cheaper default model and a premium escalation model.
- Decide the escalation trigger: uncertainty signal, validation failure, or user-visible risk.
- Add a “no-model” fast path for cases you can solve deterministically.

5) Retrieval discipline (if you use RAG)
- Track retrieval hit rate and “citation coverage” (did the answer cite retrieved sources?).
- Prevent prompt stuffing: cap context and prefer top-k chunks.
- Add a fallback when retrieval is empty or low confidence.

6) Output control and validation
- Require structured outputs for system actions (schemas validated server-side).
- Measure schema violation rate and handle it with a strict retry policy.
- Don’t let free-form text directly trigger irreversible actions.

7) Reliability engineering
- Server-side timeouts, circuit breakers, and fallbacks (including a degraded UX).
- Rate limits per tenant/user and backpressure behavior.
- Observe p50/p95 latency, timeout rate, and queue depth.
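
To make items 2, 4, and 6 concrete, here is a minimal sketch of how they might fit together: a request tries the cheaper tier first, validates the output server-side, escalates to the premium tier on validation failure, and records tokens per attempt so cost can be reported per successful outcome. Everything named here is an assumption for illustration: the tier names, the prices in `PRICES`, the `action` field and its allowed values, and the model callables (stand-ins for whatever provider client you use, each returning the raw text and the total tokens billed).

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider.
PRICES = {"cheap": 0.5, "premium": 5.0}
# Hypothetical enum for a machine-consumable field.
ALLOWED_ACTIONS = {"refund", "escalate", "none"}


@dataclass
class Attempt:
    tier: str
    tokens: int   # input + retrieved context + output tokens for this call
    valid: bool   # did the output pass server-side validation?


@dataclass
class Outcome:
    result: Optional[dict]
    attempts: list = field(default_factory=list)

    @property
    def cost(self) -> float:
        # Failed attempts still cost money, so every attempt is counted.
        return sum(a.tokens / 1000 * PRICES[a.tier] for a in self.attempts)


def validate(raw: str) -> Optional[dict]:
    """Return the parsed object only if it is JSON with an allowed 'action'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action = data.get("action")
    if isinstance(action, str) and action in ALLOWED_ACTIONS:
        return data
    return None


# A model callable takes a prompt and returns (raw_text, tokens_used).
ModelFn = Callable[[str], tuple]


def run(prompt: str, models: dict, tiers=("cheap", "premium")) -> Outcome:
    """Try each tier in order; escalate only when validation fails."""
    outcome = Outcome(result=None)
    for tier in tiers:
        raw, tokens = models[tier](prompt)
        parsed = validate(raw)
        outcome.attempts.append(Attempt(tier, tokens, parsed is not None))
        if parsed is not None:
            outcome.result = parsed
            break
    return outcome


def cost_per_successful_outcome(outcomes) -> Optional[float]:
    """Total spend (including failures) divided by validated outcomes."""
    successes = sum(1 for o in outcomes if o.result is not None)
    if successes == 0:
        return None
    return sum(o.cost for o in outcomes) / successes
```

The length of `tiers` doubles as the strict retry cap: one attempt per tier, then the request fails closed with `result=None` rather than passing unvalidated text to an action. Dividing total spend by validated outcomes, not by calls, is what makes retries and escalations visible in the Token P&L.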

8) Governance and data handling
- Decide what to store: prompts, model outputs, tool arguments; default to redaction.
- Set retention and access controls; audit who can view traces.
- Ensure tenant isolation and prevent cross-tenant data mixing in retrieval.

9) Evals and change control
- Maintain a small regression set of real requests (redacted) and expected outputs.
- Run evals on prompt changes, tool changes, and model/provider changes.
- Document a rollback path for model versions and prompt versions.

10) Ship a runbook
- Top failure modes and what to do (provider outage, latency spike, tool failures).
- Escalation contacts and dashboards.
- Clear toggles: disable agent mode, force cheap model, reduce context, enable stricter caps.

If you can’t fill this out in one sitting, the system isn’t ready to scale usage.
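
The runbook toggles in item 10 are easiest to trust when they are plain configuration applied in one place. The sketch below shows one possible shape, not a prescribed design: the field names, default limits, and the specific reductions (halving context, capping output at 256 tokens and tool calls at 2) are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RequestPlan:
    # Hypothetical defaults; set these from your own written limits (item 3).
    model_tier: str = "premium"
    agent_mode: bool = True
    max_context_tokens: int = 8000
    max_output_tokens: int = 1024
    max_tool_calls: int = 8


@dataclass(frozen=True)
class Toggles:
    # One flag per runbook toggle; all off in normal operation.
    disable_agent_mode: bool = False
    force_cheap_model: bool = False
    reduce_context: bool = False
    stricter_caps: bool = False


def apply_toggles(plan: RequestPlan, toggles: Toggles) -> RequestPlan:
    """Return a new plan with every active toggle applied; never loosens a limit."""
    if toggles.disable_agent_mode:
        plan = replace(plan, agent_mode=False, max_tool_calls=0)
    if toggles.force_cheap_model:
        plan = replace(plan, model_tier="cheap")
    if toggles.reduce_context:
        plan = replace(plan, max_context_tokens=plan.max_context_tokens // 2)
    if toggles.stricter_caps:
        plan = replace(
            plan,
            max_output_tokens=min(plan.max_output_tokens, 256),
            max_tool_calls=min(plan.max_tool_calls, 2),
        )
    return plan
```

For example, during a provider latency spike an on-call engineer might set `Toggles(force_cheap_model=True, reduce_context=True)`; every request planned afterward would use the cheaper tier with half the context, with no code deploy required.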