LLM PRODUCTION READINESS FRAMEWORK (2026)

Use this framework to take an LLM feature from prototype to production with predictable cost, auditable behavior, and reliable operations.

1) DEFINE THE WORKFLOW CONTRACT (1 page)
- Primary user + job-to-be-done:
- Allowed inputs (channels, file types, languages):
- Allowed outputs (free text vs JSON schema):
- Disallowed domains (legal advice, medical, hiring decisions, etc.):
- “Success” definition (e.g., ticket resolved, draft accepted, refund approved):
- Human-in-the-loop points (what requires approval):
- Failure modes that are acceptable vs unacceptable:

2) DATA + RETRIEVAL BOUNDARIES
- Data sources inventory (KB, CRM, docs, tickets, web):
- Role-based access rules for retrieval (who can see what):
- Tenant isolation rules (vector store + logs):
- Provenance requirements (store doc IDs, timestamps, snippet ranges):
- Redaction rules (PII, secrets, credentials) before logging:

3) MODEL STRATEGY + ROUTING
- Default small/fast model for routine tasks:
- Escalation model for complex/high-risk tasks:
- Routing signals (intent classifier, risk score, user tier, uncertainty):
- Hard budgets (max tokens, max calls, max $ per workflow completion):
- Caching strategy (what can be cached safely, TTL, invalidation):

4) POLICY + GUARDRAILS
- Prompt injection threat model (assume retrieved content is hostile):
- Tool allowlist + strict schemas (typed inputs/outputs):
- Tool permissions (least privilege; per-user and per-workflow):
- Tool-call gate checks (amount limits, role checks, approvals):
- Refusal behavior and safe completion policy:

5) EVALUATIONS (OFFLINE) + RELEASE GATES
- Golden datasets by workflow (minimum 100–500 examples to start):
- Metrics to track: groundedness, refusal correctness, tool accuracy, latency, cost per success
- Regression tests on every change (prompt, retrieval, routing, tool schema):
- Release gates (e.g., no critical regressions; cost within +10%):

6) OBSERVABILITY (ONLINE) + INCIDENT RESPONSE
- Logging requirements: prompt version, model route, retrieval doc IDs, tool calls, outputs (redacted)
- Dashboards: P50/P95 latency, error rate, cost per success, policy violation counts
- Canary rollout plan (start 1–5% traffic) + rollback procedure
- Incident playbook: how to reproduce a trace, how to disable tools, how to pin model versions
- Postmortem template (root cause, blast radius, corrective actions, eval additions)

7) BUSINESS READINESS
- Unit economics: target $ per completed workflow and monthly budget
- ROI metric: time saved, deflection rate, conversion lift, or error reduction
- Customer-facing documentation: limitations, safety boundaries, escalation options
- Compliance artifacts: audit logs retention policy, data residency, training/data-use terms

FINAL SHIP CHECK (YES/NO)
- Can we reconstruct “why” for any output (trace completeness)?
- Do we have a deterministic fallback if the model fails?
- Are tool calls gated and least-privilege?
- Are costs bounded per workflow (not just per request)?
- Do we have regression evals and a canary/rollback path?
- Does security/compliance sign off on logging + retention + data boundaries?
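
EXAMPLE: HARD BUDGETS PER WORKFLOW (section 3)

The "hard budgets" item in section 3 and the ship-check question about costs being bounded per workflow could be enforced along the following lines. This is a minimal sketch, not a prescribed implementation: the class name WorkflowBudget, the exception BudgetExceeded, and every default limit are hypothetical placeholders to be replaced with values from your own unit economics.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a workflow would exceed one of its hard limits."""


@dataclass
class WorkflowBudget:
    # All limits below are hypothetical placeholders, not recommendations.
    max_calls: int = 8
    max_tokens: int = 20_000
    max_usd: float = 0.50
    calls: int = 0
    tokens: int = 0
    usd: float = 0.0

    def charge(self, tokens: int, usd: float) -> None:
        """Record one model call; raise before any running total passes its limit."""
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded("max calls per workflow reached")
        if self.tokens + tokens > self.max_tokens:
            raise BudgetExceeded("max tokens per workflow reached")
        if self.usd + usd > self.max_usd:
            raise BudgetExceeded("max $ per workflow reached")
        # Only commit the charge once every limit has been checked.
        self.calls += 1
        self.tokens += tokens
        self.usd += usd
```

One budget object would be created per workflow run and shared by every model call in that run, so that retries and escalations count against the same limits. Catching BudgetExceeded is a natural place to hand off to the deterministic fallback.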
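
EXAMPLE: TOOL-CALL GATE (section 4)

The tool allowlist, role checks, amount limits, and approval points in section 4 can be combined into a single gate that runs before any tool executes. The sketch below assumes a hypothetical support workflow; the tool names (lookup_order, issue_refund), role names, and dollar thresholds are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolPolicy:
    allowed_roles: frozenset
    max_amount: Optional[float] = None      # None means no hard amount limit
    approval_above: Optional[float] = None  # human approval required above this


# Hypothetical allowlist: any tool not listed here is denied.
POLICIES = {
    "lookup_order": ToolPolicy(allowed_roles=frozenset({"agent", "supervisor"})),
    "issue_refund": ToolPolicy(
        allowed_roles=frozenset({"supervisor"}),
        max_amount=500.0,
        approval_above=100.0,
    ),
}


def gate_tool_call(tool: str, role: str, amount: float = 0.0) -> str:
    """Return "allow", "needs_approval", or "deny" for a proposed tool call."""
    policy = POLICIES.get(tool)
    if policy is None:
        return "deny"  # not on the allowlist
    if role not in policy.allowed_roles:
        return "deny"  # role check failed
    if policy.max_amount is not None and amount > policy.max_amount:
        return "deny"  # over the hard amount limit
    if policy.approval_above is not None and amount > policy.approval_above:
        return "needs_approval"  # route to the human-in-the-loop step
    return "allow"
```

The gate is deliberately plain code rather than a prompt instruction, so a prompt injection in retrieved content cannot talk its way past it.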
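
EXAMPLE: RELEASE GATE CHECK (section 5)

The example release gate in section 5 ("no critical regressions; cost within +10%") can be expressed as a small check over offline eval results. This is a sketch under assumptions: the dictionary keys critical_regressions and cost_per_success are hypothetical names for whatever your eval harness reports.

```python
def release_gate(baseline: dict, candidate: dict, max_cost_increase: float = 0.10) -> list:
    """Return a list of gate failures; an empty list means the release may proceed."""
    failures = []
    # Gate 1: any critical regression blocks the release.
    if candidate["critical_regressions"] > 0:
        failures.append("critical regressions found")
    # Gate 2: cost per success may rise by at most max_cost_increase (10% by default).
    cost_limit = baseline["cost_per_success"] * (1 + max_cost_increase)
    if candidate["cost_per_success"] > cost_limit:
        failures.append("cost per success above limit")
    return failures
```

Returning the list of failures, rather than a single boolean, keeps the reason for a blocked release visible in CI output.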
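
EXAMPLE: REDACTED TRACE RECORD (section 6)

The logging requirements in section 6 (prompt version, model route, retrieval doc IDs, tool calls, redacted outputs) map naturally onto one structured record per request. The sketch below is illustrative only: the field names and the TraceRecord class are assumptions, and the email-only redactor stands in for the fuller PII, secrets, and credentials redaction that section 2 calls for.

```python
import json
import re
from dataclasses import dataclass, field, asdict

# Deliberately simple email pattern, used here only to demonstrate redaction.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace email addresses; a real redactor would cover more PII and secrets."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


@dataclass
class TraceRecord:
    trace_id: str
    prompt_version: str
    model_route: str
    retrieval_doc_ids: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    output: str = ""

    def to_log_line(self) -> str:
        """Serialize to one JSON line, redacting the output before it is logged."""
        record = asdict(self)
        record["output"] = redact(record["output"])
        return json.dumps(record, sort_keys=True)
```

A record like this is what makes the first ship-check question answerable: given a trace ID, the prompt version, route, retrieved documents, and tool calls behind an output can all be recovered.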