PRODUCTION LLM FEATURE READINESS CHECKLIST (RAG + GUARDRAILS + EVALS)

Use this before you scale an LLM feature beyond a small pilot. The goal is not “best prompts.” The goal is a system you can operate.

1) REQUEST TRACEABILITY (INCIDENT REPLAY)
- Every request has a request_id.
- You log: model/provider, model version (if exposed), temperature/top_p, system prompt/instructions, user input, retrieved source IDs, tool calls, final output.
- Logs have a retention policy and access controls (who can read prompts/outputs).

2) RETRIEVAL (RAG) DATA PLANE
- Indexing jobs are scheduled and observable (success/fail, last run time).
- Documents have stable IDs and metadata (owner, source system, updated_at).
- Chunking rules are documented (how you treat tables, code blocks, PDFs).
- Retrieval stores provenance: which source IDs were retrieved for each answer.
- You have a defined fallback when retrieval fails (refuse, ask a clarifying question, or hand off).

3) PERMISSIONS AND TENANCY
- Retrieval enforces ACLs at query time based on user identity (not after generation).
- You have tests for cross-tenant leakage (explicit adversarial prompts included).
- Sensitive sources (HR, legal, customer PII repositories) are explicitly included/excluded by policy.

4) TOOL CALLING AND ACTION SAFETY
- Tool allowlist exists per workflow.
- Tool parameters are validated server-side (never trust the model output).
- Timeouts and retries are defined per tool.
- High-risk actions require confirmation or human approval (money movement, deletions, outbound messages).

5) OUTPUT CONSTRAINTS
- Structured outputs use schemas (JSON schema or equivalent).
- You fail closed: if schema validation fails, you retry with stricter constraints or refuse.
- Citations are required for knowledge answers; outputs without sources are treated as failures.

6) EVALUATION AND REGRESSION
- You maintain a golden set of real queries (sanitized) that represent common + worst-case behavior.
- You evaluate both retrieval (did we fetch the right sources?) and generation (did we answer correctly and safely?).
- Evals run in CI for prompt/model/retrieval changes.
- You have explicit pass/fail gates for launch and for ongoing deploys.

7) SECURITY TESTS YOU ACTUALLY RUN
- Prompt injection attempts against tool calling.
- Data exfiltration prompts (ask for secrets, system prompts, or other users’ data).
- Jailbreak-style inputs aimed at policy violations.
- Regression tests for previously observed failures.

8) OPERATIONS
- Cost controls: caps or budgets per tenant/workspace; rate limits; caching where appropriate.
- Provider outage plan: fallback model/provider, degraded mode, or feature flag kill switch.
- Monitoring: latency, error rates, refusal rates, tool call failure rates, retrieval empty-rate.

9) PRODUCT INTEGRITY
- The UX communicates uncertainty: citations, “I don’t know,” and next steps.
- Humans can correct outputs; corrections feed back into eval cases.
- You have a policy for user data: what is stored, for how long, and whether it is used for training (by you or vendors).

If you can’t produce artifacts for each section (logs, test outputs, eval reports, permission tests), you’re not ready to scale usage.