RAG PRODUCTION EVAL STARTER PACK (1 WEEK)

Goal: Ship a minimal, repeatable evaluation loop for a RAG + tool-using AI feature. This is not a research project. Treat it like reliability engineering.

DAY 1 — DEFINE THE CONTRACT
- Choose 10 “golden” user questions that represent real product value (support, policy, onboarding, troubleshooting).
- For each question, write the expected answer shape (bullet list, steps, JSON fields) and the acceptable refusal behavior.
- Decide what “grounded” means for you and write it down: either every non-trivial claim must cite a retrieved chunk (strict), or only the key claims must (lenient).
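
A contract is easier to enforce when it lives in a file rather than in someone's head. The sketch below shows one possible shape for a golden-question record. Everything in it is an assumption for illustration: the `GoldenQuestion` class, its field names, and the sample questions are hypothetical, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

ANSWER_SHAPES = ("bullets", "steps", "json")
GROUNDING_MODES = ("all_claims", "key_claims")

@dataclass(frozen=True)
class GoldenQuestion:
    qid: str
    question: str
    answer_shape: str                         # one of ANSWER_SHAPES
    required_json_fields: Tuple[str, ...] = ()
    refusal_expected: bool = False            # True if the correct behavior is to refuse
    grounding: str = "all_claims"             # one of GROUNDING_MODES

def validate(q: GoldenQuestion) -> List[str]:
    """Return a list of contract problems; an empty list means the entry is usable."""
    problems = []
    if not q.question.strip():
        problems.append("empty question")
    if q.answer_shape not in ANSWER_SHAPES:
        problems.append("unknown answer_shape")
    if q.grounding not in GROUNDING_MODES:
        problems.append("unknown grounding mode")
    if q.answer_shape == "json" and not q.required_json_fields:
        problems.append("json shape needs required_json_fields")
    return problems

# Hypothetical sample entries.
GOLDEN = [
    GoldenQuestion("q1", "How do I reset my password?", "steps"),
    GoldenQuestion("q2", "What is the refund policy?", "json", ("window_days", "conditions")),
]
```

Run `validate` over the whole list before the suite starts so a malformed entry fails loudly instead of silently weakening the eval.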

DAY 2 — INSTRUMENTATION (NO EXCUSES)
- Log for every request:
 - user/tenant ID (or an internal surrogate)
 - retriever config version (chunking, top-k, hybrid weights)
 - retrieved chunk IDs + source document IDs + scores
 - tool calls attempted (name + arguments) and whether they were allowed/denied
 - final answer + citations rendered
- Store traces somewhere queryable (even if it’s just your existing logging stack).
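
The fields listed above fit in one structured record per request. This is a minimal sketch using only the standard library; the class and field names (`Trace`, `RetrievedChunk`, `ToolCall`) and the sample values are assumptions, so map them onto whatever your logging stack already uses.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

@dataclass
class RetrievedChunk:
    chunk_id: str
    doc_id: str
    score: float

@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]
    allowed: bool

@dataclass
class Trace:
    request_id: str
    tenant_id: str                    # or an internal surrogate
    retriever_config_version: str     # covers chunking, top-k, hybrid weights
    chunks: List[RetrievedChunk] = field(default_factory=list)
    tool_calls: List[ToolCall] = field(default_factory=list)
    answer: str = ""
    citations: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # One JSON object per line keeps the trace queryable in most log stacks.
        return json.dumps(asdict(self), sort_keys=True)

# Hypothetical example request.
trace = Trace(
    request_id="req-001",
    tenant_id="tenant-a",
    retriever_config_version="v3",
    chunks=[RetrievedChunk("c1", "doc-7", 0.82)],
    tool_calls=[ToolCall("query_table", {"table": "tickets"}, allowed=True)],
    answer="Reset it in Settings [c1].",
    citations=["c1"],
)
line = trace.to_json()
```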

DAY 3 — RETRIEVAL REGRESSION TEST
- Build a script that runs the golden questions and outputs top-k doc IDs.
- Set a rule: any change to the top-k doc IDs for a golden question is a failure unless explicitly approved.
- If you change embeddings, chunking, or hybrid weights, run this before deploy.
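
The regression script can be little more than a dictionary comparison against a stored baseline. The sketch below assumes a retriever you can call as a function from question text to ranked doc IDs. `FAKE_RESULTS` stands in for your real retriever, and the doc IDs, function names, and `approved` set are hypothetical. It also assumes a pure reordering counts as a change; loosen that if order does not matter to you.

```python
from typing import Callable, Dict, List, Set

Retriever = Callable[[str], List[str]]  # question text -> ranked doc IDs

def run_golden(questions: Dict[str, str], retrieve: Retriever, k: int) -> Dict[str, List[str]]:
    """Run every golden question and keep only the top-k doc IDs."""
    return {qid: retrieve(text)[:k] for qid, text in questions.items()}

def diff_top_k(baseline: Dict[str, List[str]], current: Dict[str, List[str]]) -> Dict[str, dict]:
    """Report every question whose ranked top-k list differs from the baseline."""
    diffs = {}
    for qid in sorted(set(baseline) | set(current)):
        old, new = baseline.get(qid, []), current.get(qid, [])
        if old != new:  # a pure reorder also counts, with empty added/removed
            diffs[qid] = {
                "added": [d for d in new if d not in old],
                "removed": [d for d in old if d not in new],
            }
    return diffs

def unapproved(diffs: Dict[str, dict], approved: Set[str]) -> Dict[str, dict]:
    """Drop diffs that a reviewer has explicitly approved by question ID."""
    return {qid: d for qid, d in diffs.items() if qid not in approved}

# Stand-in data; replace the lambda below with a call to your retriever.
FAKE_RESULTS = {
    "How do I reset my password?": ["doc-7", "doc-2", "doc-9"],
    "What is the refund policy?": ["doc-4", "doc-5", "doc-1"],
}
QUESTIONS = {"q1": "How do I reset my password?", "q2": "What is the refund policy?"}
BASELINE = {"q1": ["doc-7", "doc-2"], "q2": ["doc-4", "doc-1"]}

current = run_golden(QUESTIONS, lambda text: FAKE_RESULTS[text], k=2)
failures = unapproved(diff_top_k(BASELINE, current), approved=set())
```

Keep the baseline in version control so that approving a diff is an ordinary reviewed change to that file.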

DAY 4 — GROUNDING CHECKS
- Add a checker that flags:
 - answers with zero citations when citations are required
 - citations that don’t map to retrieved chunk IDs
 - “citation spam” (many citations with no clear link to the claims they are supposed to support)
- Start with strict rules; relax only with evidence.
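
A first-pass checker for these three flags can be rule-based. The sketch below assumes citations are rendered inline as `[chunk_id]`, which may not match your format. The spam rule (more than three citations in one sentence) is a crude, hypothetical threshold to tune against real answers. It does not verify that a cited chunk actually supports the claim.

```python
import re
from typing import List, Set

# Assumed citation format: [chunk_id] inline in the answer text.
CITATION_RE = re.compile(r"\[([A-Za-z0-9_\-]+)\]")

def check_grounding(
    answer: str,
    retrieved_ids: Set[str],
    citations_required: bool = True,
    max_citations_per_sentence: int = 3,
) -> List[str]:
    """Return a list of flags; an empty list means no rule fired."""
    flags = []
    cited = CITATION_RE.findall(answer)

    if citations_required and not cited:
        flags.append("no_citations")

    # Every cited ID must be one of the chunks retrieved for this request.
    unknown = sorted(set(cited) - retrieved_ids)
    if unknown:
        flags.append("unmapped_citations:" + ",".join(unknown))

    # Crude spam heuristic: too many citations attached to a single sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if any(len(CITATION_RE.findall(s)) > max_citations_per_sentence for s in sentences):
        flags.append("citation_spam")

    return flags
```

For example, `check_grounding("See policy [c9].", {"c1"})` flags `c9` as unmapped because it was never retrieved.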

DAY 5 — PERMISSION PROOFS
- Ensure every chunk has ACL metadata at index time.
- Enforce ACL filters at retrieval time, before ranking, so restricted chunks never reach the model (filtering after the answer is generated is too late).
- Add 5 adversarial tests: same question across two tenants/users where the allowed docs differ. The forbidden user must retrieve zero restricted chunks.
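
The adversarial test is easiest to reason about against a toy index. This sketch uses an in-memory list and word-overlap scoring as stand-ins for a real vector store. The `Chunk` type, tenant names, and chunk contents are all hypothetical. The point it illustrates is ordering: filter on ACL metadata first, then rank.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_tenants: FrozenSet[str]  # ACL metadata written at index time

# Toy index; in production this is your vector store with a metadata filter.
INDEX = [
    Chunk("c1", "refund policy for all customers", frozenset({"acme", "globex"})),
    Chunk("c2", "acme negotiated refund terms", frozenset({"acme"})),
    Chunk("c3", "globex onboarding guide", frozenset({"globex"})),
]

def retrieve(query: str, tenant_id: str, k: int = 3) -> List[Chunk]:
    # Filter by ACL first so restricted chunks never become ranking candidates.
    visible = [c for c in INDEX if tenant_id in c.allowed_tenants]
    words = set(query.lower().split())
    ranked = sorted(
        visible,
        key=lambda c: len(words & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def restricted_hits(query: str, forbidden_tenant: str, restricted_ids: Set[str]) -> List[str]:
    """IDs of restricted chunks the forbidden tenant retrieved; must be empty."""
    return [c.chunk_id for c in retrieve(query, forbidden_tenant) if c.chunk_id in restricted_ids]

# Same question, two tenants: acme may see c2, globex must not.
acme_ids = [c.chunk_id for c in retrieve("refund terms", "acme")]
leak = restricted_hits("refund terms", "globex", {"c2"})
```

Each adversarial test should assert both halves: the allowed tenant does retrieve the restricted chunk (so the test is not vacuous) and the forbidden tenant retrieves none.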

DAY 6 — TOOL BOUNDARIES
- Default tools off. Enable only what you can bound.
- Implement allowlists for tool arguments (IDs/domains/table names).
- Add 5 prompt-injection tests that try to:
 - override system instructions
 - request hidden prompts
 - expand tool scope
 - exfiltrate other-tenant data
- Pass criteria: tools are denied; retrieval stays in-scope; the model refuses or answers safely.
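
Default-off tools with argument allowlists can be expressed as a deny-by-default lookup. In the sketch below the tool names, argument names, and allowed values in `TOOL_POLICY` are hypothetical. It assumes every tool argument is a plain string; anything else is denied.

```python
from typing import Any, Dict, Set, Tuple

# Tools are off unless listed here; every argument has an explicit allowlist.
TOOL_POLICY: Dict[str, Dict[str, Set[str]]] = {
    "query_table": {"table": {"tickets", "faq"}},
    "fetch_url": {"domain": {"docs.example.com"}},
}

def authorize(tool: str, args: Dict[str, Any]) -> Tuple[bool, str]:
    """Return (allowed, reason). Anything not explicitly allowed is denied."""
    policy = TOOL_POLICY.get(tool)
    if policy is None:
        return False, f"tool '{tool}' is not enabled"
    for key, value in args.items():
        allowed = policy.get(key)
        if allowed is None:
            return False, f"argument '{key}' is not allowlisted"
        if not isinstance(value, str) or value not in allowed:
            return False, f"value {value!r} is not allowed for '{key}'"
    missing = sorted(set(policy) - set(args))
    if missing:
        return False, f"missing required arguments: {missing}"
    return True, "ok"
```

Call `authorize` before executing any tool call the model proposes, and log the decision either way so denied attempts show up in the Day 2 traces. A prompt-injection test then passes when the injected request comes back denied.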

DAY 7 — CI GATE + CHANGE CONTROL
- Put the golden suite in CI.
- Block merges on:
 - unexpected retrieval top-k diffs
 - missing citations (if required)
 - any permission test failure
- Create a lightweight runbook: what to do when evals fail (rollback retriever config, revert prompt, pin embeddings, etc.).
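
The merge gate itself can be a small function that turns check results into an exit code. This is a sketch under assumptions: the check names (`retrieval_diff`, `missing_citations`, `permission`) and the shape of `results` are hypothetical, and wiring the return value into your CI system is left to you.

```python
from typing import Dict, List

# Checks that block a merge, matching the list above.
BLOCKING_CHECKS = ("retrieval_diff", "missing_citations", "permission")

def gate(results: Dict[str, List[str]]) -> int:
    """results maps check name -> failure messages. Returns a process exit code."""
    failed = False
    for name in BLOCKING_CHECKS:
        for message in results.get(name, []):
            print(f"FAIL [{name}] {message}")
            failed = True
    return 1 if failed else 0

# Hypothetical run: one permission failure blocks the merge.
exit_code = gate({
    "retrieval_diff": [],
    "missing_citations": [],
    "permission": ["tenant globex retrieved restricted chunk c2"],
})
```

In CI, pass the returned code to `sys.exit` so a non-zero result fails the job.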

ONGOING ROUTINE
- Add 5 new golden questions monthly from real user logs.
- Review retrieval diffs like you review failing unit tests.
- Track incidents by category: retrieval drift, permission leak, tool mistake, or answer grounding failure.
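
Tracking by category only works if the categories are fixed. A minimal sketch, assuming incidents arrive as string labels: the enum below mirrors the four categories above, and the `tally` helper is hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Iterable

class IncidentCategory(Enum):
    RETRIEVAL_DRIFT = "retrieval_drift"
    PERMISSION_LEAK = "permission_leak"
    TOOL_MISTAKE = "tool_mistake"
    GROUNDING_FAILURE = "grounding_failure"

def tally(labels: Iterable[str]) -> Counter:
    """Count incidents per category; an unknown label raises ValueError."""
    return Counter(IncidentCategory(label) for label in labels)

counts = tally(["retrieval_drift", "permission_leak", "retrieval_drift"])
```

Rejecting unknown labels keeps an “other” bucket from absorbing incidents that belong in one of the four.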

If you can produce (1) retrieved chunk IDs, (2) tool calls attempted, and (3) citations shown to the user for any production answer, you’re operating. If you can’t, you’re guessing.