Enterprise LLM Build Spec (Retrieval + Tools + Evals)

Use this as a scoping document for one AI workflow. Keep it strict. If you can’t fill a line, that’s a risk you’re carrying.

1) Workflow definition
- User persona(s): (e.g., L2 support, SRE on-call, AE)
- Job to be done: (one sentence)
- Allowed outcomes: (what “done” means)
- Disallowed outcomes: (what must never happen)

2) System-of-record map (truth sources)
For each source, list:
- System: (Confluence, SharePoint, GitHub, Jira, Salesforce, BigQuery, etc.)
- Data type: (policies, runbooks, tickets, customer records)
- Freshness requirement: (how quickly changes must be reflected)
- Access model: (groups, roles, object-level ACLs)
- Retention/deletion requirements: (who can delete; how fast it disappears)
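
One way to keep this map honest is to record each source as structured data and list the lines left blank. The sketch below is an illustration only; `TruthSource` and `open_risks` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TruthSource:
    """One row of the system-of-record map (hypothetical structure)."""
    system: str
    data_type: Optional[str] = None
    freshness: Optional[str] = None
    access_model: Optional[str] = None
    retention: Optional[str] = None


def open_risks(source):
    """Return the names of fields that are still unfilled, in declaration order."""
    return [f.name for f in fields(source) if not getattr(source, f.name)]
```

Any field name returned by `open_risks` is a line nobody could fill in, which this spec treats as a carried risk.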

3) Retrieval spec (RAG)
- Index unit: (document/page/section)
- Chunking approach: (rule-based; include headings; max size)
- Metadata: (source URL, owner, timestamp, ACL tags)
- Permissioning: (how ACL trimming is enforced at query time)
- Citation requirement: (when citations are mandatory)
- “No answer” behavior: (what the assistant does when retrieval is weak)
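
A minimal sketch of query-time ACL trimming combined with a "no answer" rule. It assumes each chunk carries ACL tags and a retriever score, that access is granted when the user shares at least one group with the chunk, and that 0.5 is a reasonable score floor. All names and the threshold are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source_url: str
    acl_tags: frozenset   # groups allowed to read this chunk
    score: float          # retriever score, higher is better (assumed scale 0..1)


def trim_by_acl(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.acl_tags & user_groups]


def answer_or_abstain(chunks, user_groups, min_score=0.5):
    """Return citations for permitted, strong chunks, or abstain if none remain."""
    allowed = [c for c in trim_by_acl(chunks, user_groups) if c.score >= min_score]
    if not allowed:
        # Retrieval is weak or nothing is visible to this user: do not guess.
        return {"status": "no_answer", "citations": []}
    allowed.sort(key=lambda c: c.score, reverse=True)
    return {"status": "answer", "citations": [c.source_url for c in allowed]}
```

Trimming happens before the score check, so a chunk the user cannot read never influences whether the assistant answers.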

4) Tooling spec (actions)
For each tool:
- Tool name + JSON schema (inputs/outputs)
- AuthZ model: (who can call it; scoped tokens)
- Side effects: (what it changes)
- Confirmation: (when human approval is required)
- Idempotency: (how retries won’t double-apply)
- Rate limits + timeouts: (default values per tool; what happens when one is hit)
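
The confirmation and idempotency lines can be sketched together. This example assumes the caller supplies a unique request ID per intended action and that results are cached in memory; a real deployment would need durable storage. `ToolRunner` is a hypothetical wrapper, not an existing API.

```python
class ToolRunner:
    """Wraps tool execution with an approval check and idempotent retries."""

    def __init__(self, tools):
        # tools maps a tool name to (callable, has_side_effects)
        self._tools = tools
        # request_id -> cached result, so a retry returns the first outcome
        self._completed = {}

    def call(self, request_id, name, args, approved=False):
        func, has_side_effects = self._tools[name]
        if has_side_effects and not approved:
            # Nothing is executed or cached until a human approves.
            return {"status": "needs_approval", "tool": name}
        if request_id in self._completed:
            return self._completed[request_id]
        result = {"status": "ok", "tool": name, "output": func(**args)}
        self._completed[request_id] = result
        return result
```

Keying on a caller-supplied request ID, rather than on the arguments, lets two genuinely separate requests with identical inputs both run while a retry of the same request does not.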

5) Guardrails & policy
- Sensitive data rules: (PII, secrets, customer data)
- Restricted topics: (legal/HR/finance boundaries)
- Escalation path: (handoff to human; ticket creation)
- Redaction strategy: (logs, traces, stored prompts)
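
A minimal sketch of redaction applied before text reaches logs or traces. The two patterns (email addresses and `key=value` style secrets) are illustrative assumptions and far from complete; real PII and secret detection needs a broader rule set.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET = re.compile(r"\b(?:api[_-]?key|token|password)\s*[:=]\s*\S+", re.IGNORECASE)


def redact(text):
    """Replace email addresses and key=value style secrets with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SECRET.sub("[SECRET]", text)
    return text
```

Calling `redact` at the logging boundary, rather than inside each feature, keeps the rule set in one place.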

6) Observability & audit
- What is logged: (prompt, retrieved document IDs, tool calls, outputs)
- Where logs live: (system + retention)
- Audit trail: (who asked, what sources were accessed)
- Incident playbook: (how to disable tools / rollback)
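
As a rough illustration, the logged fields and the "disable tools" step might look like the sketch below. The field names, the in-memory kill switch, and the JSON-lines format are all assumptions; it also assumes the caller has already applied whatever redaction the guardrails require.

```python
import json
from datetime import datetime, timezone

# In-memory kill switch (assumption); production would use shared config.
disabled_tools = set()


def disable_tool(name):
    """Incident step: stop a tool from being called."""
    disabled_tools.add(name)


def is_tool_enabled(name):
    return name not in disabled_tools


def audit_record(user_id, prompt, retrieved_doc_ids, tool_calls, output):
    """Build one JSON log line covering who asked and what was accessed."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "prompt": prompt,
        "retrieved_doc_ids": list(retrieved_doc_ids),
        "tool_calls": list(tool_calls),
        "output": output,
    }
    return json.dumps(record, sort_keys=True)
```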

7) Evaluation gates (ship criteria)
Create a “golden set” of real scenarios and check each one against the gates below.
- Retrieval checks: correct sources retrieved; citations present where required
- Tool checks: schema-valid calls; safe failure behavior
- Policy checks: refusal/escalation where appropriate
- Regression gate: run before every prompt/retriever/model change
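
A regression gate over the golden set can be sketched as follows. It covers only the retrieval and policy checks; tool schema checks are omitted. The case and result field names (`expected_sources`, `citations_required`, `must_refuse`, `sources`, `citations`, `refused`) are hypothetical, and `run_workflow` stands in for whatever callable executes the assistant.

```python
def evaluate_case(case, result):
    """Return a list of failure messages for one golden-set case."""
    failures = []
    missing = sorted(set(case["expected_sources"]) - set(result["sources"]))
    if missing:
        failures.append(f"missing sources: {missing}")
    if case["citations_required"] and not result["citations"]:
        failures.append("citations required but absent")
    if case["must_refuse"] != result["refused"]:
        failures.append(
            f"expected refused={case['must_refuse']}, got {result['refused']}"
        )
    return failures


def regression_gate(golden_set, run_workflow):
    """Run every case; the gate passes only if no case has failures."""
    report = {
        case["id"]: evaluate_case(case, run_workflow(case["input"]))
        for case in golden_set
    }
    passed = all(not failures for failures in report.values())
    return passed, report
```

Running `regression_gate` in CI before any prompt, retriever, or model change turns the ship criteria into a blocking check rather than a manual review.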

8) Model portability plan
- Model interface: (single abstraction point)
- Prompt versioning: (how prompt changes are tracked)
- Fallback model: (what happens on outage/limits)
- Provider exit plan: (what breaks if you swap models)
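
The single abstraction point and the fallback can be sketched with one small interface. This assumes each provider adapter raises a shared `ModelUnavailable` error on outage or rate limiting; the class names are hypothetical and no real provider SDK is shown.

```python
from typing import Protocol


class ModelUnavailable(Exception):
    """Raised by an adapter on outage or rate limiting (assumed convention)."""


class ChatModel(Protocol):
    """The only interface the rest of the workflow is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class WithFallback:
    """Tries the primary model, then the fallback if the primary is unavailable."""

    def __init__(self, primary: ChatModel, fallback: ChatModel):
        self.primary = primary
        self.fallback = fallback

    def complete(self, prompt: str) -> str:
        try:
            return self.primary.complete(prompt)
        except ModelUnavailable:
            return self.fallback.complete(prompt)
```

Because `WithFallback` itself satisfies `ChatModel`, callers cannot tell which provider answered, which is what makes a later swap cheap to test.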

If you do only one thing: define the truth sources and the allowed actions. Everything else becomes much easier, and fine-tuning becomes optional instead of a crutch.