MODEL CHOICE PRODUCT SPEC TEMPLATE (BEHAVIOR CONTRACTS + ROUTING POLICY)

Use this for ONE AI workflow (e.g., “Summarize support ticket,” “Draft outbound email,” “Update CRM record”). Keep it to 1–2 pages per workflow.

1) WORKFLOW DEFINITION
- Name:
- User goal (one sentence):
- Entry point in UI (screen, button, shortcut):
- Output destination (where it lands):
- Stakes: low / medium / high (define what can go wrong)

2) BEHAVIOR CONTRACT (TESTABLE)
A) Inputs you guarantee
- System-provided context (account, permissions, recent actions):
- User-provided context (fields, files):
- Retrieved context (docs, tickets, knowledge base):
- Hard limits (max files, max text length, allowed formats):

B) Output contract
- Output type: free text / structured JSON / tool calls only
- Required fields (if structured):
- Tone/style constraints (if any):
- “No-answer” behavior (what it should do if it lacks info):
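
Example (a minimal sketch, not a prescribed implementation): one way to make the output contract testable is a checker that every model output must pass. The field names in REQUIRED_FIELDS and the NO_ANSWER shape are hypothetical placeholders; substitute the fields your workflow defines.

```python
import json

# Hypothetical required fields and their types for a structured output.
REQUIRED_FIELDS = {"summary": str, "priority": str}
# Hypothetical shape the model must return when it lacks information.
NO_ANSWER = {"status": "no_answer"}


def check_output(raw: str) -> tuple:
    """Return (ok, reason) for one raw model output string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if not isinstance(data, dict):
        return False, "not a JSON object"
    # An explicit no-answer is a valid outcome, not a failure.
    if data == NO_ANSWER:
        return True, "no-answer"
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            return False, f"missing field: {name}"
        if not isinstance(data[name], expected_type):
            return False, f"wrong type: {name}"
    return True, "ok"
```

The same checker can run in production (to trigger a retry or fallback) and in the eval suite, so both enforce one definition of "valid output."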

C) Tooling contract
- Allowed tools (list exact tool names):
- Disallowed tools:
- Confirmation rules (what requires explicit user approval):
- Idempotency expectations (what can be safely retried):
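
Example (a sketch under stated assumptions): the tooling contract can be expressed as data plus two small checks that run before any tool call executes. The tool names "search_tickets" and "update_crm_record" are hypothetical, as are the decision labels "deny", "ask_user", and "run".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed: frozenset             # exact tool names the workflow may call
    needs_confirmation: frozenset  # tools that require explicit user approval
    idempotent: frozenset          # tools that are safe to retry


# Hypothetical policy for a CRM-style workflow.
POLICY = ToolPolicy(
    allowed=frozenset({"search_tickets", "update_crm_record"}),
    needs_confirmation=frozenset({"update_crm_record"}),
    idempotent=frozenset({"search_tickets"}),
)


def authorize(tool: str, user_confirmed: bool, policy: ToolPolicy = POLICY) -> str:
    """Decide what to do with a tool call the model requested."""
    if tool not in policy.allowed:
        return "deny"  # anything not listed is treated as disallowed
    if tool in policy.needs_confirmation and not user_confirmed:
        return "ask_user"
    return "run"


def can_retry(tool: str, policy: ToolPolicy = POLICY) -> bool:
    """Only allowed tools marked idempotent may be retried automatically."""
    return tool in policy.allowed and tool in policy.idempotent
```

Treating unlisted tools as denied keeps the contract stable when a model or provider starts proposing calls you never specified.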

D) Refusal and safety contract
- Must refuse for (list categories relevant to your domain):
- Must escalate to human for:
- Must ask clarifying question when:

E) Evidence/grounding contract
- Must cite sources when:
- What counts as a valid source in your system:
- Forbidden sources (e.g., “general knowledge” for policy text):

3) ROUTING POLICY (PRODUCT DECISIONS)
- Default model class: small-fast / long-context / tool-reliable / best-available
- When to route to higher-end model:
- When to route to cheaper model:
- Fallback triggers (timeouts, schema failures, refusal spikes):
- Customer/org allowlist support needed? yes/no
- Region/data locality constraints:
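
Example (a sketch, assuming a simple rule-based router): the routing policy above can be written as a function from request properties to a model class, plus a fallback table. The Request fields, the 32,000-token threshold, the trigger names, and the fallback order are all hypothetical; the model class names come from this template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    stakes: str          # "low", "medium", or "high"
    context_tokens: int  # estimated size of the assembled context
    needs_tools: bool    # whether the workflow expects tool calls


LONG_CONTEXT_THRESHOLD = 32_000  # hypothetical cutoff

# Hypothetical escalation order when a fallback trigger fires.
FALLBACK = {
    "small-fast": "tool-reliable",
    "long-context": "best-available",
    "tool-reliable": "best-available",
}
FALLBACK_TRIGGERS = {"timeout", "schema_failure", "refusal_spike"}


def choose_model_class(req: Request) -> str:
    """Pick a model class by product rules, not by vendor name."""
    if req.stakes == "high":
        return "best-available"
    if req.needs_tools:
        return "tool-reliable"
    if req.context_tokens > LONG_CONTEXT_THRESHOLD:
        return "long-context"
    return "small-fast"


def next_model_class(current: str, trigger: str) -> Optional[str]:
    """Return the class to retry with, or None if no fallback applies."""
    if trigger not in FALLBACK_TRIGGERS:
        return None
    return FALLBACK.get(current)
```

Because the function returns a model class rather than a specific model, swapping providers only changes the mapping from class to model, not the policy.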

4) EVALS AS SHIP GATES (QA, NOT RESEARCH)
Create a “golden set” of 20–100 cases.
- Tool correctness tests (expected tool + arguments):
- Grounding tests (includes “no-answer” cases):
- Formatting/schema tests (invalid JSON is a failure):
- Safety/refusal tests (domain-specific):
- Recovery tests (tool failure, partial data):
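
Example (a minimal sketch of a golden-set runner for the tool correctness tests): each case records the expected tool and arguments, and the runner groups failures by category so they can feed ship gates. The Case fields and the predict callable are hypothetical; predict stands in for whatever invokes your workflow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Case:
    name: str
    category: str        # e.g. "tool", "grounding", "schema", "safety", "recovery"
    expected_tool: str
    expected_args: dict


def run_golden_set(
    cases: List[Case],
    predict: Callable[[Case], Tuple[str, dict]],
) -> Dict[str, List[str]]:
    """Run every case and return failing case names grouped by category."""
    failures: Dict[str, List[str]] = {}
    for case in cases:
        tool, args = predict(case)
        # A case passes only if both the tool and its arguments match exactly.
        if tool != case.expected_tool or args != case.expected_args:
            failures.setdefault(case.category, []).append(case.name)
    return failures
```

An empty result means every case passed; a release script can block on any non-empty category.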

Ship gates (fill these in):
- Block release if:
 - Any test shows the workflow mutating user/customer data incorrectly
 - Any output fails schema validation in a way that breaks downstream parsing
 - Any required citation is missing
- Canary plan:
 - Traffic slice:
 - Metrics to watch (tool errors, refusal rate shifts, latency):
 - Auto-rollback rule:
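
Example (a sketch, assuming canary and baseline metrics are already aggregated): an auto-rollback rule can compare the canary slice against the baseline on the metrics listed above. The metric keys and all three thresholds are hypothetical and should be set from your own traffic.

```python
from typing import Dict, List


def rollback_reasons(
    baseline: Dict[str, float],
    canary: Dict[str, float],
    max_tool_error_delta: float = 0.02,  # hypothetical: +2 points of tool errors
    max_refusal_delta: float = 0.05,     # hypothetical: 5-point shift either way
    max_latency_ratio: float = 1.5,      # hypothetical: 50% slower at p95
) -> List[str]:
    """Return the list of tripped rules; roll back if it is non-empty."""
    reasons = []
    if canary["tool_error_rate"] - baseline["tool_error_rate"] > max_tool_error_delta:
        reasons.append("tool_errors")
    # Refusal rate is checked in both directions: a sharp drop can be as
    # suspicious as a spike.
    if abs(canary["refusal_rate"] - baseline["refusal_rate"]) > max_refusal_delta:
        reasons.append("refusal_shift")
    if canary["p95_latency_ms"] > baseline["p95_latency_ms"] * max_latency_ratio:
        reasons.append("latency")
    return reasons
```

Returning the tripped rules, rather than a bare boolean, gives the rollback a reason that can be logged and audited.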

5) OBSERVABILITY & AUDIT
- What you log: prompts, retrieved document IDs, tool calls, outputs
- Redaction rules (fields to redact):
- User-facing history: what users can view/export
- Admin controls: disable features by role/org
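
Example (a minimal sketch of field-level redaction before logging): redaction rules can be applied recursively to each log record so nested tool arguments are covered too. The field names in REDACT_FIELDS and the "[REDACTED]" marker are hypothetical.

```python
# Hypothetical set of field names that must never reach the logs.
REDACT_FIELDS = {"email", "phone"}


def redact(record):
    """Return a copy of record with sensitive fields masked at any depth."""
    if isinstance(record, dict):
        return {
            key: "[REDACTED]" if key in REDACT_FIELDS else redact(value)
            for key, value in record.items()
        }
    if isinstance(record, list):
        return [redact(item) for item in record]
    return record  # scalars pass through unchanged
```

This masks by field name only; free-text fields such as prompts may still contain sensitive values and need their own rule.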

6) COST & UX CONTROLS
- Context budget rules (what gets trimmed first):
- Progressive depth (cheap draft → user confirms deeper run):
- Caching artifacts (summaries, extracted entities):
- Metering UI (how users see background work):
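
Example (a sketch of one possible context budget rule): list context parts in keep order, most important first, and drop from the end until the total fits. The whitespace-based token count is a crude stand-in for a real tokenizer, and the part names are hypothetical.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_context(parts: List[Tuple[str, str]], budget: int) -> List[Tuple[str, str]]:
    """parts is (name, text) in keep order; the last entries are trimmed first."""
    kept = list(parts)
    total = sum(count_tokens(text) for _, text in kept)
    while kept and total > budget:
        _, dropped_text = kept.pop()  # drop the least important part
        total -= count_tokens(dropped_text)
    return kept
```

For example, with parts ordered system, user, retrieved, the retrieved documents are the first to go when the budget is exceeded.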

DONE CHECK
If you can swap model providers and the spec still holds, you built a product surface. If the spec has to change for each model, you are shipping the vendor's behavior, not your own product.