MODEL CHOICE PRODUCT SPEC TEMPLATE (BEHAVIOR CONTRACTS + ROUTING POLICY)

Use this for ONE AI workflow (e.g., “Summarize support ticket,” “Draft outbound email,” “Update CRM record”). Keep it to 1–2 pages per workflow.

1) WORKFLOW DEFINITION
- Name:
- User goal (one sentence):
- Entry point in UI (screen, button, shortcut):
- Output destination (where it lands):
- Stakes: low / medium / high (define what can go wrong)

2) BEHAVIOR CONTRACT (TESTABLE)

A) Inputs you guarantee
- System-provided context (account, permissions, recent actions):
- User-provided context (fields, files):
- Retrieved context (docs, tickets, knowledge base):
- Hard limits (max files, max text length, allowed formats):

B) Output contract
- Output type: free text / structured JSON / tool calls only
- Required fields (if structured):
- Allowed tone/style constraints (if any):
- “No-answer” behavior (what it should do if it lacks info):

C) Tooling contract
- Allowed tools (list exact tool names):
- Disallowed tools:
- Confirmation rules (what requires explicit user approval):
- Idempotency expectations (what can be safely retried):

D) Refusal and safety contract
- Must refuse for (list categories relevant to your domain):
- Must escalate to human for:
- Must ask clarifying question when:

E) Evidence/grounding contract
- Must cite sources when:
- What counts as a valid source in your system:
- Forbidden sources (e.g., “general knowledge” for policy text):

3) ROUTING POLICY (PRODUCT DECISIONS)
- Default model class: small-fast / long-context / tool-reliable / best-available
- When to route to higher-end model:
- When to route to cheaper model:
- Fallback triggers (timeouts, schema failures, refusal spikes):
- Customer/org allowlist support needed? yes/no
- Region/data locality constraints:

4) EVALS AS SHIP GATES (QA, NOT RESEARCH)
Create a “golden set” of 20–100 cases.
- Tool correctness tests (expected tool + arguments):
- Grounding tests (includes “no-answer” cases):
- Formatting/schema tests (invalid JSON is a failure):
- Safety/refusal tests (domain-specific):
- Recovery tests (tool failure, partial data):

Ship gates (fill these in):
- Block release if:
  - Any test can mutate user/customer data incorrectly
  - Schema breaks downstream parsing
  - Required citations are missing
- Canary plan:
  - Traffic slice:
  - Metrics to watch (tool errors, refusal rate shifts, latency):
  - Auto-rollback rule:

5) OBSERVABILITY & AUDIT
- What you log: prompts, retrieved document IDs, tool calls, outputs
- Redaction rules (fields to redact):
- User-facing history: what users can view/export
- Admin controls: disable features by role/org

6) COST & UX CONTROLS
- Context budget rules (what gets trimmed first):
- Progressive depth (cheap draft → user confirms deeper run):
- Caching artifacts (summaries, extracted entities):
- Metering UI (how users see background work):

DONE CHECK
If you can swap model providers and the spec still holds, you have built a product surface. If the spec changes per model, you’re shipping vendor behavior, not your own.
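
EXAMPLE SKETCH (ILLUSTRATIVE ONLY)
One way to make the output contract (2B), the fallback triggers (3), and the done check concrete is to keep the contract check independent of any model and let routing fall back whenever a model times out or breaks the schema. The Python sketch below is a minimal illustration, not a reference implementation. Every name in it (REQUIRED_FIELDS, check_output_contract, route, the stub models, and the sample ticket ID) is a hypothetical placeholder, and it assumes a structured JSON output with "summary" and "sources" fields.

```python
import json
from typing import Callable, Optional

# Hypothetical output contract for one workflow (section 2B):
# structured JSON with these required fields.
REQUIRED_FIELDS = {"summary", "sources"}


def check_output_contract(raw: str) -> Optional[dict]:
    """Return the parsed output if it satisfies the contract, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # invalid JSON is a failure
    if not isinstance(data, dict):
        return None
    if not REQUIRED_FIELDS.issubset(data):
        return None  # a required field is missing
    if not data["sources"]:
        return None  # required citations are missing
    return data


def route(prompt: str, models: list[tuple[str, Callable[[str], str]]]) -> tuple[str, dict]:
    """Try model classes in order; fall back on timeout or contract failure."""
    for name, call in models:
        try:
            raw = call(prompt)
        except TimeoutError:
            continue  # fallback trigger: timeout
        parsed = check_output_contract(raw)
        if parsed is not None:
            return name, parsed
        # fallback trigger: schema failure, so try the next model class
    raise RuntimeError("all model classes failed the output contract")


# Stub models standing in for real providers (assumptions for illustration).
def small_fast(prompt: str) -> str:
    return "Sure! Here is your summary."  # free text, breaks the JSON contract


def best_available(prompt: str) -> str:
    return json.dumps({"summary": "Customer cannot log in.", "sources": ["ticket-123"]})


chosen, output = route(
    "Summarize support ticket",
    [("small-fast", small_fast), ("best-available", best_available)],
)
print(chosen, output["summary"])
```

Because check_output_contract never inspects which model produced the text, swapping a provider only changes the list passed to route. The contract itself stays the same, which is the property the done check asks for.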