MODEL ROUTING CONTRACT + EVAL STARTER CHECKLIST (copy/paste)

Goal: Make multi-model AI behavior predictable. This template forces you to define routing decisions, fallbacks, and evidence logs so you can debug incidents and pass security/compliance reviews.

1) ROUTING CONTRACT (INTERFACE)
A. Request envelope (inputs)
- user_text: string
- user_id / org_id: identifiers
- channel: web | mobile | api | internal
- locale + region: for jurisdiction constraints
- data_class: public | internal | confidential | regulated
- allowed_tools: list (explicit allowlist)
- max_latency_ms: set by product surface (interactive vs batch)
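
A minimal sketch of the request envelope as a typed record, assuming Python 3.9+ and the standard library only. The class name RequestEnvelope and the validation rules are illustrative assumptions, not part of the contract itself; the field names come from the list above.

```python
from dataclasses import dataclass
from typing import Literal, Tuple, get_args

Channel = Literal["web", "mobile", "api", "internal"]
DataClass = Literal["public", "internal", "confidential", "regulated"]


@dataclass(frozen=True)
class RequestEnvelope:
    """Hypothetical typed form of the request envelope fields."""
    user_text: str
    user_id: str
    org_id: str
    channel: Channel
    locale: str
    region: str
    data_class: DataClass
    max_latency_ms: int
    allowed_tools: Tuple[str, ...] = ()  # explicit allowlist; empty means no tools

    def __post_init__(self) -> None:
        # Literal types are not enforced at runtime, so check them by hand.
        if self.channel not in get_args(Channel):
            raise ValueError(f"unknown channel: {self.channel!r}")
        if self.data_class not in get_args(DataClass):
            raise ValueError(f"unknown data_class: {self.data_class!r}")
        if self.max_latency_ms <= 0:
            raise ValueError("max_latency_ms must be positive")
```

Rejecting unknown values at the boundary keeps the router from silently treating a mistyped data_class as the least restrictive one.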

B. Router decision (outputs)
- route_id: stable string (e.g., support_refund_low_risk_v3)
- risk_tier: low | medium | high
- model_lane: fast_small | premium_reasoning | private_open_weight
- decoding: temperature/top_p + any provider-specific settings you use
- context_plan:
 - retrieval: off | light | deep
 - index_name + top_k
 - rerank: on/off
- tool_plan:
 - tool names + required/optional
 - max_tool_calls
- output_contract:
 - JSON schema name OR “freeform_with_citations”
 - citation_required: true/false
- fallbacks:
 - on_timeout: retry strategy + next lane
 - on_schema_fail: repair prompt then retry; else escalate
 - on_refusal: alternate lane or human handoff
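
The fallback rules above can be sketched as a small decision function. This is one possible encoding, not a prescribed one: the Fallbacks fields, the next_action name, and the returned action strings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fallbacks:
    """Hypothetical fallback settings for one route."""
    timeout_retries: int              # same-lane retries before switching lanes
    timeout_next_lane: str            # lane to use once retries are exhausted
    schema_repair_attempts: int       # repair-prompt retries before escalating
    refusal_next_lane: Optional[str]  # None means hand off to a human


def next_action(fb: Fallbacks, failure: str, attempt: int) -> str:
    """Return the next step after a failed call; attempt counts from 1."""
    if failure == "timeout":
        if attempt <= fb.timeout_retries:
            return "retry"
        return f"switch_lane:{fb.timeout_next_lane}"
    if failure == "schema_fail":
        if attempt <= fb.schema_repair_attempts:
            return "repair_prompt"
        return "escalate"
    if failure == "refusal":
        if fb.refusal_next_lane is not None:
            return f"switch_lane:{fb.refusal_next_lane}"
        return "human_handoff"
    raise ValueError(f"unknown failure type: {failure!r}")
```

Keeping the policy in a pure function like this makes it easy to replay a logged failure sequence and confirm the router would take the same path.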

C. Required logs (non-negotiable)
- trace_id spanning router → retrieval → tools → final model
- route_id + model used + version identifiers
- retrieved document IDs/chunk IDs (not raw docs if sensitive)
- tool call args + tool results (redact secrets)
- output validation result (pass/fail + reason)
- policy checks fired (which rules, which tier)
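
A minimal sketch of one evidence record covering the fields above, written as a JSON line with secrets masked. The key names, the SECRET_KEYS list, and the build_log_record helper are assumptions for illustration; adapt them to whatever logging pipeline you already run.

```python
import json

# Assumed list of secret-bearing keys; extend it for your own tools.
SECRET_KEYS = {"api_key", "authorization", "password", "token"}


def redact(value):
    """Recursively mask values stored under secret-looking keys."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SECRET_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def build_log_record(*, trace_id, route_id, model, model_version,
                     chunk_ids, tool_calls, validation, policy_checks):
    """Serialize one evidence record as a single JSON line."""
    record = {
        "trace_id": trace_id,
        "route_id": route_id,
        "model": model,
        "model_version": model_version,
        "retrieved_chunk_ids": list(chunk_ids),  # IDs only, never raw text
        "tool_calls": redact(list(tool_calls)),
        "validation": validation,                # pass/fail plus reason
        "policy_checks": list(policy_checks),
    }
    return json.dumps(record, sort_keys=True)
```

Key-name matching only catches secrets stored under predictable keys; secrets embedded in free-text arguments need a separate scrubber.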

2) RISK TIERS (DEFINE THEM)
- Low: no PII, no irreversible actions, and output the user can easily verify
- Medium: may include PII OR business-critical guidance; any actions taken must be reversible
- High: regulated domains (health/finance/employment/legal) OR irreversible actions OR privileged internal access
For each tier, specify:
- which tools are allowed
- whether citations are mandatory
- whether a human approval step exists
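
The tier definitions can be sketched as a classifier plus a per-tier policy table. The input flags, the tool names (search_kb, issue_refund), and the specific policy values are hypothetical placeholders; the tier conditions follow the definitions above.

```python
REGULATED_DOMAINS = {"health", "finance", "employment", "legal"}


def classify_risk(*, has_pii=False, business_critical=False,
                  irreversible_action=False, privileged_access=False,
                  domain=None) -> str:
    """Map request attributes to low | medium | high, checking high first."""
    if domain in REGULATED_DOMAINS or irreversible_action or privileged_access:
        return "high"
    if has_pii or business_critical:
        return "medium"
    return "low"


# Hypothetical answers to the three per-tier questions; fill in your own.
TIER_POLICY = {
    "low": {"allowed_tools": ["search_kb"],
            "citations_required": False, "human_approval": False},
    "medium": {"allowed_tools": ["search_kb"],
               "citations_required": True, "human_approval": False},
    "high": {"allowed_tools": ["search_kb", "issue_refund"],
             "citations_required": True, "human_approval": True},
}
```

Checking the high-tier conditions first means a request that matches several tiers always lands in the strictest one.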

3) MINIMUM EVAL SUITE (RUN ON EVERY ROUTE CHANGE)
- Schema/contract tests: does the route produce schema-valid JSON for every one of 20–50 fixtures?
- Retrieval groundedness: for a fixed doc set, do citations point to relevant chunks?
- Tool reliability: replay traces with mocked tools; inject timeouts and malformed outputs
- Safety/policy regression: prompts that must refuse + prompts that must comply
- Cost/latency budget: enforce max context size + max tool calls; load test per lane
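
A minimal sketch of the schema/contract test from the list above, using only the standard library. The REQUIRED_FIELDS contract, the fixture shape, and the route_fn callable are assumptions; in practice you would likely swap the hand-rolled check for a full JSON Schema validator.

```python
import json

# Hypothetical output contract: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list}


def validate_output(raw: str):
    """Return (ok, reason) for one raw model output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top level is not an object"
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            return False, f"missing field: {name}"
        if not isinstance(data[name], expected):
            return False, f"wrong type for field: {name}"
    return True, "ok"


def run_contract_suite(route_fn, fixtures):
    """Run route_fn on every fixture; return a list of (fixture, reason) failures."""
    failures = []
    for fixture in fixtures:
        ok, reason = validate_output(route_fn(fixture))
        if not ok:
            failures.append((fixture, reason))
    return failures
```

An empty failure list is the pass condition; the (pass/fail + reason) pair is the same shape the required logs ask for, so the test and production validators can share code.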

4) DEPLOYMENT CHECKS (BEFORE YOU SHIP A NEW ROUTER POLICY)
- Can you replay a production trace end-to-end in staging?
- Do you have a tested provider failover path?
- Are model/version pins explicit (no silent upgrades)?
- Does every route have an owner and an on-call playbook link?
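
Two of the checks above (explicit pins, owner plus playbook link) can be automated against a route config. The sketch below assumes a hypothetical config shape with model_version, owner, and playbook_url keys, and an assumed list of floating aliases; what counts as a pinned version depends on your provider.

```python
# Assumed aliases that indicate an unpinned, silently upgradable model.
FLOATING_ALIASES = {"", "latest", "default", "stable"}


def deployment_gaps(routes: dict) -> list:
    """Return human-readable gaps; an empty list means these checks pass."""
    gaps = []
    for route_id, cfg in sorted(routes.items()):
        version = str(cfg.get("model_version", "")).strip().lower()
        if version in FLOATING_ALIASES:
            gaps.append(f"{route_id}: model version is not pinned")
        if not cfg.get("owner"):
            gaps.append(f"{route_id}: no owner")
        if not cfg.get("playbook_url"):
            gaps.append(f"{route_id}: no on-call playbook link")
    return gaps
```

Running this in CI turns those two questions into a blocking check; trace replay and provider failover still need a real staging exercise.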

5) STARTER PLAYBOOK FOR INCIDENTS
- If outputs break schema: lock to most reliable lane; disable “repair loops”; ship hotfix with stricter validation
- If hallucinations spike: reduce retrieval top_k; enable reranking; require citations; route to a more grounded lane
- If tool calls loop: cap max_tool_calls; add tool result sanity checks; add a stop condition
- If provider errors increase: flip traffic to fallback lane; degrade gracefully (read-only mode; human handoff)
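
For the tool-loop case, a guard along these lines combines the max_tool_calls cap with one simple stop condition (an exact repeat of an earlier call). ToolCallGuard and ToolLoopError are hypothetical names, and treating any identical repeat as a loop is an assumption that may be too strict for tools that are legitimately polled.

```python
import json


class ToolLoopError(RuntimeError):
    """Raised when a tool call should be blocked."""


class ToolCallGuard:
    def __init__(self, max_tool_calls: int):
        self.max_tool_calls = max_tool_calls
        self.calls = 0
        self._seen = set()

    def check(self, tool_name: str, args: dict) -> None:
        """Call before each tool invocation; raises ToolLoopError to stop."""
        if self.calls >= self.max_tool_calls:
            raise ToolLoopError("max_tool_calls exceeded")
        # Canonical JSON makes argument order irrelevant when comparing calls.
        key = (tool_name, json.dumps(args, sort_keys=True))
        if key in self._seen:
            raise ToolLoopError("repeated identical tool call")
        self._seen.add(key)
        self.calls += 1
```

Create one guard per trace so the counter and the seen-set reset between requests.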

Use this file as a living contract. If your router can’t explain its choices in logs, it’s not a router; it’s a guess.