MODEL ROUTING CONTRACT + EVAL STARTER CHECKLIST (copy/paste)

Goal: Make multi-model AI behavior predictable. This template forces you to define routing decisions, fallbacks, and evidence logs so you can debug incidents and pass security/compliance reviews.

1) ROUTING CONTRACT (INTERFACE)

A. Request envelope (inputs)
- user_text: string
- user_id / org_id: identifiers
- channel: web | mobile | api | internal
- locale + region: for jurisdiction constraints
- data_class: public | internal | confidential | regulated
- allowed_tools: list (explicit allowlist)
- max_latency_ms: set by product surface (interactive vs batch)

B. Router decision (outputs)
- route_id: stable string (e.g., support_refund_low_risk_v3)
- risk_tier: low | medium | high
- model_lane: fast_small | premium_reasoning | private_open_weight
- decoding: temperature/top_p + any provider-specific settings you use
- context_plan:
  - retrieval: off | light | deep
  - index_name + top_k
  - rerank: on/off
- tool_plan:
  - tool names + required/optional
  - max_tool_calls
- output_contract:
  - json schema name OR “freeform_with_citations”
  - citation_required: true/false
- fallbacks:
  - on_timeout: retry strategy + next lane
  - on_schema_fail: repair prompt then retry; else escalate
  - on_refusal: alternate lane or human handoff

C. Required logs (non-negotiable)
- trace_id spanning router → retrieval → tools → final model
- route_id + model used + version identifiers
- retrieved document IDs/chunk IDs (not raw docs if sensitive)
- tool call args + tool results (redact secrets)
- output validation result (pass/fail + reason)
- policy checks fired (which rules, which tier)

2) RISK TIERS (DEFINE THEM)
- Low: no PII, no irreversible actions, user can verify easily
- Medium: may include PII OR business-critical guidance, reversible actions only
- High: regulated domains (health/finance/employment/legal) OR irreversible actions OR privileged internal access

For each tier, specify:
- which tools are allowed
- whether citations are mandatory
- whether a human approval step exists

3) MINIMUM EVAL SUITE (RUN ON EVERY ROUTE CHANGE)
- Schema/contract tests: can the route produce valid JSON for 20–50 fixtures?
- Retrieval groundedness: for a fixed doc set, do citations point to relevant chunks?
- Tool reliability: replay traces with mocked tools; inject timeouts and malformed outputs
- Safety/policy regression: prompts that must refuse + prompts that must comply
- Cost/latency budget: enforce max context size + max tool calls; load test per lane

4) DEPLOYMENT CHECKS (BEFORE YOU SHIP A NEW ROUTER POLICY)
- Can you replay a production trace end-to-end in staging?
- Do you have a tested provider failover path?
- Are model/version pins explicit (no silent upgrades)?
- Does every route have an owner and an on-call playbook link?
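
Example sketch (Python, standard library only): one way the request envelope, router decision, and risk tiers from sections 1 and 2 could be written as typed records. The field names come from the template. Everything else is an illustrative assumption and not part of the contract: the tool names in IRREVERSIBLE_TOOLS, the data_class-to-tier mapping, the TIER_POLICY table, and the route_id pattern. Replace them with your own rules.

```python
from dataclasses import dataclass, asdict
from typing import Literal
import json

DataClass = Literal["public", "internal", "confidential", "regulated"]
RiskTier = Literal["low", "medium", "high"]
Lane = Literal["fast_small", "premium_reasoning", "private_open_weight"]


@dataclass(frozen=True)
class RequestEnvelope:
    """Inputs from section 1A."""
    user_text: str
    user_id: str
    org_id: str
    channel: str
    locale: str
    region: str
    data_class: DataClass
    allowed_tools: tuple[str, ...]
    max_latency_ms: int


@dataclass(frozen=True)
class RouterDecision:
    """A subset of the outputs from section 1B."""
    route_id: str
    risk_tier: RiskTier
    model_lane: Lane
    max_tool_calls: int
    citation_required: bool
    human_approval: bool


# Hypothetical tool names treated as irreversible actions.
IRREVERSIBLE_TOOLS = frozenset({"issue_refund", "delete_account"})

# Hypothetical per-tier policy: (lane, max_tool_calls, citations, human approval).
TIER_POLICY = {
    "low": ("fast_small", 2, False, False),
    "medium": ("premium_reasoning", 4, True, False),
    "high": ("private_open_weight", 2, True, True),
}


def classify_risk(env: RequestEnvelope) -> RiskTier:
    """Assumed mapping: regulated data or irreversible tools force the high tier."""
    if env.data_class == "regulated" or IRREVERSIBLE_TOOLS & set(env.allowed_tools):
        return "high"
    if env.data_class == "confidential":
        return "medium"
    return "low"


def route(env: RequestEnvelope) -> RouterDecision:
    """Look up the tier policy and build a decision with a versioned route_id."""
    tier = classify_risk(env)
    lane, max_calls, cite, approve = TIER_POLICY[tier]
    return RouterDecision(
        route_id=f"{env.channel}_{tier}_risk_v1",
        risk_tier=tier,
        model_lane=lane,
        max_tool_calls=max_calls,
        citation_required=cite,
        human_approval=approve,
    )


def log_record(trace_id: str, decision: RouterDecision) -> str:
    """Serialize the decision as one JSON log line keyed by trace_id."""
    return json.dumps({"trace_id": trace_id, **asdict(decision)}, sort_keys=True)
```

Because the decision is an immutable record serialized next to trace_id, the same structure can feed both the required logs and the schema/contract fixtures in the eval suite.
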

5) STARTER PLAYBOOK FOR INCIDENTS
- If outputs break schema: lock to most reliable lane; disable “repair loops”; ship hotfix with stricter validation
- If hallucinations spike: reduce retrieval top_k; enable reranking; require citations; route to a more grounded lane
- If tool calls loop: cap max_tool_calls; add tool result sanity checks; add a stop condition
- If provider errors increase: flip traffic to fallback lane; degrade gracefully (read-only mode; human handoff)

Use this file as a living contract. If your router can’t explain its choices in logs, it’s not a router; it’s a guess.
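
Example sketch (Python, standard library only): one possible shape for the on_schema_fail fallback ("repair prompt then retry; else escalate") with a bounded repair loop. REFUND_SCHEMA, the call_model callable, the repair prompt wording, and the returned status strings are all hypothetical. A real route would likely validate against a proper JSON Schema and call its own provider client.

```python
import json
from typing import Callable

# Hypothetical output contract: required keys and their expected Python types.
REFUND_SCHEMA = {"decision": str, "amount": (int, float), "citations": list}


def validate(raw: str, schema: dict) -> tuple[bool, str]:
    """Return (passed, reason) so the result can be logged as pass/fail + reason."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid_json: {exc.msg}"
    if not isinstance(data, dict):
        return False, "not_an_object"
    for key, expected in schema.items():
        if key not in data:
            return False, f"missing_key: {key}"
        if not isinstance(data[key], expected):
            return False, f"wrong_type: {key}"
    return True, "ok"


def run_with_schema_fallback(
    call_model: Callable[[str], str],
    prompt: str,
    schema: dict,
    max_repairs: int = 1,
) -> dict:
    """Call the model, attempt bounded repairs, then escalate instead of looping."""
    attempts = []
    current_prompt = prompt
    for attempt in range(max_repairs + 1):
        raw = call_model(current_prompt)
        ok, reason = validate(raw, schema)
        attempts.append({"attempt": attempt, "passed": ok, "reason": reason})
        if ok:
            return {"status": "ok", "output": json.loads(raw), "attempts": attempts}
        # Repair prompt: restate the contract and include the validation failure.
        current_prompt = (
            f"{prompt}\n\nYour previous reply failed validation ({reason}). "
            f"Reply with JSON only, containing keys: {sorted(schema)}."
        )
    # Repairs exhausted: hand off to the next lane or a human, per the fallbacks.
    return {"status": "escalate", "output": None, "attempts": attempts}
```

Setting max_repairs=0 corresponds to the "disable repair loops" step in the incident playbook: a single failed validation escalates immediately, and the attempts list still records the failure reason for the logs.
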