The most expensive line in your AI roadmap is the one that says “add an agent.” Not because agents are useless—because most teams ship them without a contract.
“Contract” doesn’t mean a legal PDF. It means an engineering artifact: explicit allowed actions, forbidden actions, provenance rules, and measurable acceptance tests that catch failures before your users do. If you’ve been treating LLM behavior as a vibe you tune with prompts and a few happy-path evaluations, you’re already behind. 2026 belongs to teams that treat models like untrusted code and ship with enforceable, testable behavioral boundaries.
The uncomfortable truth: your model is not an API, it’s a junior operator
Founders love the story that LLMs are “just another dependency,” like Stripe or Twilio. That story is wrong in the only way that matters: those APIs don’t wake up tomorrow with new behavior because a vendor shipped a new training run. LLMs do. Even if you pin a model version, your system behavior shifts as you tweak prompts, swap tools, change retrieval, or expand context windows.
OpenAI’s GPT-4-era systems made this obvious: the same task can pass today and fail next week depending on your surrounding scaffolding. Anthropic built its brand on controllability and documented “Constitutional AI.” Google’s Gemini line is deeply integrated into Workspace and Android. Meta’s Llama ecosystem is open enough to run anywhere. Different philosophies, same operational reality: probabilistic behavior plus tool access is a new class of production risk.
Engineers are responding with more structure: OpenAI’s structured outputs / JSON mode, function calling patterns across providers, and a resurgence of typed interfaces. That’s directionally correct, but it still misses the core point. You can force JSON. You can’t force intent.
AI products fail in predictable ways because teams refuse to write down what “safe and correct” means in machine-checkable terms.
Model contracts: the missing layer between “prompt” and “product”
A model contract is a set of constraints your system enforces around an LLM, plus the executable checks that show those constraints hold. Think of it like an internal RFC plus executable checks. It lives alongside code and changes with code. It’s reviewed, tested, and deployed.
There are four parts that matter in practice.
1) Capability boundaries (what the model is allowed to do)
If your agent can send email, create tickets, initiate refunds, change user settings, or run code, you must enumerate those capabilities and put them behind explicit tool interfaces. Don’t let the model “suggest” raw API calls via text. Make tools the only way actions happen, and make tools strict.
2) Prohibited behaviors (what must never happen)
This includes obvious items (data exfiltration, policy violations) and non-obvious ones (inventing citations, guessing PII, “helpfully” expanding scope). If you can’t write it down, you can’t test it. If you can’t test it, you’re hoping.
3) Provenance and memory rules
“Memory” is where products get quietly dangerous. Users love personalization; regulators love audit trails. You need crisp rules for what can be stored, where, for how long, and how it can be used. If you use retrieval-augmented generation (RAG), you need rules for what counts as an acceptable source and how it’s attributed.
4) Acceptance tests (how you prove the above)
Not a demo. Not a few screenshots. A suite that runs in CI, fails the build, and is hard to bypass. Your contract is only as real as your ability to stop a deployment.
Key Takeaway
If your AI system can take actions, your “spec” can’t be a prompt. It has to be a contract with tests that gate releases.
The stack is converging: tool calling, structured outputs, and evals—pick your tradeoffs
The market is full of “agent frameworks,” and most are thin wrappers around the same primitives: tool calling, state, planning, and retries. The real differentiator is how they help you enforce contracts: typed tools, sandboxing, permissioning, traceability, and eval-driven development.
Table 1: Comparison of common LLM app stacks and how well they support enforceable model contracts
| Stack / Product | What it is | Strength for contracts | Tradeoff to expect |
|---|---|---|---|
| OpenAI API (function calling / structured outputs) | Commercial model API with native tool-calling patterns | Good: typed I/O, strong ecosystem, easy to standardize | Vendor dependency; behavior shaped by prompt+model choices |
| Anthropic API (tool use) | Commercial model API emphasizing controllability and safety posture | Good: clear tool-use patterns; strong for policy-driven apps | You still need your own hard guards and eval gates |
| LangChain (open-source) | Popular orchestration library for chains/agents/tools | Mixed: fast iteration; lots of integrations | Easy to create spaghetti graphs without strict interfaces |
| LlamaIndex (open-source) | RAG-focused framework: indexing, retrieval, connectors | Good for provenance rules: sources, retrieval layers, pipelines | RAG quality is operationally fragile without evals and curation |
| Microsoft Semantic Kernel | Orchestration SDK designed for “plugins” and enterprise workflows | Good: structured plugin model; fits .NET/enterprise patterns | Added complexity; still requires discipline in permissions and tests |
Notice what’s missing from most “which framework should we use?” debates: none of these automatically makes your system safe or reliable. They only make it easier to build the surface area that can fail.
What “contract-first” looks like in a real repo
Contract-first AI teams organize work around three artifacts: (1) tool schemas, (2) policy, (3) evals. Prompts exist, but they’re subordinate. The prompt is an implementation detail; the contract is the product.
Tool schemas that reject ambiguity
Your tools are the boundary between probabilistic text and deterministic systems. Treat them like public APIs with strict validation, idempotency where possible, and clear error semantics. “Be liberal in what you accept” is how you get an agent that surprises your finance team.
Use JSON Schema (or equivalent) and fail closed. If a model sends an unknown field, reject it. If it omits a required field, reject it. If it requests an action outside a permission scope, reject it. Then force the model to recover via a structured error message that’s safe to reveal.
```shell
# Example: a strict tool boundary with JSON Schema validation
# (language-agnostic pseudo-CLI)
validate-tool-call --schema refund.schema.json --input tool_call.json
# exit 1 on unknown fields, missing required fields, or invalid enums
```
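If you would rather run the check in-process than as a CLI step, here is a minimal Python sketch of the same fail-closed idea using only the standard library. The `refund` tool, its three fields, the allowed reasons, and the `ToolError` shape are all hypothetical; a real system would more likely load a JSON Schema with `additionalProperties: false` and hand it to a schema validator.

```python
from dataclasses import dataclass

# Hypothetical schema for a `refund` tool: field name -> exact expected type.
REFUND_FIELDS = {"order_id": str, "amount_cents": int, "reason": str}
ALLOWED_REASONS = frozenset({"damaged", "not_delivered", "duplicate_charge"})


@dataclass(frozen=True)
class ToolError:
    """Structured error that is safe to return to the model."""
    code: str
    field: str


def validate_refund_call(payload: dict) -> list:
    """Return every violation; an empty list means the call may proceed."""
    errors = []
    # Fail closed: unknown fields are rejected, not silently ignored.
    for name in sorted(set(payload) - set(REFUND_FIELDS)):
        errors.append(ToolError("unknown_field", name))
    for name, expected in REFUND_FIELDS.items():
        if name not in payload:
            errors.append(ToolError("missing_field", name))
        # Exact type match, so True/False is not accepted as an int.
        elif type(payload[name]) is not expected:
            errors.append(ToolError("wrong_type", name))
    reason = payload.get("reason")
    if type(reason) is str and reason not in ALLOWED_REASONS:
        errors.append(ToolError("invalid_enum", "reason"))
    amount = payload.get("amount_cents")
    if type(amount) is int and amount <= 0:
        errors.append(ToolError("out_of_range", "amount_cents"))
    return errors
```

The error codes name the field but never echo the offending value, which keeps the recovery message safe to hand back to the model.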
Policy as code, not “guidelines”
Write a policy file the same way you write a Terraform module: explicit, reviewable, diffable. Map policy to tool permissions. If your model shouldn’t email outside a domain allowlist, the tool enforces it. If your model shouldn’t read certain documents, retrieval enforces it. The model can’t be trusted to “remember” rules consistently.
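As a rough illustration of the domain-allowlist case, the sketch below keeps the rule in a small policy record and enforces it inside the tool, before any side effect. `EmailPolicy`, `send_email_tool`, and the `transport` callable are hypothetical names; in a real repo the policy would be loaded from a reviewed file rather than constructed inline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmailPolicy:
    """Hypothetical policy record, normally loaded from a reviewed, diffable file."""
    allowed_domains: frozenset
    max_recipients: int


class PolicyViolation(Exception):
    """Raised when a tool call falls outside the policy."""


def check_send_email(policy: EmailPolicy, recipients: list) -> None:
    """Raise PolicyViolation unless every recipient satisfies the policy."""
    if not recipients or len(recipients) > policy.max_recipients:
        raise PolicyViolation("recipient count outside policy")
    for address in recipients:
        # Split on the last "@"; an address without one has an empty separator.
        local, sep, domain = address.rpartition("@")
        if not sep or not local or domain.lower() not in policy.allowed_domains:
            raise PolicyViolation(f"recipient not allowed: {address}")


def send_email_tool(policy, recipients, body, transport):
    """The only path to sending mail: the check runs before the side effect."""
    check_send_email(policy, recipients)
    transport(recipients, body)
```

Because the check lives in the tool, a prompt change or a jailbreak cannot widen the allowlist; only a reviewed change to the policy can.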
Evals that gate merges, not blog posts
Everyone now claims they “do evals.” The usual reality: a notebook, a handful of examples, and a vague sense of improvement. Serious teams run evals in CI and treat regressions like failing unit tests.
Tools like OpenAI Evals (open-source), DeepEval, and promptfoo exist precisely because manual spot checks don’t scale. Even if you don’t adopt a framework, the principle is non-negotiable: keep a fixed set of adversarial and representative cases, run them on every change, and block the build when you break guarantees. A reasonable starting suite covers:
- Adversarial prompts that try to jailbreak policies relevant to your app (refund abuse, data disclosure, tool misuse).
- Retrieval trap cases where the correct answer depends on citing the right document (and refusing if sources are missing).
- Tool misuse cases where the model must ask a clarifying question rather than guessing required parameters.
- Latency and cost guardrails expressed as budgets (timeouts, max tool calls, max retries) rather than vibes.
- Regression fixtures based on real incidents you’ve had—sanitized and turned into tests.
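The case types above can be wired into a gate with very little code. The sketch below is framework-free and makes several assumptions: `AgentResult` is a hypothetical shape for what your system returns, the agent is any callable from prompt to result, and the checks (forbidden tools, a tool-call budget, a required phrase) are deliberately crude stand-ins for richer graders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentResult:
    """Hypothetical shape of what the system under test returns."""
    text: str
    tool_calls: tuple  # names of the tools the agent invoked, in order


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    forbidden_tools: frozenset = frozenset()
    must_contain: str = ""
    max_tool_calls: int = 3  # a budget, stated as a number


def run_case(agent: Callable[[str], AgentResult], case: EvalCase) -> list:
    """Return human-readable failures for one case (empty list means pass)."""
    result = agent(case.prompt)
    failures = []
    used = set(result.tool_calls) & case.forbidden_tools
    if used:
        failures.append(f"{case.name}: called forbidden tools {sorted(used)}")
    if len(result.tool_calls) > case.max_tool_calls:
        failures.append(f"{case.name}: exceeded tool-call budget")
    if case.must_contain and case.must_contain.lower() not in result.text.lower():
        failures.append(f"{case.name}: missing expected text {case.must_contain!r}")
    return failures


def gate(agent, cases) -> int:
    """Exit code for CI: 0 only if every case passes."""
    failures = [f for case in cases for f in run_case(agent, case)]
    for failure in failures:
        print("FAIL", failure)
    return 1 if failures else 0
```

In CI you would call something like `sys.exit(gate(my_agent, CASES))`, so a broken guarantee fails the job the same way a failing unit test does.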
RAG is not a feature. It’s a liability unless you treat provenance like a product requirement
Most founders add RAG because it demos well: “Look, it knows our docs.” Then the system ships and answers confidently from an outdated Confluence page, a half-migrated Notion workspace, or a PDF someone uploaded in 2019. The failure mode isn’t “the model hallucinates.” The failure mode is “your knowledge base is a mess, and the model makes it look authoritative.”
Contract-first RAG starts with provenance rules: what sources are allowed, what freshness is required, and how citations are represented in outputs. LlamaIndex and LangChain both support patterns for attaching metadata to retrieved chunks (nodes in LlamaIndex, documents in LangChain) and carrying it into responses. That’s useful, but it still doesn’t solve the governance problem: who owns source quality, and what happens when sources conflict?
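To make the enforceable part of those rules concrete, here is a small sketch of a provenance gate that sits between retrieval and generation. The `Chunk` shape, the approved source names, and the one-year freshness window are assumptions for illustration; it filters and attributes, and it does nothing about the ownership and conflict questions above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieved passage carrying the metadata the contract needs."""
    text: str
    source_system: str
    doc_id: str
    last_reviewed: date


APPROVED_SOURCES = frozenset({"handbook", "policy_repo"})  # assumed source names
MAX_AGE = timedelta(days=365)  # assumed freshness requirement


def admissible(chunks, today: date) -> list:
    """Keep only chunks from approved systems that are fresh enough."""
    return [
        c for c in chunks
        if c.source_system in APPROVED_SOURCES and today - c.last_reviewed <= MAX_AGE
    ]


def build_context(chunks, today: date):
    """Return (context, citations), or None to signal the refusal path."""
    kept = admissible(chunks, today)
    if not kept:
        return None  # no acceptable source: refuse rather than answer
    context = "\n\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(kept))
    citations = [f"[{i + 1}] {c.source_system}:{c.doc_id}" for i, c in enumerate(kept)]
    return context, citations
```

Returning `None` instead of an empty context forces the caller to handle the "no acceptable source" case explicitly, which is where the refusal behavior belongs.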
Table 2: A practical contract checklist for action-taking LLM systems
| Contract area | What to write down | What to enforce in code | How to test |
|---|---|---|---|
| Tool permissions | Allowed tools per role/workspace; allowed targets (domains, projects) | Allowlists, scopes, server-side auth checks, rate limits | Eval cases attempting forbidden actions; unit tests for scope checks |
| Output schema | Exact JSON fields for tool calls and user-visible responses | Schema validation; reject unknown fields; fail closed | Golden tests for valid/invalid payloads; fuzz invalid fields |
| Provenance | Approved source systems; citation format; freshness expectations | Retriever filters; metadata propagation; citation requirement gates | Docs with conflicting info; tests that require correct citation or refusal |
| Refusal & escalation | When to refuse, ask clarifying questions, or route to human | Server-side decision points; “human-in-the-loop” queues | Edge cases: missing params, ambiguous intent, policy conflicts |
| Observability | What to log, redact, and retain; incident response triggers | Trace IDs, tool-call logs, redaction, retention controls | Chaos tests: tool failures, timeouts, partial outages; verify safe degradation |
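The observability row is the easiest one to leave vague, so here is a hedged sketch of what "log, redact, retain" can look like at the tool-call level. The sensitive key names and the simple email pattern are assumptions; a production redactor would need a much more careful definition of what counts as sensitive.

```python
import json
import re

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
REDACT_KEYS = frozenset({"card_number", "ssn"})  # assumed sensitive field names


def redact(value):
    """Recursively mask sensitive keys and email addresses in tool arguments."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k in REDACT_KEYS else redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return EMAIL.sub("[EMAIL]", value)
    return value


def log_record(trace_id: str, tool: str, args: dict, outcome: str) -> str:
    """Build one JSON log line per tool call, keyed by trace ID."""
    record = {"trace_id": trace_id, "tool": tool, "args": redact(args), "outcome": outcome}
    return json.dumps(record, sort_keys=True)
```

Redacting at the point where the record is built, rather than downstream in the log pipeline, means the raw values never leave the process.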
The contrarian move: stop chasing “more autonomy,” start pricing and packaging “more guarantees”
Most AI roadmaps are autonomy theater: more tools, longer chains, bigger context, fewer clicks. Users clap in demos and then punish you in production. Autonomy expands the blast radius of a single bad decision.
The product strategy that wins in 2026 is boring on purpose: sell reliability. Sell controls. Sell auditability. Sell predictable behavior under stress.
This is already visible in enterprise buying behavior around Microsoft 365 Copilot and Google Workspace add-ons: the buyer isn’t just a user; it’s security, legal, and IT. If you can’t explain data handling, permission boundaries, and logging, you don’t get deployed broadly. Founders who keep treating this as “enterprise paperwork” miss the point: the controls are the product.
Design patterns that age well
- Two-phase execution: the model proposes; deterministic code approves and executes (or asks for user confirmation).
- Idempotent tools: if the model retries, you don’t double-refund or double-email.
- Limited context by default: give the model the minimum needed; expand only with explicit user intent.
- Refusal as a feature: refusal paths that are helpful, not scolding—“I can’t do that, here’s what I can do.”
- Kill switches: per-tool and per-tenant toggles that ops can flip without a redeploy.
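Three of these patterns (two-phase execution, idempotent tools, kill switches) fit in one small executor. The sketch below is an in-memory illustration under stated assumptions: `Proposal` and `Executor` are hypothetical names, the idempotency key is assumed to come from the user request rather than from the model, and a real system would keep completed keys and switch state in a durable store.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Proposal:
    """What the model proposes; nothing has happened yet."""
    tool: str
    args: dict
    idempotency_key: str  # assumed to be derived from the user request


@dataclass
class Executor:
    tools: dict                                    # tool name -> callable(**args)
    disabled: set = field(default_factory=set)     # kill switches, flipped at runtime
    completed: dict = field(default_factory=dict)  # idempotency_key -> earlier result

    def execute(self, proposal: Proposal, approve) -> dict:
        if proposal.tool not in self.tools:
            return {"status": "rejected", "reason": "unknown_tool"}
        if proposal.tool in self.disabled:
            return {"status": "rejected", "reason": "tool_disabled"}
        if proposal.idempotency_key in self.completed:
            # A retry gets the earlier result instead of repeating the side effect.
            return self.completed[proposal.idempotency_key]
        if not approve(proposal):  # deterministic rule or user confirmation
            return {"status": "rejected", "reason": "not_approved"}
        result = {"status": "done", "result": self.tools[proposal.tool](**proposal.args)}
        self.completed[proposal.idempotency_key] = result
        return result
```

The `approve` callable is where deterministic rules or a user confirmation step plug in; the model only ever produces a `Proposal`, never the side effect itself.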
What to do this quarter: write one contract and make it real
If you’re a founder or operator, don’t start with a platform rebuild. Pick one agentic workflow that can cause damage: refunds, outbound email, calendar scheduling, database writes, cloud operations, Jira automation—anything with side effects.
Then do the uncomfortable, high-ROI work: write the contract, wire the enforcement, and add eval gates. If you can’t block a deploy on contract failure, you don’t have a contract—you have documentation.
Here’s the question worth sitting with before you add another tool to your agent: what is the most embarrassing, plausible thing your system could do with this new capability—and what code will prevent it? If you can’t answer in one page and a failing test, you’re not ready to ship it.