Most “AI features” are a liability dressed up as a demo.
We all know why: the model is probabilistic, the UI is deterministic, and the support queue is where the truth lands. Yet teams keep shipping chat boxes, auto-write buttons, and “copilot” side panels with the same product spec style they used for filters and exports. That mismatch is what’s breaking products in 2026—not model quality.
The product primitive that matters now isn’t “AI in the workflow.” It’s an AI contract: a user-facing, enforceable definition of what the system will do, what it will not do, what it will cite, how it will ask for confirmation, what it will store, and what it will fall back to when uncertainty spikes. If your AI can’t explain its boundary conditions, your product doesn’t have a feature—it has a roulette wheel.
The shift: from model selection to behavior governance
The last few years trained teams to obsess over model selection: OpenAI’s GPT-4-class models, Anthropic’s Claude, Google’s Gemini, Meta’s Llama, Mistral. Pick your flavor and benchmark your prompt. That’s table stakes now. The hard part is shipping behavior that stays consistent across model updates, price changes, outages, and policy shifts.
Two public realities forced the issue.
First: regulation stopped being hypothetical. The EU AI Act entered into force in August 2024, with obligations phasing in over the following years, and those obligations land on providers and deployers depending on the system and use case. Even if you’re not in Europe, many of your enterprise buyers are. They’re asking procurement questions that sound like compliance but are really about product reliability: risk classification, documentation, human oversight, logging, and how you handle user data.
Second: “model drift” became product drift. Vendors change model behavior, tools, system prompts, and safety layers. OpenAI has repeatedly updated model families and product surfaces (ChatGPT, the API, function calling/tool use). Anthropic and Google have done the same. If your product promise is implicitly “whatever the model does,” you’re not shipping a product—your vendor is.
“AI contract” is not a policy doc. It’s product surface area.
Most teams hear “contract” and think legal. Wrong layer. This is a product spec that users can feel.
An AI contract has three properties:
- It’s explicit. The user can see the rules: sources required, actions gated, memory rules, and what counts as “unknown.”
- It’s enforceable. The system can refuse, ask for confirmation, route to a deterministic method, or escalate to a human.
- It’s stable under change. You can swap models and preserve behavior because the contract is implemented in orchestration, tooling, and UI—not vibes.
Think about GitHub Copilot versus “a code chat tool.” Copilot’s contract is built around in-editor suggestions, developer control, and a workflow that keeps the developer as the executor. Or think about Microsoft 365 Copilot: the sales pitch is “grounded in your data,” but the real contract is the set of permissions, compliance boundaries, and administrative controls inside Microsoft Graph and Purview. The product isn’t the model; it’s the governance surface.
Key Takeaway
If your “AI feature” doesn’t have an explicit boundary, it isn’t a feature. It’s a support burden waiting for a customer with a lawyer.
What users actually want: predictable failure modes
Users don’t demand perfection. They demand that failure looks sane. If the system is unsure, it should say so. If it’s about to take an irreversible action, it should ask. If it can’t cite a source, it should switch modes or stop.
That’s the contract: not “the AI is smart,” but “the AI is safe to trust for this class of task.”
Shipping AI is shipping uncertainty. Your product either contains it—or exports it to customers.
The contract stack: the five layers you actually need
Most teams try to solve this with prompts. Prompts are the thinnest layer; they’re also the easiest to break. Build a stack where each layer constrains the next.
- Intent framing (UI). Don’t ask “How can I help?” Ask “Draft a reply,” “Summarize this thread,” “Create a PRD from these notes.” Constrain the task.
- Policy (product rules). What’s allowed, what’s disallowed, what needs confirmation, what needs citations, what must be deterministic.
- Grounding (data access). Retrieval from a known corpus, with clear permissioning. If you can’t ground it, you shouldn’t sound confident.
- Tooling (actions). Function calling / tool use to execute operations. Every tool needs scopes, audit logs, and guardrails.
- Fallback & escalation. Route to search, a rules engine, a template, or a human. “I don’t know” is a feature.
This is why the “agent” conversation is often backwards. Agents are not a capability. Agents are a contract risk. If you can’t define and enforce your tool boundaries, you don’t get to ship an agent that clicks buttons and moves money.
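To make the policy layer concrete, here is a minimal sketch of product rules expressed as data rather than prompt text. Everything in it is an assumption for illustration: the intent names, the `Policy` fields, and the `decide` function are hypothetical, and a real product would have more rules and more nuance.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ANSWER = "answer"
    ASK_CONFIRMATION = "ask_confirmation"
    FALLBACK = "fallback"
    REFUSE = "refuse"


@dataclass(frozen=True)
class Policy:
    """Product rules for one intent (hypothetical shape)."""
    allowed: bool = True
    needs_citation: bool = False
    needs_confirmation: bool = False


# Hypothetical intents. The UI offers only these, which is the intent-framing layer.
POLICIES = {
    "summarize_thread": Policy(),
    "answer_policy_question": Policy(needs_citation=True),
    "send_email": Policy(needs_confirmation=True),
    "medical_diagnosis": Policy(allowed=False),
}


def decide(intent: str, sources: list[str], confirmed: bool = False) -> Outcome:
    """Apply the product rules before any model output reaches the user."""
    policy = POLICIES.get(intent)
    if policy is None or not policy.allowed:
        return Outcome.REFUSE            # unknown or disallowed intent
    if policy.needs_citation and not sources:
        return Outcome.FALLBACK          # nothing to ground on: switch modes
    if policy.needs_confirmation and not confirmed:
        return Outcome.ASK_CONFIRMATION  # gate the action on explicit approval
    return Outcome.ANSWER


print(decide("answer_policy_question", sources=[]))
print(decide("send_email", sources=[], confirmed=True))
```

Because the rules live in a table, swapping the model behind them does not change what the product promises. That is the point of keeping policy out of the prompt.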
Tooling reality in 2026: the market converged, but the tradeoffs are sharp
The orchestration ecosystem matured fast: LangChain made “chains” mainstream; LlamaIndex anchored retrieval; OpenAI pushed tool calling; Anthropic leaned into tool use and long-context; Google pushed Gemini across Workspace; Microsoft built Copilot across its suite; AWS and Google Cloud turned “foundation model access” into a platform category (Amazon Bedrock, Vertex AI).
But product teams still pick stacks based on developer comfort, not contract requirements. That’s backwards. Choose tooling based on the boundaries you must enforce: data governance, auditability, deterministic fallbacks, and multi-model resilience.
Table 1: Comparison of common LLM app stacks by product contract needs (not model quality)
| Stack | Strength for contracts | Tradeoff | Best fit |
|---|---|---|---|
| OpenAI API (tool calling) + your app | Strong developer ergonomics for tools; fast iteration | Vendor changes can shift behavior; you own governance glue | Consumer + SMB apps with tight action scopes |
| Anthropic API (tool use) + your app | Clear tool-use patterns; good fit for controlled workflows | Same governance burden; model choice is narrower | Operations tools where refusal/verification matters |
| Amazon Bedrock | Enterprise posture; model choice across providers; AWS governance primitives | More platform wiring; less “one SDK to rule them all” simplicity | Regulated industries; AWS-native orgs |
| Google Vertex AI (Gemini + tooling) | Strong integration with Google Cloud; enterprise controls | Ecosystem lock-in; product teams must understand GCP IAM deeply | GCP-first companies; data-heavy products |
| Self-hosted open models (Meta Llama, Mistral) + vLLM | Max control over data path; stable behavior under your release process | You own infra, safety layers, evals, incident response | Privacy-sensitive products; predictable workloads |
Contrarian take: multi-model is overrated unless you have a contract
“We’re multi-model” is a common pitch. Without an AI contract, it just means inconsistent UX. Different refusal styles, different verbosity, different citation habits, different tool-use reliability. Users feel it immediately.
Multi-model only becomes a strength once the contract layer normalizes behavior: same UI constraints, same policy checks, same tooling schemas, same fallback rules, and a test suite that asserts the product promise.
Make the contract testable: treat prompts like code, treat behavior like an API
The fastest way to tell if a team is serious: ask where the tests are.
You don’t need fancy eval theater. You need a small suite that locks in the behaviors you promised: “must cite,” “must ask before sending,” “must not store,” “must refuse medical diagnosis,” “must not reveal secrets,” “must summarize within a structure.” Then you run it every time you touch prompts, retrieval, tools, or models.
In the open ecosystem, tools like Giskard, TruLens, and Ragas exist for evaluation patterns, and teams also wire up their own harnesses. The point isn’t the tool. The point is that “works on my prompt” is not a release criterion.
    # Minimal contract-style checks (pseudo-harness)
    # Store a small set of prompts + expected invariants.
    cases:
      - name: "refund_policy_requires_citation"
        input: "What is your refund policy?"
        invariants:
          - must_include: ["Source:"]
          - must_not_include: ["I guarantee", "always"]
      - name: "send_email_requires_confirmation"
        input: "Email Alex that the invoice is overdue and send it."
        invariants:
          - must_include: ["Draft", "Confirm"]
          - must_not_call_tools: ["send_email"]
Two notes that matter in production:
- Invariant tests beat golden outputs. You’re checking properties (citations present, tool not called, confirmation requested), not exact wording.
- Logs are part of the contract. If a tool was invoked, you need an audit trail. If retrieval happened, you need to know what was retrieved. Without that, you can’t debug customer complaints.
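One way to evaluate invariants like the ones in the pseudo-harness is a small property checker. The sketch below is an assumption, not a reference implementation: the `Response` shape and the `check_invariants` function are hypothetical, and the invariants are written as Python dicts that mirror the case file so the example runs without a YAML parser.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """What the harness observed for one case (hypothetical shape)."""
    text: str
    tools_called: list[str] = field(default_factory=list)


def check_invariants(response: Response, invariants: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means the case passed."""
    violations: list[str] = []
    for invariant in invariants:
        for rule, values in invariant.items():
            if rule == "must_include":
                violations += [f"missing text: {v!r}"
                               for v in values if v not in response.text]
            elif rule == "must_not_include":
                violations += [f"forbidden text: {v!r}"
                               for v in values if v in response.text]
            elif rule == "must_not_call_tools":
                violations += [f"forbidden tool call: {v!r}"
                               for v in values if v in response.tools_called]
            else:
                # Fail loudly on a typo in the case file instead of silently passing.
                violations.append(f"unknown rule: {rule!r}")
    return violations


# Invariants copied from the "send_email_requires_confirmation" case.
SEND_EMAIL_INVARIANTS = [
    {"must_include": ["Draft", "Confirm"]},
    {"must_not_call_tools": ["send_email"]},
]

good = Response(text="Draft: Hi Alex, the invoice is overdue. Confirm to send?")
bad = Response(text="Done, email sent.", tools_called=["send_email"])

print(check_invariants(good, SEND_EMAIL_INVARIANTS))  # no violations
print(check_invariants(bad, SEND_EMAIL_INVARIANTS))   # three violations
```

The checks are plain, case-sensitive substring matches, so a rule like `must_not_include: ["always"]` would also flag "not always". That bluntness is acceptable for a first suite, but it is a design choice worth revisiting once the cases grow.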
The hard edge: provenance, consent, and “memory” that users can control
Most product teams still treat memory like a novelty. Users treat it like surveillance until proven otherwise.
OpenAI’s ChatGPT has offered memory features; Microsoft and Google have deep context inside their suites; Notion AI and Slack AI moved toward enterprise-ready patterns. The direction is clear: the assistant remembers. The question is whether your product gives the user control that feels real.
A credible AI contract exposes:
- What gets stored (and where): prompts, outputs, tool calls, retrieved documents.
- Who can see it: the user, their admin, support staff, third-party vendors.
- How to delete it: not “contact support,” but a product action with predictable effect.
- What trains what: whether user data is used to improve models (and what opt-out looks like).
- What’s session-only: a mode that doesn’t persist beyond the immediate task.
This is not just privacy virtue signaling. It’s how you win deals. Enterprise buyers are sick of hand-wavy answers. Consumer users are sick of being surprised.
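The memory controls listed above can be sketched as two separate stores behind one small interface. This is a hypothetical design for illustration: the `MemoryStore` class, its method names, and the event labels are all assumptions, and it uses in-memory dicts where a real product would use durable, access-controlled storage.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Session-only and long-term memory kept in separate stores (hypothetical)."""
    long_term_enabled: bool = False  # the user-facing memory toggle
    session: dict[str, str] = field(default_factory=dict)
    long_term: dict[str, str] = field(default_factory=dict)
    events: list[tuple[str, str]] = field(default_factory=list)  # (event, key) only

    def remember(self, key: str, value: str, persist: bool = False) -> str:
        """Store a value and return the name of the store it landed in."""
        if persist and self.long_term_enabled:
            self.long_term[key] = value
            self.events.append(("write_long_term", key))
            return "long_term"
        # With the toggle off, a persist request degrades to session-only.
        self.session[key] = value
        self.events.append(("write_session", key))
        return "session"

    def forget(self, key: str) -> bool:
        """The 'forget this' action: remove the key from both stores."""
        removed = key in self.session or key in self.long_term
        self.session.pop(key, None)
        self.long_term.pop(key, None)
        self.events.append(("delete", key))
        return removed

    def end_session(self) -> None:
        """Session-only data does not outlive the task."""
        self.session.clear()
        self.events.append(("session_end", "*"))


store = MemoryStore()
print(store.remember("tone", "formal", persist=True))  # toggle is off
store.long_term_enabled = True
print(store.remember("name", "Alex", persist=True))
store.end_session()
print(store.forget("name"), store.long_term)
```

The event list records keys and actions but not values, so the audit trail does not become a second copy of the data the user asked to delete.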
Table 2: AI contract checklist (ship this as product requirements, not a slide)
| Contract area | User-visible signal | Enforcement mechanism | What to log |
|---|---|---|---|
| Citations & provenance | “Source” links next to claims; highlight quoted text | Require retrieval for certain intents; block confident claims without sources | Retrieved doc IDs, snippets, ranking, timestamps |
| Action gating | Preview + Confirm before irreversible actions | Tool scopes; two-step confirmations; allowlist tools per role | Tool name, parameters, user ID, approval event |
| Data access | “Using: Drive folder X / Slack channel Y” indicator | IAM-based connectors; permission checks at retrieval time | Connector used, permission decision, query |
| Memory controls | Memory on/off toggle; “forget this” action | Separate stores for session vs long-term; user/admin policies | Write events, deletes, retention policy applied |
| Refusal & escalation | Clear “can’t do that” with next-best option | Policy classifier; human handoff; deterministic fallback | Refusal category, escalation path, user resolution |
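As an illustration of the action-gating row, the sketch below combines a per-role tool allowlist, a two-step confirmation for irreversible tools, and an audit log entry for every decision. The role names, tool names, and the `ToolGate` class are hypothetical. The gate only decides; actually running the tool is left to the caller.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical allowlist per role, plus the tools treated as irreversible.
ALLOWED_TOOLS = {
    "agent": {"draft_email"},
    "manager": {"draft_email", "send_email"},
}
IRREVERSIBLE = {"send_email"}


@dataclass
class ToolGate:
    audit_log: list[dict] = field(default_factory=list)

    def request(self, user_id: str, role: str, tool: str,
                params: dict, approved: bool = False) -> str:
        """Decide whether a tool call may proceed, and log the decision."""
        if tool not in ALLOWED_TOOLS.get(role, set()):
            decision = "denied"
        elif tool in IRREVERSIBLE and not approved:
            decision = "needs_confirmation"  # show Preview + Confirm first
        else:
            decision = "allowed"
        # Log every request, including denials, so complaints can be traced.
        self.audit_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "tool": tool,
            "params": params,
            "approved": approved,
            "decision": decision,
        })
        return decision


gate = ToolGate()
print(gate.request("u1", "agent", "send_email", {"to": "alex"}))
print(gate.request("u2", "manager", "send_email", {"to": "alex"}))
print(gate.request("u2", "manager", "send_email", {"to": "alex"}, approved=True))
```

Logging the denied and unconfirmed requests alongside the allowed ones is deliberate: the audit trail should show what the assistant tried to do as well as what it was permitted to do.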
What to do Monday: write one contract and ship it end-to-end
If this sounds heavy, good. It’s heavier than sprinkling a model into the UI. That’s why it’s defensible.
Pick one workflow where AI is already creeping in—support replies, sales emails, internal incident summaries, code review suggestions, invoice follow-ups. Then write the contract in plain language first, as if it’s a user-facing promise. After that, implement the enforcement and the logs. Only then worry about model tweaks.
Here’s the litmus test: could your support team answer, instantly and consistently, “What does the AI do here, and what does it never do?” If not, you haven’t shipped a product. You shipped a mystery.
The prediction worth sitting with: by the end of 2026, “AI features” will be priced like commodities, and “AI contracts” will be priced like trust. Which one is your roadmap actually building?