The most common product failure of the AI era isn’t “bad model choice.” It’s treating a language model like a feature instead of treating it like an untrusted runtime.
You can watch the mistake in public. Teams ship a chat surface, wire it to a model, sprinkle in “tools,” and call it done. Then the product starts doing the two things users hate most: it hallucinates confidently and it refuses tasks unpredictably. Everyone blames the model. The real issue is that the product never defined a contract.
In 2026, the winning “AI products” won’t look like chatbots. They’ll look like software with a strict tool API, deterministic state, explicit permissions, and traces that make sense. The model becomes a planner and a router. Your product becomes the authority.
The contrarian take: prompts are not product surface area
Prompting is a developer convenience. It’s not a durable interface contract. Users don’t care what your prompt says; they care what the system does—and whether it does it every time.
Look at where serious tooling has gone: OpenAI added function calling and later standardized “tools” patterns across their APIs; Anthropic pushed “tool use” and “computer use” workflows; Google’s Gemini stack emphasizes tool integrations and structured output; Microsoft built Copilot around Graph-connected actions with admin controls; Amazon’s Bedrock leans into model choice but still expects you to build guardrails and orchestration. The center of gravity is clear: structured calls, not free-form chat.
Yet product teams keep shipping “natural language” as if it’s a spec. Natural language is not a spec. It’s an input modality. Treat it like a keyboard: powerful, messy, and untrusted.
“The purpose of computing is insight, not numbers.” — Richard Hamming
Hamming’s point maps cleanly here: the purpose of a model isn’t prose, it’s correct action and useful outcomes. Prose is often the exhaust.
Your LLM stack is an untrusted runtime—design like it
If you’ve built distributed systems, the mental model is familiar. LLMs are non-deterministic. They can be coerced. They can fail silently. They can produce output that looks valid but is wrong. That’s not “AI risk,” that’s just an unreliable component in your architecture.
So design it like one:
- Assume output can be adversarial. Prompt injection is not theoretical; it’s a predictable property of systems that ingest untrusted text and then execute downstream actions.
- Assume it will be inconsistent. Even with temperature controls, model updates and hidden system changes can alter behavior. If your product relies on “it usually answers like this,” you don’t have a product.
- Assume it will be unavailable or rate-limited. If you don’t have graceful degradation, your UX is a single point of failure tied to someone else’s uptime and policies.
- Assume it will be expensive at the wrong times. Without budgets, caching, and bounded work, one user can accidentally trigger a costly cascade.
- Assume it will lie. Not out of malice—out of optimization pressure to produce plausible text. Your system must detect and contain that.
Key Takeaway
Build the model integration the way you’d build payments: strict inputs, strict outputs, auditable state transitions, and explicit permissions. Nobody ships “just vibes” for payments.
The hard contract: tools, schemas, permissions, and state
If you want reliability, stop asking the model to be reliable. Ask it to propose a plan in a constrained language, then execute only what passes validation.
1) Tools are the product API—treat them as first-class
A “tool” is just an internal API endpoint the model can request. The model should never directly mutate real systems. Your tool layer should. That tool layer needs to look like an API you’d be proud to publish: versioned, documented, permissioned, and observable.
Real teams already have the building blocks: JSON Schema, OpenAPI, gRPC/Protobuf, policy engines (like Open Policy Agent), and standard auth patterns. The AI part is just the planner. Your job is to make the planner safe.
2) Structured output isn’t a nice-to-have; it’s your safety rail
Every tool request from the model should be validated against a schema. If it fails validation, you either ask for a repair (with the exact validation error) or route to a fallback path.
Don’t accept “close enough.” “Close enough” is how you get a tool call that deletes the wrong record, emails the wrong person, or posts a message to the wrong channel.
3) Permissions are product design, not an infra detail
Microsoft learned this the hard way with Copilot-style assistants: enterprise buyers immediately ask, “What can it see?” and “What can it do?” If your answer is “it respects user permissions,” you’re not done. You need explicit scoping: which connectors, which resources, which actions, under what conditions, with what audit trail.
Products that win will expose permissions as understandable UX. Users should be able to tell, in one glance, which tools are armed.
4) State must be deterministic, even if language isn’t
Let the model talk. Don’t let it own the state. Your product should keep the authoritative task state in a database: steps completed, artifacts created, approvals granted, and pending actions. If you can’t reconstruct what happened from logs and state, you can’t debug it, secure it, or support it.
Table 1: Toolcalling orchestration options teams actually use (and the tradeoffs that matter)
| Approach | Best for | Where it breaks | Notable real options |
|---|---|---|---|
| Single-model “function calling” loop | Tight workflows with a small toolset | Tool explosion, messy retries, weak observability without extra work | OpenAI tool/function calling; Anthropic tool use; Google Gemini function calling |
| Graph-based agent orchestration | Multi-step tasks with branching and checkpoints | Can turn into spaghetti if state and policies aren’t explicit | LangGraph (LangChain); LlamaIndex workflows |
| Deterministic workflow engine + LLM “planner” | Regulated or high-stakes actions requiring approvals | More upfront engineering; less “magic” in demos | Temporal; AWS Step Functions; Azure Durable Functions |
| RAG-centric assistant (retrieval + chat) | Q&A and knowledge navigation | Falls apart when users expect actions, not answers | Azure AI Search + Copilot patterns; Amazon Bedrock Knowledge Bases; Elasticsearch vector search |
| “Computer use” / UI automation agent | Legacy systems without APIs; repetitive internal ops | Brittle UI; hard to secure; requires strong sandboxing | Anthropic computer use; Microsoft Playwright (as the automation layer) |
What “good” looks like in real products (not demos)
The market is full of demos that look competent and behave like slot machines. Operator-grade products behave differently: they explain what they’re about to do, ask for confirmation when it matters, and fail in bounded ways.
Replace “assistant chat” with three explicit modes
Most products should separate these experiences instead of mashing them into one textbox:
- Draft mode: the model generates text or a plan, but cannot take actions.
- Action mode: the model can request tool calls, but each call is validated and logged; sensitive calls require explicit approval.
- Report mode: the system generates a post-run report from actual execution logs and artifacts—not from the model’s memory of what it “thinks” happened.
Users immediately understand these modes because they map to real work: propose, execute, document.
Make approvals a product primitive
If the action has an external side effect—sending email, posting to Slack, creating invoices, modifying production configs—ship an approval step. Not because users love friction; because they hate surprise.
GitHub pull requests are the model here: a clean diff, a reviewer, an audit trail. For AI actions, the “diff” is often a set of intended tool calls with parameters, plus the resources they will touch.
Design for refusal as a first-class UX
LLMs refuse for policy reasons, safety filters, and ambiguous prompts. If refusal breaks the workflow, your product is brittle. The fix isn’t “find a less strict model.” The fix is to route refusals into alternate paths:
- Ask for missing inputs (“Which workspace should I post to?”)
- Offer a manual action button (“Create draft message”)
- Defer to a deterministic template (“Use standard incident update format”)
- Escalate to a human review queue
The product work nobody wants: evaluation, tracing, and policy
“We’ll evaluate later” is how AI features die in production. Evaluations aren’t academic; they’re your regression test suite for a non-deterministic dependency.
Instrument everything you execute
If a tool call happens, log it as if it were a financial transaction: tool name, parameters, user identity, permission context, resource identifiers, timestamps, and downstream results. If your logs are “model said X,” you can’t operate the system.
Tools like LangSmith (LangChain), Arize Phoenix, Weights & Biases Weave, and OpenTelemetry-based tracing patterns exist because this is now a standard production problem: you need spans across model calls, retrieval, tool execution, and UI actions.
Write evals that reflect your product risk
Most teams start by scoring “answer quality.” That’s not where the product risk is. The risk is in actions: wrong recipient, wrong record, wrong permissions, wrong tool, wrong ordering, missing approvals.
So your eval set should include adversarial and operational cases: prompt injection attempts in retrieved documents, ambiguous user instructions, stale data, missing permissions, and “looks correct but isn’t” tool parameters.
Table 2: A practical contract checklist for shipping toolcalling features
| Contract area | What to specify | Implementation artifact | Failure behavior |
|---|---|---|---|
| Tool schemas | Inputs/outputs, required fields, allowed enums, versioning | JSON Schema or OpenAPI; strict validators | Reject + repair request with exact validation errors |
| Permissions | Who can call which tools on which resources | RBAC/ABAC rules; OPA policies; scoped OAuth tokens | Block + explain; offer request-access workflow |
| Approvals | Which actions require confirmation, and what the user reviews | UI “action diff”; queued execution; reviewer identity | Pause execution; provide editable plan |
| State & idempotency | Task state machine; retry rules; dedupe keys | DB-backed state; idempotency tokens; workflow engine | Safe retry; never double-execute side effects |
| Observability & audit | What gets logged; trace IDs; retention; redaction | Structured logs; OpenTelemetry spans; access logs | Fail closed on missing policy context |
A minimal “tool contract” spec (steal this pattern)
If you want a north star, make it impossible to ship a tool without these fields:
- Name and version (tools change; you need compatibility and migration)
- Schema (inputs/outputs, required fields, types, constraints)
- Permission scope (who can call it, and on what)
- Side effects (what it mutates, what it sends, what it creates)
- Idempotency strategy (how you prevent double execution)
- Audit fields (what to log, how to redact)
# Example: tool call validation flow (pseudocode)
request = model.propose_tool_call(user_input, context)
validate_schema(request)
check_policy(user, request.tool, request.resource)
if request.requires_approval:
approval = get_user_approval(diff=request.preview)
if not approval: abort()
result = execute_tool(request, idempotency_key=request.idempotency_key)
log_audit_trail(request, result)
return render_result(result)
A 2026 prediction: the UI will be language, the product will be policy
Natural language interfaces will keep spreading because they’re a better default than nested menus for many tasks. But language won’t be the moat. The moat will be everything underneath: tool coverage, contracts, permissions, traces, admin controls, and a workflow model that fits how work actually happens.
Most startups chasing “AI agent” positioning will get trapped competing on model vibes. The durable companies will do something less sexy: ship a tool contract and enforce it harder than their competitors are willing to.
Your next action is simple and uncomfortable: pick one workflow in your product where the assistant can cause real damage—then write down the exact contract for every tool it can touch. If you can’t do that on a single page, you’re not building an agent. You’re building a slot machine with a send button.