The most expensive UI you can ship in 2026 is still a chat box.
Not because it’s hard to build. Because it’s easy to ship and hard to govern. A chat interface invites users to ask for outcomes (“make it so”), while your product still lives in the world of permissions, side effects, compliance, and blame. That mismatch is why “AI features” keep getting pulled back, throttled, or quietly relabeled as “assist.”
Here’s the contrarian position: stop treating the model as the product surface. Treat it as a compiler that translates intent into a constrained, inspectable plan. Your actual product is the agentic UI: a set of workflow affordances that make automation legible, bounded, reversible, and attributable.
Chat is a great way to start an action. It’s a terrible way to finish one.
The real problem isn’t hallucinations. It’s missing product contracts.
Engineers obsess over model quality; operators obsess over risk; founders obsess over speed. All three groups often miss the same thing: most AI products still don’t have a clear contract for what happens next.
When a user asks an LLM to “refund the last invoice,” you need crisp answers to product questions—not ML questions:
- Authority: Which identity is acting? The user? A service account? A delegated role?
- Scope: What’s in-bounds? Only invoices in the current workspace? Only those with a certain status?
- Evidence: What inputs were used? Which records were read? What context was assumed?
- Change log: What was written? What was deleted? What downstream systems were touched?
- Reversibility: Can we undo it? If not, can we compensate it?
Products that answer those questions feel “safe” even when the model is imperfect. Products that leave them unanswered feel unsafe even when the model is good.
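One way to make those five questions concrete is to refuse to run any action that doesn't carry its own answers. Here is a minimal sketch, assuming a hypothetical `ActionContract` record; the field names and the `missing_answers` helper are illustrative, not taken from any particular framework.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ActionContract:
    """Hypothetical record answering the contract questions for one action."""
    actor: str                       # Authority: which identity is acting
    scope: dict                      # Scope: e.g. {"workspace": "ws_1"}
    evidence: list = field(default_factory=list)  # Evidence: IDs of records read
    changes: list = field(default_factory=list)   # Change log: filled in as writes happen
    reversible: bool = False         # Reversibility: can we undo it?
    compensation: Optional[str] = None  # If not reversible, how we compensate

    def missing_answers(self) -> list:
        """Return the contract questions that are still unanswered before a run."""
        missing = []
        if not self.actor:
            missing.append("authority")
        if not self.scope:
            missing.append("scope")
        if not self.evidence:
            missing.append("evidence")
        if not self.reversible and self.compensation is None:
            missing.append("reversibility")
        return missing


contract = ActionContract(
    actor="agent:billing-assistant",
    scope={"workspace": "ws_1"},
    evidence=["inv_123"],
)
print(contract.missing_answers())  # ['reversibility']
```

The refund above knows who is acting, where, and on what evidence, but it has no undo or compensation path yet, so a strict product would not let it run.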
“Agentic UI” is just workflow design under uncertainty
Call it agents, copilots, assistants—doesn’t matter. Users want the outcome, your business needs the constraints, and the model provides probabilistic glue in the middle.
An agentic UI is the interface that lets users:
- see the plan before it runs
- edit the plan using native controls (not prompt gymnastics)
- approve with clear scope
- watch execution with checkpoints
- inspect artifacts afterward (what changed, why, and by whom)
This is not a theoretical stance. Microsoft’s GitHub Copilot moved from “suggest code” toward “Copilot Edits” and task-oriented flows in editors; Atlassian’s Rovo positions itself around finding, summarizing, and acting across Jira and Confluence; Salesforce pushes Einstein features inside CRM objects where approvals, fields, and audit histories already exist. These companies are converging on the same product truth: the UI surface needs to be structured even if the language input is not.
The chat box is the new “import CSV”
Early SaaS had an “import CSV” button as a universal escape hatch. It worked, but it was a tax on everyone: messy data in, messy outcomes out. The modern equivalent is “ask the bot.” It’s universal, but it punts on the contract: what data is it using, what does it change, and what happens if it’s wrong?
A chat box is fine as an entry point. It’s irresponsible as the only control plane.
What you should copy from real products (and what you should stop copying)
Most teams copy the wrong part of popular AI products: the chat UI and the marketing language. Copy the mechanics instead.
Copy: “Draft mode” and explicit review steps
Google Workspace and Microsoft 365 both pushed “draft” semantics into writing flows: suggestions are proposed as artifacts, not executed as actions. In developer tools, GitHub Copilot’s best moment is still the one where it suggests and you accept or edit—because acceptance is a clear boundary.
For operator-grade actions—refunds, deletes, permission changes, infra updates—draft mode is table stakes. If your AI can mutate state without an explicit approval step, you’ve built a demo, not a product.
Copy: “Artifacts” you can point to later
OpenAI’s ChatGPT introduced “Custom GPTs” and later workflows that center on reusable behavior; Anthropic’s Claude emphasizes longer context and careful writing. Both work best when the output is an artifact: a doc, a plan, a diff, a checklist. Artifact-first design makes audit and collaboration possible. A pure chat transcript does not.
Stop copying: infinite tool access
There’s a fashion for “connect every tool” via OAuth and let the model figure it out. That’s how you end up with an assistant that can read everything and explain nothing. Product people should treat tool access like production database access: least privilege, scoped tokens, and predictable query shapes.
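In code, least privilege can be as plain as a deny-by-default lookup. The sketch below assumes a hypothetical `ScopedToken` and a hand-written `TOOL_SCOPES` table; the tool and scope names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedToken:
    """Hypothetical token: the scopes granted to one agent in one workspace."""
    workspace_id: str
    scopes: frozenset


# Each tool declares the single scope it needs (illustrative names).
TOOL_SCOPES = {
    "read_invoice": "invoices:read",
    "refund_invoice": "invoices:refund",
}


def authorize(token: ScopedToken, tool: str, workspace_id: str) -> bool:
    """Allow a call only for a known tool, a granted scope, and the token's own workspace."""
    required = TOOL_SCOPES.get(tool)
    if required is None:
        return False  # unknown tools are denied by default
    return required in token.scopes and workspace_id == token.workspace_id


token = ScopedToken("ws_1", frozenset({"invoices:read"}))
print(authorize(token, "read_invoice", "ws_1"))    # True
print(authorize(token, "refund_invoice", "ws_1"))  # False
```

The check runs on the server, so no amount of clever prompting can widen what the token allows.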
Table 1: Common agent building blocks (real offerings) and what they’re actually good for
| Layer | Examples | Strength | Product risk if misused |
|---|---|---|---|
| LLM API | OpenAI API, Anthropic API, Google Gemini API | Fast iteration on language + reasoning tasks | Treating text output as execution without verification |
| Model gateway / observability | Azure AI Studio, Amazon Bedrock, LangSmith | Centralize prompts, traces, evaluations, vendor routing | Thinking this replaces product-level audit and approvals |
| Orchestration / agent frameworks | LangChain, LlamaIndex, Microsoft Semantic Kernel | Tool calling, retrieval patterns, multi-step flows | Overbuilding brittle autonomy instead of clear UX |
| Workflow automation | Zapier, Make, n8n | Deterministic triggers/actions; reliable connectors | Stuffing probabilistic decisions into deterministic pipes |
| Identity & access | Okta, Microsoft Entra ID, Google Cloud IAM | Roles, policies, SCIM, audit logs | Ignoring this and shipping a “shared agent” with god mode |
Build the “three panels” UI: Plan, Proof, and Playback
If you’re building an agent that does real work, you need three surfaces. Not as a framework slide—literally as product UI your customers can use.
Panel 1: Plan (what will happen)
Take the user’s request and produce a structured plan they can approve. This can look like a checklist, a diff, a proposed set of API calls, or a Jira-style workflow. The key is that the user can see scope, edit steps, and remove actions.
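A plan the user can edit is, underneath, just a data structure with a few rules. A minimal sketch, assuming hypothetical `Plan` and `PlanStep` types; the one design choice worth noting is that any edit clears a prior approval.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    """One proposed action; names here are illustrative."""
    tool: str
    target_id: str
    enabled: bool = True  # the user can switch a step off without re-prompting


@dataclass
class Plan:
    request: str
    steps: list = field(default_factory=list)
    approved: bool = False

    def remove(self, target_id: str) -> None:
        """Drop every step touching a target; any edit invalidates prior approval."""
        self.steps = [s for s in self.steps if s.target_id != target_id]
        self.approved = False

    def runnable_steps(self) -> list:
        """Only approved plans yield steps, and only the enabled ones."""
        if not self.approved:
            return []
        return [s for s in self.steps if s.enabled]


plan = Plan("refund late invoices", [
    PlanStep("refund_invoice", "inv_1"),
    PlanStep("refund_invoice", "inv_2"),
])
plan.remove("inv_2")   # the user edits the plan with a native control
plan.approved = True   # explicit approval after the edit
print([s.target_id for s in plan.runnable_steps()])  # ['inv_1']
```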
Panel 2: Proof (why this is the plan)
Show the evidence. Which records did you read? Which policy or rule did you apply? If you used retrieval, show the sources with stable identifiers (document title + link + timestamp if your system supports it). If you can’t show sources, limit what the agent is allowed to do. That’s not philosophy; it’s basic accountability.
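The "no sources, fewer powers" rule is easy to enforce mechanically. This is a sketch under assumptions: `Source` is a hypothetical shape for a citation, and the read-only tool names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """A stable pointer to something the agent read (hypothetical shape)."""
    title: str
    link: str
    retrieved_at: Optional[str] = None  # ISO timestamp, if the system records one


READ_ONLY_TOOLS = {"search_docs", "read_invoice"}  # illustrative tool names


def can_propose(tool: str, sources: list) -> bool:
    """Mutating tools need at least one citable source; read-only tools do not."""
    return tool in READ_ONLY_TOOLS or len(sources) > 0


policy = Source("Refund policy", "https://example.com/policy")
print(can_propose("refund_invoice", []))        # False
print(can_propose("refund_invoice", [policy]))  # True
```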
Panel 3: Playback (what actually happened)
After execution, provide a timeline: step started, step finished, tool called, record changed, result returned, errors encountered, retries attempted. This is the difference between “AI did something weird” and “Step 3 failed because the invoice status changed between read and write.”
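A Playback panel needs an append-only timeline behind it. The sketch below keeps events in memory and exports them as JSON lines; `Playback`, `PlaybackEvent`, and the event kinds are hypothetical names, and a real system would persist the log somewhere durable.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PlaybackEvent:
    run_id: str
    step: int
    kind: str    # e.g. "step_started", "tool_called", "record_changed", "error", "retry"
    detail: str
    at: str      # ISO-8601 UTC timestamp


class Playback:
    """Append-only timeline for one run; hypothetical and in-memory only."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events = []

    def record(self, step: int, kind: str, detail: str) -> None:
        now = datetime.now(timezone.utc).isoformat()
        self.events.append(PlaybackEvent(self.run_id, step, kind, detail, now))

    def errors(self) -> list:
        return [e for e in self.events if e.kind == "error"]

    def export(self) -> str:
        """One JSON object per line, so operators can search it or ship it elsewhere."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)


log = Playback("run_42")
log.record(3, "tool_called", "refund_invoice(inv_123)")
log.record(3, "error", "invoice status changed between read and write")
print(log.errors()[0].detail)
```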
Key Takeaway
If your agent can’t produce a plan the user can edit, evidence the user can inspect, and a playback log the operator can debug, it’s not an agent. It’s a roulette wheel with a chat UI.
Guardrails that aren’t theater: permissions, budgets, and deterministic escape hatches
Most “AI safety” in product is theater: long system prompts, a content policy link, and vibes. Real guardrails are mechanical. They’re enforced in code and visible in UI.
1) Permissioning: the agent must be a first-class identity
Stop running actions as “whoever is logged in.” Create an agent identity with explicit scopes. Use the same primitives you already use for humans and services: roles, audit logs, token rotation, and least privilege. If you’re in an enterprise environment, expect your customers to ask about SSO (Okta, Entra ID), SCIM provisioning, and audit exports. If you can’t answer, you’re not enterprise-ready.
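Making the agent a first-class identity mostly means it shows up by name in the audit trail. A minimal sketch, assuming a hypothetical `AgentIdentity` and an `on_behalf_of` field for the human who delegated the work; neither comes from a specific identity provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentIdentity:
    """Hypothetical agent principal, separate from the human who invoked it."""
    agent_id: str
    role: str
    scopes: frozenset


def audit_entry(agent: AgentIdentity, on_behalf_of: str, action: str) -> dict:
    """Attribute the action to the agent and record who delegated it."""
    if action not in agent.scopes:
        raise PermissionError(f"{agent.agent_id} lacks scope {action!r}")
    return {
        "actor": agent.agent_id,
        "role": agent.role,
        "on_behalf_of": on_behalf_of,
        "action": action,
    }


agent = AgentIdentity("agent:support-triage", "ticket-closer", frozenset({"tickets:close"}))
entry = audit_entry(agent, on_behalf_of="user:dana", action="tickets:close")
print(entry["actor"], "for", entry["on_behalf_of"])  # agent:support-triage for user:dana
```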
2) Budgets: constrain blast radius with explicit limits
Budgets are not only about API cost. They’re about operational impact: how many records can this agent touch per run, how many emails can it send, how many tickets can it close, how many deletions can it propose. Your UI should expose those ceilings as product settings, not hidden config.
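A budget is a counter that says no. Here is a sketch, assuming a hypothetical `RunBudget` with invented limit names; the point is that capacity is reserved before the side effect happens, not tallied afterward.

```python
class BudgetExceeded(Exception):
    """Raised when a run asks for more side effects than its ceiling allows."""


class RunBudget:
    """Per-run ceilings on side effects; limit names are illustrative."""

    def __init__(self, limits: dict) -> None:
        self.limits = dict(limits)
        self.used = {name: 0 for name in limits}

    def spend(self, name: str, amount: int = 1) -> None:
        """Reserve capacity before the side effect happens, or refuse."""
        if name not in self.limits:
            raise BudgetExceeded(f"no budget defined for {name!r}")
        if self.used[name] + amount > self.limits[name]:
            raise BudgetExceeded(f"{name}: limit of {self.limits[name]} would be exceeded")
        self.used[name] += amount

    def remaining(self, name: str) -> int:
        return self.limits[name] - self.used[name]


budget = RunBudget({"records_touched": 50, "emails_sent": 2})
budget.spend("emails_sent")
budget.spend("emails_sent")
print(budget.remaining("emails_sent"))  # 0
```

A third `budget.spend("emails_sent")` raises `BudgetExceeded`, and that message is something the UI can show verbatim next to the setting that caused it.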
3) Deterministic escape hatches: always provide a non-AI path
This is the part teams hate because it feels like admitting defeat. Do it anyway. Every agentic flow needs a deterministic equivalent: a form, a bulk action, a scripted workflow, a saved view. If the model is down (or just wrong), the user still completes the job.
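The escape hatch can live in the same code path as the agentic one. This sketch assumes two hypothetical functions: a deterministic bulk action and a wrapper that falls back to it when the model-backed proposer raises. It only covers the "model is down" case; "model is wrong" is handled by letting the user choose the manual path, which here means calling without a proposer.

```python
def close_tickets_manually(ticket_ids):
    """Deterministic path: the same bulk action a form or saved view would trigger."""
    return [{"ticket_id": t, "action": "close", "via": "manual"} for t in ticket_ids]


def close_tickets(ticket_ids, proposer=None):
    """Use the model-backed proposer if given; fall back to the manual path on failure."""
    if proposer is not None:
        try:
            return proposer(ticket_ids)
        except Exception:
            # Model unavailable or erroring: the job must still be completable.
            # A real system would also log the failure here.
            pass
    return close_tickets_manually(ticket_ids)


def unavailable_model(ticket_ids):
    raise RuntimeError("model unavailable")


print(close_tickets(["t1"], unavailable_model))
# [{'ticket_id': 't1', 'action': 'close', 'via': 'manual'}]
```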
Table 2: A product checklist for agentic actions that touch real systems
| Area | Non-negotiable UI element | Engineering implementation | What breaks if you skip it |
|---|---|---|---|
| Approval | Preview plan + explicit “Run” button | Two-phase execution (plan → apply) | Accidental destructive actions and blame disputes |
| Scope | Visible filters/targets (which records) | Server-side constraints, not prompt text | Agent touches wrong tenant, project, or dataset |
| Audit | Playback timeline + exportable log | Structured traces with request/response metadata | No way to debug, comply, or learn from failures |
| Rollback | Undo / revert where possible | Compensating transactions, versioning | One bad run becomes permanent damage |
| Human override | “Do it manually” path always available | Deterministic workflow or CRUD UI kept intact | Outages turn into total work stoppage |
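The Approval and Rollback rows fit together as one small pattern: plan without writing, apply only after approval, and keep enough information to compensate. A minimal sketch, assuming a toy in-memory `ledger` that maps invoice IDs to refunded amounts; all function names are hypothetical.

```python
def plan_refunds(invoices, max_amount):
    """Phase 1: compute proposed changes only; nothing is written here."""
    return [
        {"invoice_id": inv["id"], "amount": inv["amount"]}
        for inv in invoices
        if inv["amount"] <= max_amount
    ]


def apply_refunds(plan, ledger, approved):
    """Phase 2: write only after explicit approval; return the applied steps for undo."""
    if not approved:
        raise PermissionError("plan has not been approved")
    applied = []
    for step in plan:
        ledger[step["invoice_id"]] = ledger.get(step["invoice_id"], 0) + step["amount"]
        applied.append(step)
    return applied


def compensate(applied, ledger):
    """Reverse applied steps in reverse order (a compensating transaction)."""
    for step in reversed(applied):
        ledger[step["invoice_id"]] -= step["amount"]


invoices = [{"id": "inv_1", "amount": 20}, {"id": "inv_2", "amount": 500}]
refund_plan = plan_refunds(invoices, max_amount=100)
ledger = {}
applied = apply_refunds(refund_plan, ledger, approved=True)
print(ledger)  # {'inv_1': 20}
compensate(applied, ledger)
print(ledger)  # {'inv_1': 0}
```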
What this looks like in shipping software: one concrete flow
Pick a workflow your customers already do, where the pain is real and the state changes are bounded. Example: “Close low-quality support tickets with a refund offer draft, but only for orders under a defined threshold and only if the customer has no open chargebacks.”
A chat box can’t safely do that. An agentic UI can.
- User intent: user asks to clean up tickets for a time range.
- Plan: system generates a list of candidate tickets + proposed actions (close, tag, draft response, refund suggestion) with per-item toggles.
- Proof: each ticket shows the signals used (order status, prior contacts, policy checks) with links into your own objects.
- Approval: user approves in batches; high-risk actions require extra confirmation.
- Playback: timeline shows what was done; failed items are retriable with a clear error reason.
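The Plan and Proof steps of this flow can be sketched as a plain filter that explains itself. The example assumes a hypothetical `Ticket` shape and reads the rule as "a ticket is a candidate only if the order is under the threshold and the customer has no open chargebacks"; your actual policy may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    ticket_id: str
    order_total: float
    open_chargebacks: int


def select_candidates(tickets, refund_threshold):
    """Split tickets into proposals (with proof) and skips (with a reason)."""
    candidates, skipped = [], []
    for t in tickets:
        if t.order_total >= refund_threshold:
            skipped.append((t.ticket_id, "order total at or above threshold"))
        elif t.open_chargebacks > 0:
            skipped.append((t.ticket_id, "customer has open chargebacks"))
        else:
            candidates.append({
                "ticket_id": t.ticket_id,
                "actions": ["close", "tag", "draft_response", "refund_suggestion"],
                "enabled": True,  # per-item toggle shown in the plan
                "proof": {
                    "order_total": t.order_total,
                    "open_chargebacks": t.open_chargebacks,
                },
            })
    return candidates, skipped


tickets = [
    Ticket("t1", 30.0, 0),
    Ticket("t2", 900.0, 0),
    Ticket("t3", 30.0, 1),
]
candidates, skipped = select_candidates(tickets, refund_threshold=100.0)
print([c["ticket_id"] for c in candidates])  # ['t1']
print(skipped)
```

Returning the skipped tickets with reasons matters as much as the candidates: it lets the user see what the system chose not to touch.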
A minimal tool-calling contract (how to keep tools from becoming chaos)
Tool calling gets dangerous when tools are vague. Keep tools boring: narrow inputs, explicit outputs, and server-side validation. Here’s a simplified schema pattern that product teams can understand and engineers can enforce.
{
  "tool": "refund_invoice",
  "inputs": {
    "invoice_id": "inv_123",
    "amount": "FULL",
    "reason_code": "LATE_DELIVERY"
  },
  "constraints": {
    "max_amount": "FULL",
    "allowed_reason_codes": ["LATE_DELIVERY", "DUPLICATE", "CANCELLED"],
    "requires_approval": true
  },
  "expected_output": {
    "refund_id": "string",
    "status": "SUCCESS|FAILED",
    "error": "string|null"
  }
}
This is not about making the model smarter. It’s about making your system strict. If the model proposes an out-of-policy reason code, the call fails fast at server-side validation, before anything executes, and the UI tells the user exactly why.
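A server-side check for the `refund_invoice` contract might look like the sketch below. It is an illustration under assumptions: `validate_refund_call` is a hypothetical function, and the allowed reason codes are held on the server rather than trusted from whatever the model sends.

```python
# Server-side policy for this tool; never read from the model's payload.
ALLOWED_REASON_CODES = {"LATE_DELIVERY", "DUPLICATE", "CANCELLED"}


def validate_refund_call(call: dict, approved: bool) -> list:
    """Return human-readable violations; an empty list means the call may run."""
    if call.get("tool") != "refund_invoice":
        return ["unknown tool"]
    errors = []
    inputs = call.get("inputs", {})
    if not inputs.get("invoice_id"):
        errors.append("invoice_id is required")
    reason = inputs.get("reason_code")
    if reason not in ALLOWED_REASON_CODES:
        errors.append(f"reason_code {reason!r} is not allowed")
    if not approved:
        errors.append("refund_invoice requires approval")
    return errors


call = {
    "tool": "refund_invoice",
    "inputs": {"invoice_id": "inv_123", "amount": "FULL", "reason_code": "GOODWILL"},
}
print(validate_refund_call(call, approved=True))
# ["reason_code 'GOODWILL' is not allowed"]
```

The returned strings are meant to be shown to the user as-is, which is what turns a rejected call into an explanation instead of a mystery.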
The 2026 product bet: the moat is governance UX, not model choice
Model quality will keep improving and prices will keep compressing. That’s not where durable differentiation lives. The durable layer is everything you build around the model: identity, approvals, auditability, error recovery, and operator tooling.
Founders keep asking, “Which model should we standardize on?” The better question is: “What do we do when the model is wrong?” Your answer should be visible in the UI, enforced by the backend, and understandable to a compliance person on a bad day.
One action you can take this week: pick your highest-risk AI workflow and add a Playback panel. If you can’t reconstruct what happened from your own logs—inputs, tools called, records changed, and approvals—don’t ship more autonomy. Ship that.