People keep shipping “agents” that can talk, browse, and click around—then act surprised when the first production incident is a double-billed customer, a rogue permission, or an irreproducible decision. That’s not bad luck. It’s the predictable outcome of treating language models like employees instead of like software.
Here’s the contrarian take: the next wave isn’t “more autonomous agents.” It’s transactional AI—LLMs constrained by the same disciplines that made payments, ads, and infrastructure reliable: strict interfaces, deterministic side effects, and auditability. If you’re a founder or operator, your competitive edge won’t be a clever prompt. It’ll be a clean transaction boundary between model output and system state.
“Agents” broke at first contact with the real world: identity, money, and blame
A demo agent is a one-off performance. Production is a system that must be correct on a Tuesday night, under partial outage, with a junior engineer on call and a compliance team that wants a paper trail.
Most agent stacks still treat the model as both planner and executor. That’s backwards. The planner can be probabilistic. The executor has to be boring.
In real systems, three forces collide:
- Identity: OAuth scopes, short-lived tokens, delegated access, and per-tenant policy are not optional. If your “agent” can’t prove which principal acted, it’s not shippable.
- Money: Billing, refunds, credits, procurement approvals, and invoice disputes demand idempotency, trace IDs, and reconciliation. LLMs don’t do reconciliation; ledgers do.
- Blame: If a customer asks “why did this happen?” you need a human-readable chain of custody: input → policy → tool calls → side effects. “The model decided” is not an answer.
Engineers know this already. The industry’s mistake is pretending that adding “tool use” magically solves it. Tool use is table stakes; transaction semantics are the product.
Transactional AI: treat the model like an untrusted planner, not an operator
Think of an LLM as a component that proposes intents. Your system decides whether those intents become writes.
“Transactional AI” means the side effects happen inside a controlled runtime that enforces:
- Explicit contracts: JSON schemas, typed tool interfaces, and strict validation before any call leaves your boundary.
- Idempotency: Every mutation has an idempotency key and a replay-safe handler.
- Atomicity (where possible): Either the workflow completes to a known checkpoint or it compensates cleanly.
- Durable logs: Append-only event trails with correlation IDs.
- Policy gates: The model can’t grant itself permissions. Policies live outside the model, evaluated by code.
That’s not theory. It’s how payment processors, cloud control planes, and CI/CD systems survive. The new thing is applying that discipline to LLM-driven workflows.
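To make the idempotency and durable-log items concrete, here is a minimal sketch, assuming an in-memory dictionary in place of a real database. The names `IdempotentExecutor` and `apply_credit`, and the event fields, are hypothetical and chosen only for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class IdempotentExecutor:
    """Runs each mutation at most once per idempotency key (in-memory sketch)."""

    results: dict[str, Any] = field(default_factory=dict)  # key -> stored result
    events: list[dict] = field(default_factory=list)       # append-only event trail

    def run(self, key: str, correlation_id: str,
            mutation: Callable[[], Any]) -> Any:
        if key in self.results:
            # Replay: return the stored result instead of repeating the side effect.
            self.events.append({"correlation_id": correlation_id,
                                "key": key, "type": "replayed"})
            return self.results[key]
        result = mutation()
        self.results[key] = result
        self.events.append({"correlation_id": correlation_id,
                            "key": key, "type": "applied"})
        return result


# Hypothetical side effect: credit an account balance.
balances = {"acct_1": 0}


def apply_credit() -> int:
    balances["acct_1"] += 500
    return balances["acct_1"]


executor = IdempotentExecutor()
correlation_id = str(uuid.uuid4())
first = executor.run("credit-acct_1-001", correlation_id, apply_credit)
retry = executor.run("credit-acct_1-001", correlation_id, apply_credit)  # e.g. a retry after a timeout
```

The retry returns the stored result and the balance is credited once. A production version would need the check-and-store to be atomic (for example, a unique constraint on the key), which this sketch does not attempt.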
Key Takeaway
Stop asking “Can the model do the task?” Start asking “Can we bound the model’s output to a transaction we can validate, replay, and explain?”
What “transactional” looks like in an LLM workflow
A transactional AI workflow has a narrow set of allowed actions. The model chooses among them, but it never improvises new privileges or hidden side effects.
Example: a support agent that issues refunds. The LLM can draft a refund plan, but the final step is a signed, validated API call executed by a service that enforces limits (amount thresholds, account age, fraud rules) and writes to a ledger. The model is not the ledger.
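One way to picture the "narrow set of allowed actions" is an explicit registry that the runtime consults before running anything. This is a sketch under assumptions: the tool names (`lookup_order`, `draft_refund`), the proposal shape (`{"tool": ..., "args": ...}`), and the `dispatch` helper are all hypothetical.

```python
import inspect
from typing import Callable

# Hypothetical registry: the only actions the runtime will ever execute.
ALLOWED_TOOLS: dict[str, Callable[..., dict]] = {}


def tool(name: str):
    """Register a function as an allowed action under a fixed name."""
    def register(fn):
        ALLOWED_TOOLS[name] = fn
        return fn
    return register


@tool("lookup_order")
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "delivered"}  # stubbed read


@tool("draft_refund")
def draft_refund(order_id: str, amount_cents: int) -> dict:
    # Produces a proposal only; issuing the refund belongs to a separate guarded service.
    return {"order_id": order_id, "amount_cents": amount_cents, "state": "proposed"}


def dispatch(proposal: dict) -> dict:
    """Run a model-proposed action only if it names a registered tool."""
    name = proposal.get("tool")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        return {"status": "rejected", "error": f"unknown tool: {name!r}"}
    fn = ALLOWED_TOOLS[name]
    try:
        # Reject missing or extra arguments before the function runs.
        bound = inspect.signature(fn).bind(**proposal.get("args", {}))
    except TypeError as exc:
        return {"status": "rejected", "error": str(exc)}
    return {"status": "ok", "result": fn(*bound.args, **bound.kwargs)}


ok = dispatch({"tool": "draft_refund", "args": {"order_id": "o_1", "amount_cents": 1200}})
bad = dispatch({"tool": "grant_admin", "args": {"user": "me"}})
```

The second call is rejected because `grant_admin` was never registered. Note that this only checks argument names; type and range validation would sit in a schema layer on top.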
The tooling market is converging on “agent runtimes,” but most teams still ship spaghetti
By 2026, it’s normal for teams to use frameworks like LangChain and LlamaIndex for retrieval and orchestration, and to evaluate with purpose-built tools like LangSmith (LangChain’s platform) or Braintrust. For deployment, you see managed options like OpenAI’s API and Azure OpenAI Service, and open-source models via Hugging Face or Ollama in dev setups.
But here’s the uncomfortable truth: frameworks can make it easier to build bad systems. They reduce friction, so you glue together a planner, a retriever, and a tool executor—then discover your “agent” has no coherent boundary for errors, retries, or policy.
Table 1: Comparison of common orchestration/evaluation components (what they’re good for—and what they don’t solve)
| Tool | Best at | Operational gap you still own | Notes |
|---|---|---|---|
| LangChain | Chains, tool calling patterns, integrations | Idempotency, durable state, access control | Great for prototyping; needs hard runtime boundaries in prod |
| LlamaIndex | RAG pipelines, connectors, indexing abstractions | Authorization to data, audit trails, data retention rules | Treat “retrieval” as a governed data product, not a library call |
| LangSmith | Tracing, debugging, dataset-based evaluation | Defining correctness for side effects and compensations | Visibility helps; it doesn’t define safe execution |
| Braintrust | Eval harnesses, prompt/model comparisons, scoring workflows | Real-world incident response and rollback design | Strong for measuring; production failures are often transactional, not linguistic |
| OpenAI API / Azure OpenAI Service | Managed model access, enterprise controls (esp. via Azure) | Your app’s authorization model and tool safety | Hosted models don’t remove your responsibility for execution correctness |
The teams that win treat orchestration frameworks like they treat a web framework: useful, but not a substitute for architecture.
The missing primitive: an “LLM write-ahead log” for side effects
If you want a clean mental model, steal one from databases: write-ahead logging. Before the system mutates anything, it records the intent and the planned steps.
For LLM systems, a practical version looks like this:
1. Normalize input (strip secrets, attach tenant, attach user principal).
2. Generate a plan as structured data (not prose), including proposed tool calls.
3. Validate the plan against a schema and policy (allowed tools, allowed fields, amount limits, PII constraints).
4. Persist the plan with a correlation ID (durable store).
5. Execute tool calls in a runtime that enforces idempotency and retries.
6. Persist outcomes as events, including failures and compensations.
Most teams do steps 2 and 5, then pray. The “persist the plan” step is the difference between a cool demo and an operable system.
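The validate, persist, execute, and record steps can be sketched roughly as follows. This assumes the input is already normalized and the plan has already been parsed into a dict; the in-memory `wal` list stands in for a durable store, and `ALLOWED_TOOLS`, `MAX_TITLE_LEN`, `run_workflow`, and `fake_create_ticket` are hypothetical names.

```python
import json
import uuid
from typing import Callable

ALLOWED_TOOLS = {"create_ticket"}  # hypothetical policy: one permitted action
MAX_TITLE_LEN = 120                # hypothetical field limit
wal: list[str] = []                # stand-in for a durable, append-only store


def append(record: dict) -> None:
    """Append one JSON record; a real store would commit or fsync here."""
    wal.append(json.dumps(record, sort_keys=True))


def validate(plan: dict) -> list[str]:
    """Return policy violations; an empty list means the plan may run."""
    calls = plan.get("tool_calls")
    if not isinstance(calls, list) or not calls:
        return ["plan has no tool_calls list"]
    errors = []
    for call in calls:
        if call.get("tool") not in ALLOWED_TOOLS:
            errors.append(f"tool not allowed: {call.get('tool')}")
        elif len(str(call.get("args", {}).get("title", ""))) > MAX_TITLE_LEN:
            errors.append("title too long")
    return errors


def run_workflow(tenant: str, principal: str, plan: dict,
                 execute: Callable[[dict, str], dict]) -> str:
    correlation_id = str(uuid.uuid4())
    base = {"correlation_id": correlation_id, "tenant": tenant, "principal": principal}
    errors = validate(plan)
    if errors:
        append({**base, "type": "plan_rejected", "errors": errors})
        return correlation_id
    # Persist the accepted plan before any side effect happens.
    append({**base, "type": "plan_accepted", "plan": plan})
    for i, call in enumerate(plan["tool_calls"]):
        key = f"{correlation_id}:{i}"  # idempotency key per tool call
        try:
            outcome = execute(call, key)
            append({**base, "type": "call_succeeded", "key": key, "outcome": outcome})
        except Exception as exc:
            # Failures are events too; stop so a human or compensator can take over.
            append({**base, "type": "call_failed", "key": key, "error": str(exc)})
            break
    return correlation_id


def fake_create_ticket(call: dict, key: str) -> dict:
    # Hypothetical executor; a real one would pass `key` to the ticketing API.
    return {"ticket_id": "T-1"}


plan = {"tool_calls": [{"tool": "create_ticket", "args": {"title": "Printer offline"}}]}
cid = run_workflow("tenant_a", "user_42", plan, fake_create_ticket)
records = [json.loads(line) for line in wal]
```

Because the plan is written before execution starts, an operator who finds a `plan_accepted` record with no matching outcome knows exactly which calls were intended and which idempotency keys to retry with.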
A concrete pattern: typed tool calls plus policy checks
This is the kind of code you want: the model can propose, but code decides. Use JSON Schema or Pydantic models. Enforce policy in the executor, not in prompt text.

    # Example: validate a model-proposed tool call before executing it.
    # (Python sketch using Pydantic; `user` and `payments` stand in for your
    # own auth principal and payments/ledger client.)
    from pydantic import BaseModel, Field, ValidationError


    class RefundRequest(BaseModel):
        order_id: str
        amount_cents: int = Field(gt=0)  # a zero or negative refund is never valid
        reason: str
        idempotency_key: str


    def policy_check(user, req: RefundRequest) -> None:
        # Enforce permissions and limits in code, not in prompt text.
        if not user.can("refund:create"):
            raise PermissionError("Not allowed")
        if req.amount_cents > user.refund_limit_cents:
            raise PermissionError("Amount exceeds limit")


    def execute_refund(user, payments, tool_payload: dict):
        # Malformed payloads and policy violations are rejected the same way,
        # before anything reaches the payments service.
        try:
            req = RefundRequest(**tool_payload)
            policy_check(user, req)
        except (ValidationError, PermissionError) as e:
            return {"status": "rejected", "error": str(e)}
        # Only now call the payments/ledger service, passing the idempotency key.
        return payments.refund(
            order_id=req.order_id,
            amount_cents=req.amount_cents,
            idempotency_key=req.idempotency_key,
        )

Notice what’s missing: “be careful” instructions. Guardrails are code and policy, not vibes.
LLMs are great at proposing actions. They’re terrible at being accountable for actions.
RAG isn’t your moat; governed retrieval is
By now, retrieval-augmented generation is standard. Everyone can chunk PDFs, embed them, and stuff top-k into a prompt. That stopped being interesting the day OpenAI, Anthropic, and Google made large-context models widely available and every vector database put “RAG” on the homepage.
The fight moved to governed retrieval:
- Entitlements: The retriever must respect per-user and per-group access, not just per-tenant.
- Freshness: Some data is only correct if it’s near-real-time (pricing, inventory, incidents). Stale context is a silent failure mode.
- Provenance: You need to know which doc chunk influenced an answer, and whether that chunk was approved.
- Retention: If your company has deletion obligations, your embeddings and caches are part of the data surface area.
- Prompt injection resistance: Treat retrieved text as untrusted input. Your system prompt is not a firewall.
Vendors can sell you vector search. They can’t sell you your org’s access model. That’s why governed retrieval ends up being a competitive differentiator, especially in B2B SaaS with complex roles.
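Here is a rough sketch of what a governed filter might look like after vector search returns candidates. It assumes entitlements, an approval flag, and a timestamp are stored alongside each chunk; the `Chunk` fields, `governed_filter`, and `build_context` are hypothetical names, not any vendor's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]  # entitlements stored with the chunk
    updated_at: datetime
    approved: bool


def governed_filter(chunks, user_groups: set[str],
                    max_age: timedelta, now: datetime) -> list[Chunk]:
    """Keep only chunks the user may see, that are fresh, and that are approved."""
    kept = []
    for c in chunks:
        if not (c.allowed_groups & user_groups):
            continue  # entitlement check per user group, not just per tenant
        if now - c.updated_at > max_age:
            continue  # stale context is dropped rather than trusted
        if not c.approved:
            continue
        kept.append(c)
    return kept


def build_context(chunks) -> str:
    """Tag each chunk with its ID so answers can be traced back to sources."""
    return "\n".join(f"[source:{c.chunk_id}] {c.text}" for c in chunks)


now = datetime(2026, 1, 10, tzinfo=timezone.utc)
chunks = [
    Chunk("c1", "Refund window is 30 days.", frozenset({"support"}), now - timedelta(days=1), True),
    Chunk("c2", "Salary bands.", frozenset({"hr"}), now - timedelta(days=1), True),
    Chunk("c3", "Old pricing.", frozenset({"support"}), now - timedelta(days=90), True),
    Chunk("c4", "Unreviewed draft.", frozenset({"support"}), now - timedelta(days=1), False),
]
visible = governed_filter(chunks, {"support"}, timedelta(days=30), now)
context = build_context(visible)
```

Tagging sources gives you provenance, but it does not make the text trustworthy. Injection resistance still comes from the execution side: whatever the retrieved text says, writes go through validation and policy.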
What founders should build for in 2026: boring interfaces, strict state, and human override
If you’re building an AI-heavy product, your first design doc should read like a payments doc, not a chatbot doc.
Table 2: Transactional AI checklist (design-time decisions that prevent production incidents)
| Area | Decision to make | Concrete implementation | Failure mode it prevents |
|---|---|---|---|
| Identity | Who is the acting principal for each tool call? | OAuth with scoped tokens; service-to-service auth; per-tenant policies | “Agent did it” ambiguity; privilege escalation |
| State | Where does workflow state live? | Durable store + correlation IDs; event log of tool calls and outcomes | Irreproducible behavior; cannot audit or replay |
| Safety gates | What is allowed to mutate, and under what constraints? | Schema validation + policy engine checks before execution | Unexpected writes; prompt injection turning into actions |
| Idempotency | How do retries behave? | Idempotency keys per mutation; de-dup on server side | Double charges; duplicate tickets; repeated emails |
| Human override | Which actions require approval or review? | Queued actions; two-person rule for high-risk writes; explicit “review screen” | One-shot catastrophic operations |
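The human-override row can be sketched as a small approval gate. This is an illustration under assumptions: the threshold, the two-approver count, and the `PendingAction` type are hypothetical, and a real system would persist the queue rather than hold it in memory.

```python
from dataclasses import dataclass, field

AUTO_APPROVE_LIMIT_CENTS = 5_000  # hypothetical threshold for unreviewed writes
REQUIRED_APPROVERS = 2            # two-person rule for high-risk writes


@dataclass
class PendingAction:
    action_id: str
    amount_cents: int
    requested_by: str
    approvers: set[str] = field(default_factory=set)

    def needs_review(self) -> bool:
        return self.amount_cents > AUTO_APPROVE_LIMIT_CENTS

    def approve(self, reviewer: str) -> None:
        if reviewer == self.requested_by:
            raise PermissionError("requester cannot approve their own action")
        self.approvers.add(reviewer)  # a set, so repeat approvals do not count twice

    def may_execute(self) -> bool:
        if not self.needs_review():
            return True
        return len(self.approvers) >= REQUIRED_APPROVERS


small = PendingAction("a1", 1_000, requested_by="planner")
big = PendingAction("a2", 50_000, requested_by="planner")
big.approve("alice")
big.approve("alice")  # duplicate approval from the same reviewer
blocked = big.may_execute()
big.approve("bob")
released = big.may_execute()
```

The executor would call `may_execute()` immediately before the write, so an action that is still waiting on reviewers never reaches the payments or config service.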
A hard line that makes products better
Make the model’s job: propose. Make the system’s job: decide and execute. That division of labor yields three product benefits people underrate:
- Faster iteration: You can swap models, prompts, or retrieval strategies without rewriting the execution layer.
- Cheaper incidents: Failures are caught at validation gates, not after a side effect hits a customer.
- Better enterprise sales: Security and compliance teams understand policies, logs, and scopes. They do not understand “trust our prompt.”
And yes, it’s less sexy than a fully autonomous agent. That’s why it works.
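The first benefit, swapping planners without touching execution, can be sketched with a small interface. The `Planner` protocol, both planner classes, and `handle` are hypothetical; `CannedModelPlanner` returns a fixed proposal where a real implementation would call a model.

```python
from typing import Protocol


class Planner(Protocol):
    def propose(self, request: str) -> dict: ...


class KeywordPlanner:
    """Hypothetical rule-based planner, useful as a baseline or fallback."""

    def propose(self, request: str) -> dict:
        return {"tool": "create_ticket", "args": {"title": request[:80]}}


class CannedModelPlanner:
    """Stand-in for a model-backed planner; returns a fixed proposal here."""

    def propose(self, request: str) -> dict:
        return {"tool": "close_account", "args": {"reason": request}}


def handle(planner: Planner, request: str) -> dict:
    """Execution layer: identical no matter which planner produced the proposal."""
    proposal = planner.propose(request)
    if proposal.get("tool") != "create_ticket":
        return {"status": "rejected", "tool": proposal.get("tool")}
    return {"status": "executed", "title": proposal["args"]["title"]}


a = handle(KeywordPlanner(), "Printer offline on floor 3")
b = handle(CannedModelPlanner(), "Printer offline on floor 3")
```

Both planners go through the same gate, so changing the model or prompt changes what gets proposed, never what is allowed to run.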
A prediction worth testing this quarter
By the time you’re reading this in 2026, “agentic” features will be everywhere, and most of them will feel the same: chat UI, tool calls, some memory. The differentiator will be whether your product can safely do real work—writes, not words—without forcing humans to babysit every step.
Here’s the next action: pick one workflow in your product that currently ends as “draft text” (an email, a ticket summary, a plan). Convert it into a transactional workflow with exactly one guarded side effect (create the ticket, issue the refund, apply the config), with a durable log and idempotency. Ship that. Learn from the incident you don’t have.
If you can’t do that without fear, your problem isn’t model quality. Your problem is that you’re still building demos.