Most “AI agents” you see in 2026 are not agents. They’re workflows with a language model stapled on top — and they fail in the same predictable way: they can’t reliably finish. They start strong, talk confidently, trigger a couple of APIs, then drift, loop, or quietly skip the hard step (the one that needed a real invariant).
The industry mistake is treating tool use as the finish line. It’s not. Tool use is the demo. The hard part is building systems that stay correct under partial failures, rate limits, schema drift, permission boundaries, and human review — without turning every run into a bespoke incident.
“The purpose of computing is insight, not numbers.” — Richard Hamming
Agents are the inverse problem: you want correct numbers (state, side effects, compliance), not vibes. Insight is cheap now. Side effects are expensive.
The uncomfortable truth: LLMs are not the product; the runtime is
Founders keep pitching “an agent that does X.” Engineers keep shipping a prompt plus a few tools. Operators keep inheriting a support queue of edge cases. The missing piece is a runtime that can make an LLM behave like a bounded, auditable, stoppable process.
Look at the direction of travel from the largest vendors and the open ecosystem:
- OpenAI pushed hard on function calling and structured outputs (because raw text is not a control plane).
- Anthropic made “tool use” and long-context reliability central in Claude releases, and positioned the model as something you wrap in policy and process.
- LangChain popularized agent patterns, then the community learned (the hard way) that unbounded agent loops are operational debt.
- LlamaIndex turned “RAG” into an engineering discipline: ingestion, chunking, retrieval, evaluation — not just prompting.
- Microsoft pushed Semantic Kernel as an orchestration layer; it’s an admission that prompts alone don’t compose into systems.
The contrarian position: the next wave of durable AI companies won’t be “model-first.” They’ll be runtime-first. The moat isn’t a secret prompt; it’s the set of constraints, state machines, evaluators, and audit trails that make the model safe to let near money, customers, or production infrastructure.
Stop building “agents.” Start building bounded workers with contracts.
If you want an LLM to operate in the real world, you need to treat it like an unreliable collaborator — brilliant at synthesis, weak at invariants — and wrap it with contracts it can’t talk its way around.
Three contracts that matter more than your model choice
1) A state contract: every run has an explicit state object. No hidden state in chat history. No “the model remembers.” Persist state in your database like you would any other workflow system.
2) A side-effect contract: all side effects are explicit, idempotent, and logged. “Send email” is not a string in a transcript; it’s a call with a request id, a dry-run mode, and a replay story.
3) An evaluation contract: you have a machine-checkable definition of “done” and “acceptable.” Not “sounds good.” This is where most teams give up — and where the winners get compounding advantage.
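As a minimal sketch of the state contract: the run state can be a plain serializable object whose transitions are checked against an explicit table. The `RunState` shape, the status names, and the transition table below are assumptions made for illustration, not part of any particular framework.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical statuses and legal transitions; adapt to your own workflow.
ALLOWED_TRANSITIONS = {
    "PENDING": {"RUNNING", "FAILED"},
    "RUNNING": {"WAITING_REVIEW", "DONE", "FAILED"},
    "WAITING_REVIEW": {"RUNNING", "FAILED"},
    "DONE": set(),    # terminal
    "FAILED": set(),  # terminal
}


@dataclass(frozen=True)
class RunState:
    run_id: str
    status: str = "PENDING"
    step: int = 0
    facts: dict = field(default_factory=dict)  # everything the run "knows"

    def transition(self, new_status: str) -> "RunState":
        """Return a new state, or raise if the move is not in the table."""
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        return RunState(self.run_id, new_status, self.step + 1, dict(self.facts))

    def to_json(self) -> str:
        """Serialize for storage in a database row."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "RunState":
        return cls(**json.loads(raw))
```

Because the state round-trips through JSON, it can live in an ordinary database row, and a run can be resumed from storage without consulting the chat history.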
Key Takeaway
If you can’t write down your agent’s state model and idempotency story, you’re not building an agent. You’re building a slot machine with API keys.
The new stack: orchestration, tools, memory, evals — and a refusal to free-run
“Agent” became shorthand for “LLM picks tools.” That’s table stakes. The durable pattern is: orchestrator decides the allowed moves; model proposes; system verifies; tools execute; evaluators gate progress. The orchestrator — not the model — is in charge.
Table 1: Practical comparison of popular agent/orchestration approaches (2026 reality: mix and match)
| Layer | Representative options | Best at | Watch-outs |
|---|---|---|---|
| Orchestration | LangChain, LlamaIndex, Microsoft Semantic Kernel | Composing steps, tool routing, integrations | Easy to create sprawling chains; you still need strong state and eval discipline |
| Model gateway | OpenAI, Anthropic, Google (Gemini), Amazon Bedrock, Azure OpenAI | Access to frontier models, managed scaling, policy controls | Vendor constraints, model churn; portability requires an abstraction layer |
| Tool execution | Internal microservices, serverless functions, Temporal (workflow engine) | Reliable retries, idempotency, long-running tasks | If you skip workflow primitives, you’ll reinvent them under outage pressure |
| Memory & retrieval | Postgres + pgvector, Elasticsearch, OpenSearch, Pinecone, Weaviate | RAG, semantic search, entity recall | Retrieval without evaluation yields confident wrong answers at scale |
| Evaluation & tracing | LangSmith, Arize Phoenix, Weights & Biases (LLM tracing), OpenTelemetry (general) | Debugging, regression tests, prompt/model comparisons | Teams instrument late; then “agent reliability” becomes folklore |
The point of the table isn’t to pick winners. It’s to force a design decision: are you building a chatbot that sometimes acts, or an operational system with a language interface? If it’s the second, you need workflow machinery (Temporal or equivalents), plus observability (traces, not transcripts), plus evaluation gates.
RAG is now a liability unless you treat it like a product
RAG moved from “smart hack” to default architecture. Good. Now the bad news: most teams still treat retrieval as a magic wand. They throw docs into a vector store, retrieve the top-k chunks, and call it “enterprise-ready.” It’s not.
What breaks in production (and why founders underestimate it)
Ingestion drift: your data sources change structure. Confluence pages get reorganized. Google Drive permissions change. PDFs get replaced. If your ingestion pipeline isn’t monitored like a core service, your agent quietly starts hallucinating because the truth disappeared.
Semantic mismatch: embeddings retrieve “similar” text, not “authoritative” text. Similarity is not governance. Your retrieval layer must encode trust: canonical sources, freshness, and access policy.
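One way to encode trust, sketched under assumptions: treat access policy and freshness as hard filters, and give canonical sources a ranking boost on top of the similarity score. The `Chunk` fields, the source names, the age limit, and the boost value are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    similarity: float          # score from the vector store, assumed in [0, 1]
    source: str                # logical source name
    updated: date              # last time the source document changed
    allowed_groups: frozenset  # groups permitted to read this chunk


CANONICAL_SOURCES = {"policy-handbook", "billing-api-docs"}  # hypothetical names
MAX_AGE_DAYS = 365       # illustrative freshness limit
CANONICAL_BOOST = 0.2    # illustrative ranking boost


def select_trusted(chunks, user_groups, today, k=3):
    """Filter by access and freshness, then rank with a canonical-source boost."""
    eligible = []
    for c in chunks:
        if not (c.allowed_groups & user_groups):
            continue  # access policy is a hard filter, never a score
        if (today - c.updated).days > MAX_AGE_DAYS:
            continue  # stale content is dropped rather than down-ranked
        boost = CANONICAL_BOOST if c.source in CANONICAL_SOURCES else 0.0
        eligible.append((c.similarity + boost, c))
    eligible.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in eligible[:k]]
```

The design choice worth noting is that permissions are never part of the score: a chunk the user cannot read is removed before ranking, so no similarity value can bring it back.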
Evaluation debt: you can’t fix what you don’t measure. If you don’t keep a test set of real questions and expected citations, your RAG system degrades without anyone noticing until a customer escalates.
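A test set of real questions and expected citations can start very small. The sketch below assumes a hypothetical `retrieve(question)` callable that returns a list of document ids, and scores each case by citation recall; the metric, the threshold, and the names are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_citations: set  # document ids a correct answer must cite


def citation_recall(expected, retrieved):
    """Fraction of expected documents that were actually retrieved."""
    if not expected:
        return 1.0
    return len(expected & set(retrieved)) / len(expected)


def run_regression(cases, retrieve, threshold=0.8):
    """Score every case; `retrieve` is an assumed callable: question -> doc ids."""
    scores = [citation_recall(c.expected_citations, retrieve(c.question)) for c in cases]
    mean = sum(scores) / len(scores)
    return {
        "mean_recall": mean,
        "failed": [c.question for c, s in zip(cases, scores) if s < 1.0],
        "passed": mean >= threshold,
    }
```

Run on every ingestion or prompt change, a report like this turns silent degradation into a failing check with a list of the questions that regressed.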
Contrarian take: a lot of teams would ship a better product by using less RAG and more structured backends (SQL, APIs, curated knowledge graphs, explicit policies). LLMs are great at explaining, summarizing, and generating. They’re mediocre at being your source of truth.
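    # Minimal “bounded agent” loop sketch (Python-like pseudocode)
    # Assumption: wait_for_review() sets status "WAITING_REVIEW", which also ends this loop.
    TERMINAL = {"DONE", "FAILED", "WAITING_REVIEW"}

    state = load_state(run_id)
    while state.status not in TERMINAL:
        # The model only proposes; the schema limits what it can propose.
        plan = llm.propose_next_action(schema=AllowedActions, state=state)

        # The policy gate runs before any side effect.
        if not policy.allows(plan, user=state.user):
            state = state.fail("POLICY_BLOCK")
            save_state(state)  # persist the failure before leaving the loop
            break

        if plan.type == "TOOL_CALL":
            # step_id is the idempotency key, so a retry replays instead of repeating.
            result = tools.execute(plan.tool, plan.args, idempotency_key=state.step_id)
            state = state.apply_result(result)

        # Evaluators, not the model, decide whether the run is finished.
        verdict = evals.check(state, requirements=AcceptanceCriteria)
        if verdict == "ACCEPT":
            state = state.done()
        elif verdict == "NEEDS_HUMAN":
            state = state.wait_for_review(queue="ops")

        save_state(state)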
This is the real work: explicit allowed actions, policy gates, idempotency keys, evals that can stop the run, and a clean handoff to humans.
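To make the `idempotency_key` argument in the loop concrete, here is a minimal in-memory sketch of a tool executor that replays a stored result instead of repeating a side effect. `ToolExecutor`, its log format, and the `send_email` tool are hypothetical; a real system would keep results and the audit log in a database and would also pass the key to downstream APIs that support one.

```python
class ToolExecutor:
    """In-memory sketch; persistence and concurrency control are out of scope."""

    def __init__(self, tools):
        self._tools = tools    # tool name -> callable(**args)
        self._results = {}     # idempotency_key -> stored result
        self.audit_log = []    # one entry per call, including replays

    def _log(self, key, tool, outcome):
        self.audit_log.append({"key": key, "tool": tool, "outcome": outcome})

    def execute(self, tool, args, idempotency_key, dry_run=False):
        if dry_run:
            # Describe the side effect without performing it.
            self._log(idempotency_key, tool, "dry_run")
            return {"dry_run": True, "tool": tool, "args": args}
        if idempotency_key in self._results:
            # Same key seen before: return the stored result, do not re-run.
            self._log(idempotency_key, tool, "replayed")
            return self._results[idempotency_key]
        result = self._tools[tool](**args)
        self._results[idempotency_key] = result
        self._log(idempotency_key, tool, "executed")
        return result


# Usage with a hypothetical tool that records what it "sent".
sent = []

def send_email(to, subject):
    sent.append((to, subject))
    return {"message_id": len(sent)}

executor = ToolExecutor({"send_email": send_email})
args = {"to": "a@example.com", "subject": "hi"}
preview = executor.execute("send_email", args, "run-1:step-3", dry_run=True)
first = executor.execute("send_email", args, "run-1:step-3")
again = executor.execute("send_email", args, "run-1:step-3")  # retry after a timeout
```

The retry returns the stored result and the email goes out once, while the audit log still records all three calls.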
Design for “human-in-the-loop” like you actually mean it
“Human-in-the-loop” became a slogan because teams realized agents can’t be trusted. But most implementations are performative: a single approval button at the end, after the agent already made irreversible calls.
Two review patterns that hold up
Pre-flight approval: the agent drafts a plan with explicit side effects (“create Zendesk ticket,” “refund order,” “rotate API key”), the human approves the plan, then the system executes deterministically. This is boring. It works.
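A sketch of pre-flight approval, under the assumption that the reviewer approves a digest of the exact plan: execution refuses to run if the plan no longer matches what was approved. `PlannedEffect`, `plan_digest`, and `execute_approved` are illustrative names, and the tools are placeholders.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedEffect:
    tool: str   # e.g. "create_ticket"
    args: dict  # explicit arguments the reviewer can read


def plan_digest(effects):
    """Stable fingerprint of the plan; this is what the human approves."""
    payload = json.dumps(
        [{"tool": e.tool, "args": e.args} for e in effects], sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def execute_approved(effects, approved_digest, tools):
    """Run the plan in order, only if it is exactly the plan that was approved."""
    if plan_digest(effects) != approved_digest:
        raise PermissionError("plan changed after approval")
    return [tools[e.tool](**e.args) for e in effects]
```

After approval the model is out of the loop: the system executes a fixed list, and any added or edited step changes the digest and blocks the whole run.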
Mid-flight checkpoints: the agent proceeds automatically until it reaches a high-risk action, then pauses for human approval before executing it. That requires risk scoring by action type and by resource (prod vs sandbox, finance vs marketing). Don’t pretend a single “are you sure?” dialog is governance.
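Risk scoring by action type and by resource can begin as two lookup tables and a threshold. The scores, names, and threshold below are assumptions chosen for illustration; the one deliberate choice is that anything unknown scores high enough to force a checkpoint.

```python
# Hypothetical risk tables; tune the values to your own systems.
ACTION_RISK = {
    "read_record": 0,
    "create_ticket": 1,
    "refund_order": 3,
    "rotate_api_key": 3,
}
RESOURCE_RISK = {
    "sandbox": 0,
    "marketing": 1,
    "finance": 2,
    "prod": 3,
}
REVIEW_THRESHOLD = 4  # at or above this score, pause for a human


def risk_score(action, resource):
    """Sum of action and resource risk; unknown values fail closed."""
    return (
        ACTION_RISK.get(action, REVIEW_THRESHOLD)
        + RESOURCE_RISK.get(resource, REVIEW_THRESHOLD)
    )


def needs_checkpoint(action, resource):
    """True when the agent must stop and wait for approval."""
    return risk_score(action, resource) >= REVIEW_THRESHOLD
```

With these values, a refund against finance pauses for review, a ticket in the sandbox proceeds, and an action nobody has classified yet always stops.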
Table 2: A practical checklist for shipping an agent that touches real systems
| Area | Non-negotiable | What to write down | Tooling examples |
|---|---|---|---|
| State | Explicit run state persisted outside the model | State schema, transitions, terminal states | Postgres, Temporal, Redis (for queues) |
| Side effects | Idempotency + audit log for every write | Idempotency keys, retry policy, rollback story | Temporal activities, Stripe idempotency keys (payments) |
| Permissions | Least privilege; no shared “agent admin” token | Scopes per tool, secrets rotation, impersonation rules | OAuth scopes, AWS IAM, GCP IAM, Vault |
| Evaluation | Automated acceptance checks, not vibes | Test set, pass/fail criteria, citation requirements | LangSmith, Arize Phoenix, custom unit tests |
| Observability | Traces across model + tools + workflow | Trace IDs, structured logs, error taxonomy | OpenTelemetry, Datadog, Honeycomb |
The business model shift founders miss: agents push you into services unless you productize reliability
An unreliable agent creates a hidden requirement: someone has to babysit it. If that someone is your team, congratulations — you built a services business with an LLM cost center. If that someone is your customer, churn will do the math for you.
The only escape is to productize reliability. That means:
- Choose narrow authority: one domain, one set of systems, one clear definition of “done.”
- Own the integration surface: fewer tools, higher quality connectors, strong schemas, versioned contracts.
- Make failure explicit: a run that stops and asks for help is a success. A run that lies is a defect.
- Ship evals like you ship tests: PRs that change prompts/tools should run regression suites.
- Sell the workflow, not the model: buyers pay for time saved and risk reduced, not “GPT-5 inside.”
This is why “agent wrappers” get competed into the ground. The model providers will keep improving tool use and structured output. Your differentiation has to live in the constraints, the data contracts, the operational hooks, and the workflow ownership.
A prediction worth building around: “Agent OS” becomes a category, and it won’t look like chat
The chat interface was a bridge. The durable interfaces for agentic systems will look like: queued work items, plans with diffs, execution logs, approvals, and traces. More Jira than ChatGPT. More CI than conversation.
So here’s a concrete next action: pick one agent project in your org and write a one-page spec that answers four questions with zero poetry:
- What is the state model (objects, transitions, terminal states)?
- What are the allowed side effects, and how are they made idempotent?
- What is the acceptance test (how do we know it’s correct)?
- Where do humans intervene (pre-flight, mid-flight, or post-flight), and why?
If you can’t answer those, don’t buy another model. Don’t add another tool. Build the runtime.