The quiet failure mode: your product is a vibe, not a system
Teams keep bragging about “agents” while shipping something closer to a haunted house: sometimes magical, sometimes broken, always hard to reason about. The recurring mistake isn’t model choice. It’s architectural laziness—treating a probabilistic text generator as if it were a deterministic subsystem.
LLMs are useful, but they’re not stable. Their outputs shift with minor prompt edits, hidden changes in hosted models, and the messy edge cases you only meet in production. If you ship prompts as product logic, you’re choosing random behavior as a feature.
So here’s the contrarian take: the competitive edge in 2026 isn’t “better prompting” or even “better fine-tuning.” It’s building deterministic systems around non-deterministic models: typed interfaces, constrained tools, explicit state, audit trails, and hard failure modes. The model becomes replaceable. The system becomes the moat.
Models will keep changing under you. Design like they will.
If you build on hosted models, you don’t control the underlying weights, safety layers, routing, or tool-use behaviors. That’s not paranoia; it’s the hosted AI business model. OpenAI, Google, and Anthropic iterate constantly. That iteration is good for the world—and destabilizing for any app whose logic is “the model will respond like it did last month.”
Running open-weight models doesn’t exempt you. Meta’s Llama ecosystem moves fast; so do inference stacks like vLLM and llama.cpp. Quantization choices change outputs. System prompts drift. Tokenizers differ. Small deltas become product bugs.
Founders hate hearing this because it sounds like “slow down.” It’s the opposite. Systems discipline is how you move fast without retraining your support team every time a model release lands.
Non-deterministic components demand deterministic boundaries. If you can’t explain what the model is allowed to do, you’re not building a product—you’re running an experiment.
Where teams get trapped
There are three common traps, all self-inflicted:
- Prompt-as-business-logic: pricing rules, eligibility logic, policy checks, or workflow routing expressed in prose.
- Tool soup: giving the model ten tools, no schema discipline, and hoping it “figures it out.”
- State amnesia: letting the model invent state (“I already sent that email”) because you didn’t model state explicitly.
Every one of these ends in the same place: brittle behavior, long debugging sessions, and a risk posture that scares serious buyers.
The 2026 stack is emerging: one model, many guardrails
Look at what serious teams are standardizing on: structured outputs, typed tool calls, retrieval with citations, traceability, evaluation harnesses, and policy enforcement outside the model. Not because it’s trendy—because it’s the only way to operate at scale.
OpenAI pushed the ecosystem toward tool calling and structured outputs; Anthropic emphasized tool use and controllability; Google baked LLMs into a broader platform with Vertex AI. In parallel, the open-source world filled in the missing pieces: Langfuse for traces, OpenTelemetry for observability, vLLM for serving, and a growing set of eval tools (including OpenAI Evals and EleutherAI’s lm-evaluation-harness) to stop arguing from vibes.
Table 1: Comparison of common “agent” building blocks (what they’re actually good for)
| Component | Best use | Failure mode if misused | Practical guardrail |
|---|---|---|---|
| Tool/function calling (OpenAI, Anthropic) | Constrained actions with typed inputs | Model hallucinates arguments or selects wrong tool | JSON schema validation + allowlist + retries with critique |
| RAG (vector search + citations) | Grounded answers over proprietary docs | Retrieves irrelevant chunks; confident wrong answers | Query rewriting + re-ranking + “must cite sources” policy |
| Fine-tuning (OpenAI, Google Vertex AI) | Style, domain phrasing, narrow formats | Bakes in outdated policy; hides errors behind fluency | Keep policy outside the model; retrain on a schedule |
| Agent frameworks (LangChain, LlamaIndex) | Rapid prototyping of multi-step flows | Opaque chains; debugging via guesswork | Tracing (Langfuse) + explicit state machine for prod |
| Workflow engines (Temporal, AWS Step Functions) | Durable execution, retries, compensation | Overhead if used for simple chat | Use for “does stuff” agents; keep chat lightweight |
Key Takeaway
If your “agent” can’t produce an audit trail a security team can review, it’s a demo. A product has logs, schemas, invariants, and clear ownership of state.
The missing layer: policy and invariants outside the model
Most “AI safety” discussions are abstract. Operators need something concrete: invariants, meaning rules the system enforces regardless of model output. Think: “never email an external domain without approval,” “never execute SQL without parameterization,” “never transfer money,” “never hard-delete a record; soft-delete instead.”
Put invariants in code, not in prompts. Prompts are documentation at best.
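One way to put invariants like these in code is a list of small predicate functions that every proposed action must pass before anything executes. The sketch below is illustrative only: the `Action` shape, the `INTERNAL_DOMAINS` set, and the two invariant functions are hypothetical names, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical: domains treated as internal for this example.
INTERNAL_DOMAINS = {"example.com"}


@dataclass(frozen=True)
class Action:
    """A tool call proposed by the model, not yet executed."""
    tool: str
    args: dict = field(default_factory=dict)
    approved_by: Optional[str] = None  # set only by a human approval step


def no_unapproved_external_email(action: Action) -> Optional[str]:
    """Invariant: never email an external domain without approval."""
    if action.tool != "send_email":
        return None
    domain = str(action.args.get("to", "")).rpartition("@")[2].lower()
    if domain not in INTERNAL_DOMAINS and action.approved_by is None:
        return f"external email to {domain!r} requires approval"
    return None


def never_transfer_money(action: Action) -> Optional[str]:
    """Invariant: never transfer money, whatever the model says."""
    if action.tool == "transfer_money":
        return "money transfers are not permitted"
    return None


INVARIANTS: list[Callable[[Action], Optional[str]]] = [
    no_unapproved_external_email,
    never_transfer_money,
]


def violations(action: Action) -> list[str]:
    """Return every violated invariant; an empty list means the action may run."""
    return [msg for check in INVARIANTS if (msg := check(action)) is not None]
```

The executor refuses any action for which `violations(...)` is non-empty. The prompt can still describe the rules, but nothing depends on the model having read them.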
“Agents” that work are just state machines with an LLM in the loop
Here’s a useful reframe that removes most of the mystique: a production agent is a state machine (or workflow) where one transition function happens to be an LLM call. Everything else—tools, permissions, retries, approvals, idempotency—is standard distributed systems engineering.
Temporal became popular for microservices because it makes distributed workflows debuggable and durable. Those same properties matter more when one step is a model that may misunderstand context or produce invalid output. If your agent can take actions, you want durable execution and replayability. That’s Temporal’s whole thing.
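To make the reframe concrete, here is a minimal in-memory sketch: explicit states, an allowlisted transition table, and the model call reduced to one injected function. The state names, the `Workflow` class, and the `decide` callable are assumptions made for illustration. This is not Temporal's API, and it has none of Temporal's durability.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    RECEIVED = "received"
    DRAFTED = "drafted"
    AWAITING_APPROVAL = "awaiting_approval"
    SENT = "sent"
    REJECTED = "rejected"


# Deterministic transition table: the model can only pick among these edges.
ALLOWED = {
    State.RECEIVED: {State.DRAFTED, State.REJECTED},
    State.DRAFTED: {State.AWAITING_APPROVAL, State.REJECTED},
    State.AWAITING_APPROVAL: {State.SENT, State.REJECTED},
    State.SENT: set(),
    State.REJECTED: set(),
}


class Workflow:
    def __init__(self, decide: Callable[[State], str]):
        self.state = State.RECEIVED
        self.history = [State.RECEIVED]
        self._decide = decide  # stand-in for an LLM call returning a state name

    def step(self) -> State:
        """Ask the model for the next state, then enforce the table."""
        proposed = self._decide(self.state)
        try:
            target = State(proposed)
        except ValueError:
            raise ValueError(f"model proposed unknown state {proposed!r}") from None
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {target.value}"
            )
        self.state = target
        self.history.append(target)
        return target
```

The model proposes; the table decides. A durable workflow engine would persist `history` and replay it after a crash, which this sketch does not attempt.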
A concrete pattern: “plan → propose → verify → act”
Not as a cute slogan. As an execution contract.
- Plan: the model proposes a sequence of steps in a constrained format.
- Propose: for each step, it proposes a tool call with typed arguments.
- Verify: deterministic checks validate schema, permissions, rate limits, and business invariants; optional second-model critique.
- Act: the system executes tool calls; results are written to state; the model can only read state, not invent it.
Yes, this reduces the “magic.” It also makes the system operable.
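As a rough sketch of that contract, the loop below takes a plan (a list of proposed tool calls, as a model might emit them), verifies each call deterministically, and writes every outcome to state. The `TOOLS` registry, the `PERMISSIONS` table, and the toy tool implementations are hypothetical stand-ins; a real system would also check rate limits and business invariants in `verify`.

```python
from typing import Callable

# Hypothetical tool registry: name -> (required arg names, implementation).
TOOLS: dict[str, tuple[set[str], Callable[..., str]]] = {
    "create_ticket": ({"title"}, lambda title: f"ticket:{title}"),
    "send_email": ({"to", "body"}, lambda to, body: f"emailed:{to}"),
}
# Hypothetical per-user allowlist; anything absent is denied.
PERMISSIONS = {"alice": {"create_ticket"}}


def verify(user: str, call: dict) -> list[str]:
    """Deterministic checks; returns reasons to reject (empty means OK)."""
    tool, args = call.get("tool"), call.get("args")
    if not isinstance(tool, str) or tool not in TOOLS:
        return [f"unknown tool {tool!r}"]
    errors = []
    if not isinstance(args, dict) or set(args) != TOOLS[tool][0]:
        errors.append(f"bad arguments for {tool}")
    if tool not in PERMISSIONS.get(user, set()):
        errors.append(f"{user} may not call {tool}")
    return errors


def run(user: str, plan: list[dict], state: list[dict]) -> list[dict]:
    """Execute each proposed call only after it passes verify()."""
    for call in plan:
        errors = verify(user, call)
        if errors:
            state.append({"call": call, "status": "rejected", "errors": errors})
            continue
        result = TOOLS[call["tool"]][1](**call["args"])
        state.append({"call": call, "status": "done", "result": result})
    return state
```

The only writer to `state` is `run`. If the model later claims it "already sent that email," the system can check `state` instead of taking its word.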
What this looks like in practice (minimal, but real)
Below is a tiny example using a JSON Schema validation step. The point isn’t the library—it’s the discipline: the model doesn’t get to decide what valid output means.
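
    import json

    from jsonschema import validate

    # Contract for a model-proposed tool call: allowlisted tools only, no extra keys.
    TOOL_CALL_SCHEMA = {
        "type": "object",
        "properties": {
            "tool": {"type": "string", "enum": ["create_ticket", "send_email"]},
            "args": {"type": "object"},
        },
        "required": ["tool", "args"],
        "additionalProperties": False,
    }


    def parse_tool_call(model_text: str) -> dict:
        """Parse model output and enforce the contract.

        Raises json.JSONDecodeError on malformed JSON and
        jsonschema.ValidationError on a contract violation.
        """
        payload = json.loads(model_text)
        validate(instance=payload, schema=TOOL_CALL_SCHEMA)
        return payload
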
You can swap the model, prompt, or vendor. The schema and invariants stay. That’s the point.
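The schema above still accepts any object as `args`. One way to tighten that is a typed contract per tool. The sketch below uses only the standard library; the `CreateTicketArgs` and `SendEmailArgs` field names are invented for illustration, and the type check handles only simple, non-generic annotations such as `str`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CreateTicketArgs:
    title: str
    priority: str = "normal"


@dataclass(frozen=True)
class SendEmailArgs:
    to: str
    subject: str
    body: str


# Hypothetical mapping from tool name to its argument contract.
ARG_TYPES = {"create_ticket": CreateTicketArgs, "send_email": SendEmailArgs}


def parse_args(tool: str, args: dict):
    """Build the typed args object, rejecting unknown, missing, or mistyped fields."""
    if tool not in ARG_TYPES:
        raise ValueError(f"unknown tool: {tool!r}")
    cls = ARG_TYPES[tool]
    extra = set(args) - {f.name for f in fields(cls)}
    if extra:
        raise ValueError(f"unexpected arguments: {sorted(extra)}")
    try:
        obj = cls(**args)
    except TypeError as exc:  # raised when a required field is missing
        raise ValueError(str(exc)) from None
    for f in fields(cls):
        # Works because the annotations here are plain classes, not strings.
        if not isinstance(getattr(obj, f.name), f.type):
            raise ValueError(f"{f.name} must be {f.type.__name__}")
    return obj
```

Downstream code then receives a `SendEmailArgs`, not a dict of whatever the model felt like emitting.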
Tooling maturity is the real platform war
The model labs want you to believe the battle is model quality. Operators should care more about evals, tracing, access control, and predictable tool use, because that is where costs and incidents come from.
Microsoft’s GitHub Copilot succeeded not because it was the first code model, but because it shipped inside the workflow developers already live in (VS Code, GitHub) and kept getting operational polish. The lesson transfers: AI features win when they fit the stack and can be governed.
Two worlds: chat apps vs. action apps
Most teams build “chat apps” and call them agents. Action apps are different. If the system can change state in the real world—create invoices, modify infrastructure, message customers—you need controls that look like classic production software controls.
- Identity: every action tied to a user, service account, or delegated token
- Authorization: explicit permission checks per tool
- Audit: immutable logs of prompts, retrieved context, tool calls, results
- Rate limiting: per user, per tool, per workflow
- Human gates: approval steps for high-risk actions
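Two of these controls fit in a few lines: a deny-by-default authorization check and an append-only audit log in which each entry commits to the hash of the previous one. The `GRANTS` table and the `AuditLog` class below are hypothetical. Hash chaining makes tampering detectable, not impossible, so a production log would still need write-once storage behind it.

```python
import hashlib
import json

# Hypothetical grants: principal -> tools it may call. Absent means denied.
GRANTS = {"svc-support": {"create_ticket"}}


def is_authorized(principal: str, tool: str) -> bool:
    """Deny by default: only explicitly granted (principal, tool) pairs pass."""
    return tool in GRANTS.get(principal, set())


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def append(self, principal: str, tool: str, allowed: bool) -> dict:
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {"principal": principal, "tool": tool, "allowed": allowed, "prev": prev}
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True
```

Log the denials too. A reviewer asking "what did the agent try to do?" cares as much about the blocked calls as the successful ones.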
Table 2: Production checklist for LLM-in-the-loop systems (what to implement before you scale usage)
| Area | Minimum bar | Good | Strong |
|---|---|---|---|
| Outputs | Structured JSON for any action | Schema validation + retries | Versioned contracts per tool + compatibility tests |
| State | Server-side state store | Idempotency keys for tool calls | Durable workflows (Temporal / Step Functions) + replay |
| Observability | Request logs | Traces for prompt → retrieval → tool calls | OpenTelemetry integration + redaction + retention policy |
| Quality | Golden test prompts | Automated eval harness (e.g., OpenAI Evals) | Task-specific evals + regression gates in CI |
| Governance | Basic PII redaction | Per-tool authorization + allowlists | Policy-as-code + human approval for risky transitions |
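The "idempotency keys for tool calls" cell deserves a concrete shape, because retries are where action apps double-send emails and double-issue refunds. A minimal sketch, assuming an in-memory dict in place of a durable store; `idempotency_key` and `ToolExecutor` are illustrative names.

```python
import hashlib
import json
from typing import Callable


def idempotency_key(workflow_id: str, step: int, tool: str, args: dict) -> str:
    """Derive a stable key from what the call is, not when it was attempted."""
    raw = json.dumps([workflow_id, step, tool, args], sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()


class ToolExecutor:
    def __init__(self):
        # In-memory stand-in for a durable store (assumption for this sketch).
        self._results: dict[str, object] = {}

    def execute(self, key: str, fn: Callable[[], object]) -> object:
        """Run fn at most once per key; retries get the stored result."""
        if key not in self._results:
            self._results[key] = fn()
        return self._results[key]
```

If the model or the workflow retries the same step, the side effect happens once and the second attempt returns the recorded result. This single-process version has no locking, so concurrent callers would need an atomic check-and-set in the real store.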
A prediction worth building around: “model choice” stops being a strategy
In 2023–2025, picking a model looked like strategy because capability jumps were visible to end users. By 2026, the difference between “usable” and “best” models matters less than whether your system is governable. Buyers will assume models improve. They won’t assume your workflows are safe.
That’s why the real platform war is shifting toward the control plane: who gives operators the best tracing, evals, policy enforcement, and cost controls. Cloud vendors (AWS, Microsoft, Google) are structurally advantaged here because they already own identity, logging, and governance primitives. The model labs are racing to catch up with enterprise features. Open-source will keep winning where you need inspectability and custom control, but it will cost you operational burden.
So the action item isn’t “pick the right model.” It’s this: write down your system invariants and build the smallest enforcement layer that makes them true even if the model behaves badly. Then wire evals into CI so you can change prompts, retrieval, or models without praying.
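"Wire evals into CI" can start very small: a handful of golden cases and a function that raises when the pass rate drops. The cases, the `route` callable (a stand-in for your prompt plus model plus parsing), and the threshold below are all hypothetical.

```python
from typing import Callable

# Hypothetical golden cases: input text -> the tool the system should choose.
GOLDEN_CASES = [
    ("My invoice is wrong", "create_ticket"),
    ("Please email the customer the receipt", "send_email"),
    ("Cannot log in", "create_ticket"),
]


def pass_rate(route: Callable[[str], str], cases=GOLDEN_CASES) -> float:
    """Fraction of golden cases where the system under test picks the expected tool."""
    passed = sum(1 for text, expected in cases if route(text) == expected)
    return passed / len(cases)


def gate(route: Callable[[str], str], threshold: float = 1.0) -> None:
    """Raise (and so fail the CI job) if quality drops below the threshold."""
    rate = pass_rate(route)
    if rate < threshold:
        raise AssertionError(
            f"eval pass rate {rate:.2f} below threshold {threshold:.2f}"
        )
```

Call `gate(...)` from a test that runs on every prompt, retrieval, or model change. A failing gate is a far cheaper signal than a support ticket.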
A concrete next action
Pick one workflow where your LLM can cause real damage (emails, tickets, refunds, infra changes). Add (1) a typed tool contract, (2) schema validation, (3) an immutable audit log, and (4) a “deny by default” permission check. If that sounds like too much work, your agent isn’t ready to take actions.
Sit with one question before you ship your next “agent”: if a regulator, customer, or incident reviewer asked “why did the system do that?”, do you have an answer that isn’t “the model decided”?