Stop Shipping LLM Prompts. Ship Deterministic Systems Around Them.

The winners in 2026 won’t be the teams with the best model. They’ll be the teams who treat LLMs like unreliable components—and engineer the rest like it matters.

The quiet failure mode: your product is a vibe, not a system

Teams keep bragging about “agents” while shipping something closer to a haunted house: sometimes magical, sometimes broken, always hard to reason about. The recurring mistake isn’t model choice. It’s architectural laziness—treating a probabilistic text generator as if it were a deterministic subsystem.

LLMs are useful, but they’re not stable. Their outputs shift with minor prompt edits, hidden changes in hosted models, and the messy edge cases you only meet in production. If you ship prompts as product logic, you’re choosing random behavior as a feature.

So here’s the contrarian take: the competitive edge in 2026 isn’t “better prompting” or even “better fine-tuning.” It’s building deterministic systems around non-deterministic models: typed interfaces, constrained tools, explicit state, audit trails, and hard failure modes. The model becomes replaceable. The system becomes the moat.

Production reliability is a systems problem, not a prompt-writing contest.

Models will keep changing under you. Design like they will.

If you build on hosted models, you don’t control the underlying weights, safety layers, routing, or tool-use behaviors. That’s not paranoia; it’s the hosted AI business model. OpenAI, Google, and Anthropic iterate constantly. That iteration is good for the world—and destabilizing for any app whose logic is “the model will respond like it did last month.”

Even if you run open-weight models, you’re not free. Meta’s Llama ecosystem moves fast; so do inference stacks like vLLM and llama.cpp. Quantization choices change outputs. System prompts drift. Tokenizers differ. Small deltas become product bugs.

Founders hate hearing this because it sounds like “slow down.” It’s the opposite. Systems discipline is how you move fast without retraining your support team every time a model release lands.

Non-deterministic components demand deterministic boundaries. If you can’t explain what the model is allowed to do, you’re not building a product—you’re running an experiment.

Where teams get trapped

There are three common traps, all self-inflicted:

  • Prompt-as-business-logic: pricing rules, eligibility logic, policy checks, or workflow routing expressed in prose.
  • Tool soup: giving the model ten tools, no schema discipline, and hoping it “figures it out.”
  • State amnesia: letting the model invent state (“I already sent that email”) because you didn’t model state explicitly.

Every one of these ends in the same place: brittle behavior, long debugging sessions, and a risk posture that scares serious buyers.

If you wouldn’t accept “it usually works” in payments or auth, don’t accept it in agent workflows.

The 2026 stack is emerging: one model, many guardrails

Look at what serious teams are standardizing on: structured outputs, typed tool calls, retrieval with citations, traceability, evaluation harnesses, and policy enforcement outside the model. Not because it’s trendy—because it’s the only way to operate at scale.

OpenAI pushed the ecosystem toward tool calling and structured outputs; Anthropic emphasized tool use and controllability; Google baked LLMs into a broader platform with Vertex AI. In parallel, the open-source world filled in the missing pieces: Langfuse for traces, OpenTelemetry for observability, vLLM for serving, and a growing set of eval tools (including OpenAI Evals and EleutherAI’s lm-evaluation-harness) to stop arguing from vibes.

Table 1: Comparison of common “agent” building blocks (what they’re actually good for)

| Component | Best use | Failure mode if misused | Practical guardrail |
|---|---|---|---|
| Tool/function calling (OpenAI, Anthropic) | Constrained actions with typed inputs | Model hallucinates arguments or selects wrong tool | JSON schema validation + allowlist + retries with critique |
| RAG (vector search + citations) | Grounded answers over proprietary docs | Retrieves irrelevant chunks; confident wrong answers | Query rewriting + re-ranking + “must cite sources” policy |
| Fine-tuning (OpenAI, Google Vertex AI) | Style, domain phrasing, narrow formats | Bakes in outdated policy; hides errors behind fluency | Keep policy outside the model; re-train on schedule |
| Agent frameworks (LangChain, LlamaIndex) | Rapid prototyping of multi-step flows | Opaque chains; debugging via guesswork | Tracing (Langfuse) + explicit state machine for prod |
| Workflow engines (Temporal, AWS Step Functions) | Durable execution, retries, compensation | Overhead if used for simple chat | Use for “does stuff” agents; keep chat lightweight |
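
The “retries with critique” guardrail in Table 1 can be sketched as a bounded loop: validate the output, and on failure hand the validator's message back to the model. This is a minimal sketch, not a reference implementation. `call_model` is a hypothetical stand-in for whatever client you use, and the allowlist and attempt count are illustrative.

```python
import json
from typing import Callable, Optional

ALLOWED_TOOLS = {"create_ticket", "send_email"}  # illustrative allowlist


def check(payload: object) -> Optional[str]:
    """Return an error message, or None if the payload is acceptable."""
    if not isinstance(payload, dict):
        return "output must be a JSON object"
    if payload.get("tool") not in ALLOWED_TOOLS:
        return f"tool must be one of {sorted(ALLOWED_TOOLS)}"
    if not isinstance(payload.get("args"), dict):
        return "args must be a JSON object"
    return None


def call_with_retries(call_model: Callable[[str], str], prompt: str,
                      max_attempts: int = 3) -> dict:
    """Ask the model, validate, and retry with the validator's critique."""
    critique = ""
    for _ in range(max_attempts):
        text = call_model(prompt + critique)  # hypothetical model client
        try:
            payload = json.loads(text)
            error = check(payload)
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc.msg}"
        if error is None:
            return payload
        critique = f"\nYour previous output was rejected: {error}. Try again."
    # Hard failure: the caller decides what to do, not the model.
    raise ValueError(f"no valid tool call after {max_attempts} attempts")
```

The loop is bounded on purpose: after the last attempt the system fails loudly instead of acting on output it could not validate.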

Key Takeaway

If your “agent” can’t produce an audit trail a security team can review, it’s a demo. A product has logs, schemas, invariants, and clear ownership of state.

The missing layer: policy and invariants outside the model

Most “AI safety” discussions are abstract. Operators need something concrete: invariants. Invariants are rules the system enforces regardless of model output. Think: “never email an external domain without approval,” “never execute SQL without parameterization,” “never transfer money,” “never hard-delete a record; soft-delete instead.”

Put invariants in code, not in prompts. Prompts are documentation at best.
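
As a rough illustration, the first invariant quoted above (“never email an external domain without approval”) might look like this in code. The names `INTERNAL_DOMAINS`, `EmailAction`, and `enforce_email_invariant` are hypothetical, and the domain is a placeholder. The point is only that the rule runs on every proposed action, whatever the prompt says.

```python
from dataclasses import dataclass
from typing import Optional

INTERNAL_DOMAINS = {"example.com"}  # hypothetical: domains your org owns


class InvariantViolation(Exception):
    """Raised when a proposed action would break a system rule."""


@dataclass(frozen=True)
class EmailAction:
    to: str
    body: str
    # Set by a human approval step in your system, never by the model.
    approved_by: Optional[str] = None


def enforce_email_invariant(action: EmailAction) -> EmailAction:
    """Never email an external domain without approval."""
    domain = action.to.rsplit("@", 1)[-1].lower()
    if domain not in INTERNAL_DOMAINS and action.approved_by is None:
        raise InvariantViolation(f"external recipient {action.to!r} needs approval")
    return action
```

Because the check raises instead of returning a warning, a model that ignores its instructions still cannot send the email.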

Structured outputs and validation turn model text into something you can operate.

“Agents” that work are just state machines with an LLM in the loop

Here’s a useful reframe that removes most of the mystique: a production agent is a state machine (or workflow) where one transition function happens to be an LLM call. Everything else—tools, permissions, retries, approvals, idempotency—is standard distributed systems engineering.

Temporal became popular for microservices because it makes distributed workflows debuggable and durable. Those same properties matter more when one step is a model that may misunderstand context or produce invalid output. If your agent can take actions, you want durable execution and replayability. That’s Temporal’s whole thing.
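
To make the reframe concrete, here is a minimal sketch of a state machine in which the model only proposes the next state and a transition table decides. The states and the `propose_next` callable are hypothetical. A durable workflow engine would add persistence and replay on top of this.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    DRAFTING = "drafting"
    AWAITING_APPROVAL = "awaiting_approval"
    SENT = "sent"
    REJECTED = "rejected"


# The only moves the system permits, whatever the model says.
TRANSITIONS = {
    State.DRAFTING: {State.AWAITING_APPROVAL, State.REJECTED},
    State.AWAITING_APPROVAL: {State.SENT, State.REJECTED},
    State.SENT: set(),
    State.REJECTED: set(),
}


class IllegalTransition(Exception):
    """Raised when the proposed move is not in the transition table."""


def step(current: State, propose_next: Callable[[State], str]) -> State:
    """Run one transition: the LLM proposes, the table decides."""
    proposed = propose_next(current)  # e.g. wraps a model call
    try:
        target = State(proposed)
    except ValueError:
        raise IllegalTransition(f"unknown state {proposed!r}") from None
    if target not in TRANSITIONS[current]:
        raise IllegalTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```

A model that tries to jump from drafting straight to sent gets an exception, not a sent email.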

A concrete pattern: “plan → propose → verify → act”

Not as a cute slogan. As an execution contract.

  1. Plan: the model proposes a sequence of steps in a constrained format.
  2. Propose: for each step, it proposes a tool call with typed arguments.
  3. Verify: deterministic checks validate schema, permissions, rate limits, and business invariants; optional second-model critique.
  4. Act: the system executes tool calls; results are written to state; the model can only read state, not invent it.

Yes, this reduces the “magic.” It also makes the system operable.
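
The four steps above can be sketched as a single loop. This is an assumption-laden outline, not a framework: `plan` stands for the model's already-parsed list of proposed calls, `tools` and `verify` are supplied by your system, and `RunState` is a hypothetical in-memory stand-in for a real state store.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunState:
    """Server-side record of what actually happened."""
    results: list = field(default_factory=list)
    rejected: list = field(default_factory=list)


def run(plan: list, tools: dict, verify: Callable[[dict], bool],
        state: RunState) -> RunState:
    """Execute proposed calls only after deterministic verification."""
    for call in plan:  # plan/propose: produced by the model
        known = call.get("tool") in tools and isinstance(call.get("args"), dict)
        if not known or not verify(call):
            state.rejected.append(call)  # verify: deny anything unproven
            continue
        output = tools[call["tool"]](**call["args"])  # act: the system executes
        state.results.append({"call": call, "output": output})
    return state
```

The model never writes to `RunState` directly. It can be shown `state.results` on the next turn, which is how it learns what was really done.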

What this looks like in practice (minimal, but real)

Below is a tiny example using a JSON Schema validation step. The point isn’t the library—it’s the discipline: the model doesn’t get to decide what valid output means.

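import json
from jsonschema import validate

# Contract for a single tool call. Model output must match it exactly.
TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {
        # Allowlist: only these tools can ever be requested.
        "tool": {"type": "string", "enum": ["create_ticket", "send_email"]},
        "args": {"type": "object"},
    },
    "required": ["tool", "args"],
    "additionalProperties": False,
}


def parse_tool_call(model_text: str) -> dict:
    """Parse model output and reject anything that violates the schema."""
    # json.loads raises json.JSONDecodeError on malformed JSON;
    # validate raises jsonschema.ValidationError on a schema mismatch.
    payload = json.loads(model_text)
    validate(instance=payload, schema=TOOL_CALL_SCHEMA)
    return payload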

You can swap the model, prompt, or vendor. The schema and invariants stay. That’s the point.
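
The schema above accepts any object for `args`, so a natural next step is a typed contract per tool. The sketch below uses only the standard library. The field names are invented for illustration, and every field is assumed to be a string. The same contract could equally be written as one JSON Schema per tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CreateTicketArgs:  # hypothetical fields
    title: str
    priority: str


@dataclass(frozen=True)
class SendEmailArgs:  # hypothetical fields
    to: str
    subject: str
    body: str


ARG_TYPES = {"create_ticket": CreateTicketArgs, "send_email": SendEmailArgs}


def parse_args(tool: str, args: dict):
    """Build the typed argument object for a tool, rejecting bad shapes."""
    arg_type = ARG_TYPES[tool]  # KeyError for tools outside the contract
    expected = {f.name for f in fields(arg_type)}
    if set(args) != expected:
        raise ValueError(f"{tool} expects {sorted(expected)}, got {sorted(args)}")
    if not all(isinstance(v, str) for v in args.values()):
        raise ValueError(f"{tool} arguments must all be strings")
    return arg_type(**args)
```

Downstream code then receives a `CreateTicketArgs` or `SendEmailArgs` instance, never a free-form dictionary.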

The hard part of agent ops is debugging and accountability, not “getting it to respond.”

Tooling maturity is the real platform war

The model labs want you to believe the battle is model quality. Operators should care more about evals, tracing, access control, and predictable tool use. That’s where costs and incidents come from.

Microsoft’s GitHub Copilot succeeded not because it was the first code model, but because it shipped inside the workflow developers already live in (VS Code, GitHub) and kept getting operational polish. The lesson transfers: AI features win when they fit the stack and can be governed.

Two worlds: chat apps vs. action apps

Most teams build “chat apps” and call them agents. Action apps are different. If the system can change state in the real world—create invoices, modify infrastructure, message customers—you need controls that look like classic production software controls.

  • Identity: every action tied to a user, service account, or delegated token
  • Authorization: explicit permission checks per tool
  • Audit: immutable logs of prompts, retrieved context, tool calls, results
  • Rate limiting: per user, per tool, per workflow
  • Human gates: approval steps for high-risk actions
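
Several of these controls fit in one small gate. The sketch below combines per-tool authorization, a human gate for high-risk tools, and a sliding-window rate limit. All names and limits are hypothetical, and a real system would keep grants and counters in shared storage, not in process memory.

```python
import time
from collections import defaultdict, deque
from typing import Optional

# Hypothetical grants: (principal, tool) pairs that are explicitly allowed.
GRANTS = {("svc-support", "create_ticket"), ("alice", "send_email")}
HIGH_RISK = {"send_email"}          # tools that require a human gate
MAX_CALLS, WINDOW_SECONDS = 5, 60   # illustrative per-principal, per-tool limit

_recent = defaultdict(deque)        # (principal, tool) -> call timestamps


def authorize(principal: str, tool: str, approved: bool = False,
              now: Optional[float] = None) -> bool:
    """Deny by default: allow only granted, approved, in-budget calls."""
    now = time.monotonic() if now is None else now
    if (principal, tool) not in GRANTS:
        return False
    if tool in HIGH_RISK and not approved:
        return False
    calls = _recent[(principal, tool)]
    while calls and now - calls[0] >= WINDOW_SECONDS:
        calls.popleft()  # drop timestamps that fell out of the window
    if len(calls) >= MAX_CALLS:
        return False
    calls.append(now)
    return True
```

Every branch that is not an explicit yes returns `False`, which is what “deny by default” means in practice.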

Table 2: Production checklist for LLM-in-the-loop systems (what to implement before you scale usage)

| Area | Minimum bar | Good | Strong |
|---|---|---|---|
| Outputs | Structured JSON for any action | Schema validation + retries | Versioned contracts per tool + compatibility tests |
| State | Server-side state store | Idempotency keys for tool calls | Durable workflows (Temporal / Step Functions) + replay |
| Observability | Request logs | Traces for prompt → retrieval → tool calls | OpenTelemetry integration + redaction + retention policy |
| Quality | Golden test prompts | Automated eval harness (e.g., OpenAI Evals) | Task-specific evals + regression gates in CI |
| Governance | Basic PII redaction | Per-tool authorization + allowlists | Policy-as-code + human approval for risky transitions |
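
The “idempotency keys for tool calls” entry in Table 2 deserves a short illustration, since retries are exactly where duplicate side effects creep in. This is a sketch under assumptions: `_completed` stands in for a database table, and `run_tool` is whatever function performs the real action.

```python
import hashlib
import json

_completed = {}  # idempotency key -> stored result (a database table in practice)


def idempotency_key(workflow_id: str, tool: str, args: dict) -> str:
    """Derive a stable key from the workflow and the exact call."""
    canonical = json.dumps([workflow_id, tool, args], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def execute_once(workflow_id: str, tool: str, args: dict, run_tool):
    """Run the tool at most once per key; replays return the stored result."""
    key = idempotency_key(workflow_id, tool, args)
    if key not in _completed:
        _completed[key] = run_tool(**args)
    return _completed[key]
```

If the model proposes the same call twice in one workflow, or a worker retries after a timeout, the second attempt returns the recorded result without repeating the action.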

A prediction worth building around: “model choice” stops being a strategy

In 2023–2025, picking a model looked like strategy because capability jumps were visible to end users. By 2026, the difference between “usable” and “best” models matters less than whether your system is governable. Buyers will assume models improve. They won’t assume your workflows are safe.

That’s why the real platform war is shifting toward the control plane: who gives operators the best tracing, evals, policy enforcement, and cost controls. Cloud vendors (AWS, Microsoft, Google) are structurally advantaged here because they already own identity, logging, and governance primitives. The model labs are racing to catch up with enterprise features. Open source will keep winning where you need inspectability and custom control, but you will pay for it in operational burden.

So the action item isn’t “pick the right model.” It’s this: write down your system invariants and build the smallest enforcement layer that makes them true even if the model behaves badly. Then wire evals into CI so you can change prompts, retrieval, or models without praying.
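
Wiring evals into CI can start very small. The sketch below assumes a hypothetical `pick_tool` function that wraps your prompt, retrieval, and model, plus a hand-written list of golden cases. The cases shown are invented examples, not a benchmark.

```python
from typing import Callable

# Hypothetical golden cases: input text and the tool the system must pick.
GOLDEN_CASES = [
    {"input": "My invoice is wrong", "expected_tool": "create_ticket"},
    {"input": "Tell Bob the build passed", "expected_tool": "send_email"},
]


def pass_rate(pick_tool: Callable[[str], str], cases: list) -> float:
    """Fraction of golden cases where the system picks the expected tool."""
    hits = sum(pick_tool(c["input"]) == c["expected_tool"] for c in cases)
    return hits / len(cases)


def regression_gate(pick_tool: Callable[[str], str], threshold: float = 1.0) -> None:
    """Fail the build (via AssertionError) if quality drops below threshold."""
    rate = pass_rate(pick_tool, GOLDEN_CASES)
    assert rate >= threshold, f"pass rate {rate:.2f} below {threshold:.2f}"
```

Call `regression_gate` from a test your CI already runs. A prompt or model change that breaks a golden case then blocks the merge instead of surfacing in production.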

A concrete next action

Pick one workflow where your LLM can cause real damage (emails, tickets, refunds, infra changes). Add (1) a typed tool contract, (2) schema validation, (3) an immutable audit log, and (4) a “deny by default” permission check. If that sounds like too much work, your agent isn’t ready to take actions.
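
For item (3), one lightweight approach is a hash-chained, append-only log. This sketch makes tampering detectable but does not prevent it. Real immutability also needs write-once storage, which is assumed here and not shown. The class and field names are illustrative.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


class AuditLog:
    """Append-only log where each entry commits to the previous one."""

    def __init__(self):
        self._entries = []

    def append(self, event: dict) -> str:
        """Record an event and return its chained hash."""
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        body = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        self._entries.append({"prev": prev, "event": body, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256((prev + entry["event"]).encode("utf-8")).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True
```

Logging the prompt, retrieved context, tool call, and result as separate events gives a reviewer a sequence they can check end to end.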

Sit with one question before you ship your next “agent”: if a regulator, customer, or incident reviewer asked “why did the system do that?”, do you have an answer that isn’t “the model decided”?

Written by Michael Chang, Editor-at-Large

Michael is ICMD's editor-at-large, covering the intersection of technology, business, and culture. A former technology journalist with 18 years of experience, he has covered the tech industry for publications including Wired, The Verge, and TechCrunch. He brings a journalist's eye for clarity and narrative to complex technology and business topics, making them accessible to founders and operators at every level.
