The most expensive mistake teams still make with LLM products is treating retrieval-augmented generation (RAG) like it’s the product. You’ll hear: “We’re building RAG over our docs.” That’s not a strategy. That’s table stakes plumbing — and it’s quickly commoditizing.
What’s replacing it is less comfortable: runtime context engineering and tool contracts. The competitive edge is shifting from “can you retrieve passages?” to “can you reliably compose actions, permissions, and state across messy systems — and prove it with evals?”
That shift is already visible in public product moves: OpenAI pushing function calling and the Assistants API concept into mainstream developer workflows; Anthropic centering tool use and long-context reasoning; Google shipping Gemini models tightly integrated with Workspace; Microsoft embedding Copilot across Microsoft 365; Amazon wiring generative experiences into AWS with Bedrock and Agents for Amazon Bedrock; and open-source ecosystems (like LangChain and LlamaIndex) moving from “RAG frameworks” toward agent orchestration, tracing, and evaluation integrations.
RAG solved the wrong problem — and then everyone copied it
RAG was a rational response to a real constraint: LLMs don’t know your private data and they hallucinate. The early play was: index documents, retrieve top-k chunks, stuff them into context, ask the model to answer “grounded” in those chunks. For a while, that worked well enough to ship.
But RAG has two structural limits that don’t go away with more embeddings:
First: retrieval is not the same thing as “using your business.” Most valuable workflows aren’t Q&A. They’re actions: change a price, renew a contract, re-route a shipment, open a Jira ticket, grant a refund, push a config, generate an invoice, escalate an on-call incident. Those require tool execution, permissioning, audit logs, and deterministic constraints.
Second: long-context models and better instruction following reduce the perceived pain of missing knowledge, which means the differentiator moves elsewhere. Models from OpenAI, Anthropic, Google, and others have pushed context windows up over time; the market response has been predictable: teams stuff more into context and call it a day. It works until it doesn’t — and “doesn’t” usually means a subtle failure in a real workflow.
RAG makes demos look smart. Tool contracts make products safe.
Operators feel this in production as a recurring pattern: the model answers correctly in a sandbox, then fails on edge cases where the business actually bleeds — stale entitlements, conflicting records, odd calendar exceptions, partial refunds, multi-entity permissions, regional tax rules, rate limits, and idempotency. Retrieval didn’t fix those. It never could.
The 2026 wedge: runtime context, not static knowledge
Founders keep asking, “How do we get the model to know our business?” The better question is: “How do we get the model to operate our business safely?” That’s a runtime problem, not a knowledge problem.
Runtime context is everything the model needs at the moment of action — not a doc dump. It includes identity, entitlements, current state, recent events, and the narrowest possible slice of data required to decide the next step. Think: a structured bundle with explicit provenance.
Three context layers that matter (and one that doesn’t)
1) Identity + permissions: Who is asking? What are they allowed to do? This is where most “agent” products get reckless. If your LLM can trigger workflows without a strong permission boundary, you don’t have an AI feature — you have a new attack surface.
2) Operational state: The truth is in systems of record, not in PDFs. The current subscription status, the current inventory level, the current incident severity, the current account owner — these should arrive as structured fields pulled at runtime, not via fuzzy retrieval from documentation.
3) Policies + constraints: The model needs rules expressed in a way that can be checked. Some constraints should be enforced outside the model (e.g., “cannot refund over $X without approval,” “cannot access HR records unless HR role”). Treat the model as fallible and enforce invariants elsewhere.
The layer that doesn’t matter as much as people think: a giant general-purpose embedding index of “all company docs.” You still need search. You still need retrieval. But once every vendor has decent embedding models and vector search, your index is not your moat.
Key Takeaway
In 2026, “context” that isn’t tied to identity, permissions, and system-of-record state is mostly theater. Build a runtime that can fetch, constrain, and audit — then let the model reason inside that box.
Tool contracts are the new API design problem
Function calling (OpenAI popularized the pattern for mainstream developers) turned “prompting” into something closer to programming: the model decides when to call a tool and emits structured arguments. Anthropic, Google, and others have their own tooling patterns, but the direction is consistent: models are being trained to use tools.
Here’s the contrarian point: most teams design tools like they’re designing internal microservices. That’s backwards. You’re designing a contract for a probabilistic caller.
What makes a good tool contract for an LLM
- Few parameters, strongly typed: Every optional field becomes a new failure mode.
- Idempotent by default: Retries will happen. Your tool should tolerate it.
- Clear error semantics: Not “500.” Give the model a stable error code and a human-readable message that tells it what to do next.
- Separation of “dry run” vs “commit”: Let the model preview impact before executing irreversible actions.
- Audit-first outputs: Return what changed, what was read, and what policy gates were checked.
If you implement nothing else: add a dry-run mode and enforce idempotency. That single move saves real money and real incidents.
The stack is reorganizing: models, orchestration, retrieval, tracing, evals
The 2023–2024 “LLM app stack” story was overly centered on RAG frameworks and vector databases. By 2026, the center of gravity is observability and evaluation. Not because it’s sexy — because once your assistant can take actions, you need to know what it did and why.
Teams are converging on a handful of real, public building blocks:
- Model providers: OpenAI, Anthropic, Google, AWS (Bedrock as a gateway to multiple models), and open-source models served via vLLM, TGI, or managed hosts.
- Orchestration frameworks: LangChain and LlamaIndex remain common, increasingly paired with provider-native agent tooling (e.g., Agents for Amazon Bedrock) depending on where the app lives.
- Vector search + hybrid search: Pinecone, Weaviate, Milvus, Elasticsearch/OpenSearch vector capabilities — plus plain keyword search still doing quiet, critical work.
- Tracing and prompt/agent observability: LangSmith (LangChain), Arize Phoenix, Weights & Biases Weave, OpenTelemetry-style tracing patterns adopted into LLM apps.
- Evals and red teaming: Model-graded evals (used carefully), deterministic tests for tool calls, and adversarial prompt suites. Many teams use a mix of open tooling plus internal harnesses.
Table 1: Practical comparison of common LLM app stack components (as used in real deployments)
| Layer | Options (real products) | Where it shines | Watch-outs |
|---|---|---|---|
| Model API | OpenAI, Anthropic, Google Gemini, AWS Bedrock | Fast iteration; strong baseline reasoning; managed infra | Provider quirks; cost controls; data retention settings; model churn |
| Orchestration | LangChain, LlamaIndex, provider-native agents (e.g., Agents for Amazon Bedrock) | Tool routing; memory patterns; connectors; faster prototyping | Abstraction tax; hard-to-debug chains; version drift |
| Retrieval | Pinecone, Weaviate, Milvus, Elasticsearch/OpenSearch vectors | Semantic + hybrid search; scaling indexes; metadata filtering | Chunking pitfalls; stale indexes; permission filtering complexity |
| Tracing/observability | LangSmith, Arize Phoenix, W&B Weave | Debugging tool calls; prompt/version tracking; failure forensics | Sensitive data handling; noisy traces; unclear ownership |
| Evals | Internal harness + open tooling; red-team suites; regression tests | Preventing silent regressions; gating releases; safety checks | Model-graded eval brittleness; reward hacking; dataset staleness |
Stop worshipping “agents.” Start shipping constrained autonomy
“Agent” became a marketing term. In practice, the products that survive are not autonomous. They’re constrained.
Constrained autonomy means: the model can propose plans, select tools, and draft changes — but only within explicit policy gates, deterministic tool contracts, and observable traces. Humans are in the loop where it matters, and out of the loop where it’s safe.
A release pattern that actually holds up in production
- Read-only assistant: search + answer with citations; no actions.
- Draft mode: the assistant generates a proposed action (email, ticket, config diff, SQL) but cannot execute.
- Scoped execution: allow execution only on low-risk tools (create a Jira ticket, schedule a meeting, start a workflow) with tight permissions.
- Policy-gated execution: expand to higher-risk actions with explicit approvals and logging.
- Continuous eval gating: no prompt/model/tool change ships without passing regression tests on your own failure corpus.
This isn’t conservative. It’s how you avoid the predictable failure mode: “We shipped an agent” turning into “We created a compliance incident.”
# Example: tool contract sketch (JSON Schema style) for an LLM-called refund tool
{
"name": "issue_refund",
"description": "Issue a refund for a completed charge. Dry-run supported.",
"parameters": {
"type": "object",
"properties": {
"charge_id": {"type": "string"},
"amount": {"type": "string", "description": "Decimal as string to avoid float errors"},
"currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
"reason": {"type": "string"},
"dry_run": {"type": "boolean", "default": true},
"idempotency_key": {"type": "string"}
},
"required": ["charge_id", "amount", "currency", "reason", "idempotency_key"]
}
}
The unglamorous differentiator: evals tied to business outcomes
Most teams still treat evaluation like a research chore: a few curated prompts, a “looks good” review, then ship. That approach collapses the moment a provider updates a model, your prompt template changes, or your tool API returns a slightly different shape.
By 2026, serious operators are building eval suites the way they build test suites — and they tie them to the exact places the business can get hurt:
- Permission tests: verify the assistant refuses access across roles and tenants.
- Tool correctness: validate tool arguments, idempotency behavior, and error recovery paths.
- Grounding checks: ensure answers cite available sources when required and avoid fabricating specifics.
- Regression on failure corpus: every real incident becomes a test case.
- Prompt injection drills: test “ignore instructions” attacks against your retrieval and tool routing paths.
Tools like Arize Phoenix and LangSmith exist because teams hit this wall. They make it easier to trace, label, and run evals. They don’t remove the hard part: you still need to decide what “correct” means for your business and encode it.
Table 2: A deployment checklist for constrained autonomy (runtime context + tool contracts + eval gating)
| Area | Minimum bar | Evidence to collect | Common failure mode |
|---|---|---|---|
| Identity & auth | Every request bound to a user/role; tenant isolation | Audit log entries include actor, tenant, tool called | “Assistant” has shared super-user credentials |
| Runtime context | Fetch system-of-record state on demand; minimize context | Trace shows sources: APIs called, records read | Stuffing stale docs into context and trusting it |
| Tool contracts | Typed inputs; dry-run; idempotency; explicit errors | Tool call logs with args, results, error codes | Over-flexible tools that accept ambiguous payloads |
| Safety gates | Approval flows for high-risk actions; rate limits | Policy decision records (allow/deny + reason) | Model decides policy instead of code deciding policy |
| Evals & releases | Regression tests gate prompt/model/tool changes | Eval runs tied to git SHAs; failing cases tracked | Silent regressions after “small” prompt tweaks |
A sharp prediction: the moat moves to “governed execution,” not “smart answers”
By 2026, “smart answers” will be cheap. Every product will have a chat box. Every SaaS vendor will ship an assistant trained on help docs and configured with some retrieval. That layer becomes like search: expected, rarely differentiated.
The moat will be governed execution: assistants that can safely operate inside real systems with permissions, auditability, and predictable behavior under failure. The companies that win won’t brag about their vector database. They’ll brag — quietly — about their change management and eval discipline.
If you’re building right now, do one thing this week: pick a single high-frequency workflow where the assistant can propose and execute a small action. Write the tool contract. Add dry-run and idempotency. Then build ten tests from the ugliest edge cases your operators complain about. Ship only after those tests run automatically. If that sounds like “slowing down,” good. You’re finally moving at production speed.
The question worth sitting with: what is the first action your AI can take in your product that you’re willing to audit in court?