Everyone is building “AI features.” That’s not where the durable advantage is.
The durable advantage is building an AI control plane: the product layer that decides who can invoke AI, which model gets called, what data can be touched, how outputs are checked, and where every decision is recorded.
This is the same shift we watched with cloud. The winners didn’t just sprinkle VMs across the org; they built identity (Okta), policies (OPA), observability (Datadog), and cost controls (CloudHealth, FinOps practices) around it. AI is repeating the pattern—faster, with higher blast radius.
The new “platform tax” is paying for AI twice
Most teams in 2026 are already paying for AI twice:
- Once in product: a pile of prompts, a few model SDKs, and some retrieval glued into features.
- Once in operations: security reviews, privacy reviews, legal reviews, and incident response for every new AI workflow.
- Once in vendor sprawl: separate tools for redaction, guardrails, evaluation, logging, and routing.
- Once in people time: engineers debugging nondeterminism, PMs arguing about “quality,” support handling weird outputs, and security chasing data flows.
The repeated cost comes from a missing layer: a single system that treats AI calls as a governed runtime—like payments, auth, or data access—not a bag of libraries.
The control plane: what it is (and what it isn’t)
“Control plane” is an overloaded term, so be strict about scope. This is not a chatbot UI. It’s not “an agent framework.” It’s not a prompt library.
An AI control plane is the productized layer that sits between your applications and your AI providers (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI, open-source models you host). It standardizes the boring parts that become existential under pressure: identity, policy, routing, evaluation, logging, and audit.
The minimum surface area
If your “control plane” doesn’t do these, it’s not a control plane; it’s a convenience wrapper:
- Identity + authorization: service-to-service auth, per-team access, and environment separation.
- Policy enforcement: block or transform requests based on data class, user role, geography, and risk.
- Model routing: pick providers/models per use case, cost envelope, latency target, or safety tier.
- Evaluation + regression: test prompts/workflows against fixed datasets before shipping changes.
- Observability + audit: logs, traces, redaction, retention, and a chain of custody for outputs.
Why now: regulation and procurement finally caught up
The EU AI Act is real law, and it pushed AI risk management into the same category as privacy and security. In the US, the NIST AI Risk Management Framework (AI RMF) became the default language enterprise buyers use to ask uncomfortable questions. Even if you don’t sell into regulated markets, your customers do.
That means your AI stack is no longer “an implementation detail.” It’s part of your product posture. If you can’t answer basic questions—what model produced this, what data was used, what safety checks ran—you don’t have an AI product. You have a demo.
“If you can’t measure it, you can’t improve it.”
This line gets abused, but it applies cleanly here. Without evaluation and audit, “quality” becomes a Slack argument. With it, quality becomes a release gate.
The vendor map: pick your layer, not your favorite logo
Founders keep making the same mistake: they pick a single vendor and expect it to cover everything—models, orchestration, guardrails, evals, logging, compliance. No one vendor does. The useful question is: which layer do you want to own, and which layers do you want to buy?
Table 1: Comparison of common AI control-plane building blocks (real products, qualitative tradeoffs)
| Layer | Representative products | What it’s best at | What it won’t solve |
|---|---|---|---|
| Model API gateway & routing | OpenRouter, LiteLLM, Azure AI Foundry / Azure OpenAI, AWS Bedrock | Unifying model access; failover; basic policy hooks | Product-level evaluation strategy; domain-specific safety; enterprise audit narratives |
| App orchestration | LangChain, LlamaIndex, Microsoft Semantic Kernel | Composing tools, retrieval, and multi-step workflows | Governance across teams; cross-app policy enforcement |
| Observability & tracing | LangSmith, Arize Phoenix, Datadog LLM Observability | Tracing; prompt/version tracking; debugging failures | Hard blocks and redaction at the perimeter; access controls |
| Guardrails & policy | NVIDIA NeMo Guardrails, Guardrails AI, Lakera | Safety filters; jailbreak resistance patterns; content controls | End-to-end auditability; evaluation-as-a-gate in CI/CD by itself |
| Evaluation & test harness | OpenAI Evals, Ragas, DeepEval | Regression testing; quality checks; task-specific scoring | Runtime policy; enterprise logging/retention; access management |
Notice the shape: every category is strong at one thing and weak at the thing procurement and security teams care about most—consistent governance across every AI call.
A contrarian take: agents are a distraction until you can pass an audit
“Agents” are useful. They’re also a fantastic way to hide basic engineering debt under a new word.
An agent that can call tools, browse internal docs, and take actions in third-party systems is just an automated privileged user. Privileged users need controls. If your agent can create a Jira ticket, modify a Salesforce record, or trigger a deploy, you’ve built a production automation system. Treat it like one.
Key Takeaway
If your AI workflow can change state outside your app, the first feature is not “reasoning.” The first feature is an approval gate, an audit log, and a rollback story.
Control plane patterns that actually hold up
Here are patterns that survive contact with enterprise reality:
- Two-tier execution: “draft mode” outputs by default, “commit mode” only after explicit confirmation or policy checks.
- Tool permissions as scopes: treat each tool/function call like an OAuth scope; default to least privilege.
- Data classification at ingestion: label documents/fields (PII, PHI, secrets, regulated) and enforce policy before retrieval.
- Model tiering: small/cheap model for extraction and routing, stronger model only for tasks that need it.
- “No raw logs” rule: store traces with redaction, hashing, or field-level suppression; set retention intentionally.
Make AI shippable: evaluations as a release gate, not a research project
Teams keep treating evals as optional because they sound academic. They’re not. They’re the only way to ship AI changes without playing roulette.
OpenAI open-sourced Evals for a reason: model behavior changes, prompts change, retrieval corpora change, and your product changes. If you don’t pin expected behavior with tests, every deploy is a hidden product rewrite.
What a pragmatic eval suite looks like
You don’t need a massive benchmark. You need a small set of “this must never break” cases that represent your product’s contracts. A good suite has:
- Golden tasks: real inputs with expected outputs (or expected properties).
- Adversarial tasks: prompt injection attempts, policy-violating requests, and edge cases.
- Retrieval sanity checks: ensure the model cites or uses the right sources, not whatever is most semantically similar.
- Tool-call checks: verify the agent calls the correct tools with correct arguments, and doesn’t call forbidden tools.
A CI-shaped interface beats a dashboard-shaped interface
Dashboards are fine for exploration. Shipping requires something stricter: a command that returns pass/fail and artifacts you can inspect in a PR.
# Example: run an eval suite in CI (illustrative command structure)
# The point: one command, deterministic dataset, artifact output.
make eval \
EVAL_SET=golden_and_adversarial \
MODEL_PROVIDER=bedrock \
ARTIFACT_DIR=./artifacts/evals
# CI should fail if thresholds/criteria aren't met
# and upload artifacts (traces, diffs, scored outputs) for review.
The exact toolchain varies (OpenAI Evals, Ragas, DeepEval, bespoke). The product requirement doesn’t: evals must be easy to run, hard to ignore, and visible in the same place as the code change.
One control plane, many models: plan for churn as a product constraint
In 2026, model churn is normal. Providers ship new models, deprecate old ones, change safety behavior, change pricing, change rate limits. If your app talks directly to one provider everywhere, churn becomes a rewrite.
Routing isn’t a cost trick. It’s a product reliability feature.
Table 2: AI control-plane checklist mapped to real operational questions
| Capability | Concrete question it answers | Where it typically lives | Artifact you should be able to produce |
|---|---|---|---|
| Request policy + redaction | “Did any PII/secret leave our boundary?” | Gateway middleware; DLP hooks | Redaction rules; blocked-request logs; retention config |
| Model routing + fallback | “What happens if Provider A is down or rate-limited?” | Gateway/router service | Routing policy; failover traces; provider error reports |
| Prompt/workflow versioning | “Which prompt produced this output?” | Repo + release tags; prompt registry | Immutable version IDs tied to deploys |
| Evaluation gate | “What changed, and did quality regress?” | CI/CD pipeline | Eval report; failing cases; diffs with traces |
| Audit + incident workflow | “Can we reconstruct the chain of events for a bad output?” | Central log store; ticketing integration | Trace timeline; input/output redacted record; escalation ticket |
The product decision most teams avoid: where the boundary sits
You have to choose: is your control plane a shared internal platform (owned by infrastructure) or a product platform (owned by the product org, with infra partnership)? If it’s “everyone’s job,” it’ll be no one’s job. And you’ll keep paying the platform tax in every squad.
The right boundary is usually: infra owns the runtime, identity, and logging substrate; product owns the evaluation contract, safety policy requirements, and user-visible failure modes. That split matches incentives.
Where this goes next: AI controls become a selling point
In 2023–2025, many teams treated safety and governance as a procurement hurdle. In 2026, it’s turning into positioning. Customers are learning the hard way that “we use GPT-4/Claude/Gemini” says nothing about whether your system is controllable.
Expect RFPs to get blunt:
- “Show us your audit log for an AI-generated decision.”
- “Show us how you prevent prompt injection in retrieval workflows.”
- “Show us how you restrict tool use by role.”
- “Show us your evaluation suite and how it blocks regressions.”
If you can answer with artifacts, you win deals. If you answer with vibes, you lose them.
A concrete next move: build the “AI bill of materials” for one workflow
Pick one high-usage AI workflow in your product—support summarization, sales email drafting, code review comments, document Q&A—and produce an “AI bill of materials” for it in a single doc:
- Every external call (provider, endpoint, model), including embeddings.
- Every data source (which tables, which docs, which user fields), with a data-class label.
- Every tool/action the workflow can take, with an explicit allowlist.
- Your eval set (golden + adversarial), stored in the repo.
- Your audit record (what you log, what you redact, retention).
If you can’t write that document for one workflow, you don’t have a control plane problem—you have an ownership problem. Fix that first.
Then ask the question that actually matters for 2026 product teams: Would you trust your current AI stack if a regulator, an enterprise buyer, or a journalist asked you to reproduce a single bad output end-to-end?
Build until the answer is “yes,” with receipts.