Stop Shipping “AI Features.” Ship an AI Control Plane.

Everyone is building “AI features.” That’s not where the durable advantage is.

The durable advantage is building an AI control plane: the product layer that decides who can invoke AI, which model gets called, what data can be touched, how outputs are checked, and where every decision is recorded.

This is the same shift we watched with cloud. The winners didn’t just sprinkle VMs across the org; they built identity (Okta), policies (OPA), observability (Datadog), and cost controls (CloudHealth, FinOps practices) around it. AI is repeating the pattern—faster, with higher blast radius.

The new “platform tax” is paying for AI twice

Most teams in 2026 are already paying for AI twice:

Once in product: a pile of prompts, a few model SDKs, and some retrieval glued into features.
Once in operations: security reviews, privacy reviews, legal reviews, and incident response for every new AI workflow.
Once in vendor sprawl: separate tools for redaction, guardrails, evaluation, logging, and routing.
Once in people time: engineers debugging nondeterminism, PMs arguing about “quality,” support handling weird outputs, and security chasing data flows.

The repeated cost comes from a missing layer: a single system that treats AI calls as a governed runtime—like payments, auth, or data access—not a bag of libraries.

operators monitoring production systems and dashboards — AI in production behaves like a distributed system: you need controls, not just clever prompts.

The control plane: what it is (and what it isn’t)

“Control plane” is an overloaded term, so be strict about scope. This is not a chatbot UI. It’s not “an agent framework.” It’s not a prompt library.

An AI control plane is the productized layer that sits between your applications and your AI providers (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI, open-source models you host). It standardizes the boring parts that become existential under pressure: identity, policy, routing, evaluation, logging, and audit.

The minimum surface area

If your “control plane” doesn’t do these, it’s not a control plane; it’s a convenience wrapper:

Identity + authorization: service-to-service auth, per-team access, and environment separation.
Policy enforcement: block or transform requests based on data class, user role, geography, and risk.
Model routing: pick providers/models per use case, cost envelope, latency target, or safety tier.
Evaluation + regression: test prompts/workflows against fixed datasets before shipping changes.
Observability + audit: logs, traces, redaction, retention, and a chain of custody for outputs.

Why now: regulation and procurement finally caught up

The EU AI Act is real law, and it pushed AI risk management into the same category as privacy and security. In the US, the NIST AI Risk Management Framework (AI RMF) became the default language enterprise buyers use to ask uncomfortable questions. Even if you don’t sell into regulated markets, your customers do.

That means your AI stack is no longer “an implementation detail.” It’s part of your product posture. If you can’t answer basic questions—what model produced this, what data was used, what safety checks ran—you don’t have an AI product. You have a demo.

“If you can’t measure it, you can’t improve it.”

—Peter Drucker

This line gets abused, but it applies cleanly here. Without evaluation and audit, “quality” becomes a Slack argument. With it, quality becomes a release gate.

The vendor map: pick your layer, not your favorite logo

Founders keep making the same mistake: they pick a single vendor and expect it to cover everything—models, orchestration, guardrails, evals, logging, compliance. No one vendor does. The useful question is: which layer do you want to own, and which layers do you want to buy?

Table 1: Comparison of common AI control-plane building blocks (real products, qualitative tradeoffs)

Layer	Representative products	What it’s best at	What it won’t solve
Model API gateway & routing	OpenRouter, LiteLLM, Azure AI Foundry / Azure OpenAI, AWS Bedrock	Unifying model access; failover; basic policy hooks	Product-level evaluation strategy; domain-specific safety; enterprise audit narratives
App orchestration	LangChain, LlamaIndex, Microsoft Semantic Kernel	Composing tools, retrieval, and multi-step workflows	Governance across teams; cross-app policy enforcement
Observability & tracing	LangSmith, Arize Phoenix, Datadog LLM Observability	Tracing; prompt/version tracking; debugging failures	Hard blocks and redaction at the perimeter; access controls
Guardrails & policy	NVIDIA NeMo Guardrails, Guardrails AI, Lakera	Safety filters; jailbreak resistance patterns; content controls	End-to-end auditability; evaluation-as-a-gate in CI/CD by itself
Evaluation & test harness	OpenAI Evals, Ragas, DeepEval	Regression testing; quality checks; task-specific scoring	Runtime policy; enterprise logging/retention; access management

Notice the shape: every category is strong at one thing and weak at the thing procurement and security teams care about most—consistent governance across every AI call.

team discussing architecture choices and tradeoffs — If your AI stack diagram looks like a bowl of spaghetti, your cost and risk will follow the same pattern.

A contrarian take: agents are a distraction until you can pass an audit

“Agents” are useful. They’re also a fantastic way to hide basic engineering debt under a new word.

An agent that can call tools, browse internal docs, and take actions in third-party systems is just an automated privileged user. Privileged users need controls. If your agent can create a Jira ticket, modify a Salesforce record, or trigger a deploy, you’ve built a production automation system. Treat it like one.

Key Takeaway

If your AI workflow can change state outside your app, the first feature is not “reasoning.” The first feature is an approval gate, an audit log, and a rollback story.

Control plane patterns that actually hold up

Here are patterns that survive contact with enterprise reality:

Two-tier execution: “draft mode” outputs by default, “commit mode” only after explicit confirmation or policy checks.
Tool permissions as scopes: treat each tool/function call like an OAuth scope; default to least privilege.
Data classification at ingestion: label documents/fields (PII, PHI, secrets, regulated) and enforce policy before retrieval.
Model tiering: small/cheap model for extraction and routing, stronger model only for tasks that need it.
“No raw logs” rule: store traces with redaction, hashing, or field-level suppression; set retention intentionally.

Make AI shippable: evaluations as a release gate, not a research project

Teams keep treating evals as optional because they sound academic. They’re not. They’re the only way to ship AI changes without playing roulette.

OpenAI open-sourced Evals for a reason: model behavior changes, prompts change, retrieval corpora change, and your product changes. If you don’t pin expected behavior with tests, every deploy is a hidden product rewrite.

What a pragmatic eval suite looks like

You don’t need a massive benchmark. You need a small set of “this must never break” cases that represent your product’s contracts. A good suite has:

Golden tasks: real inputs with expected outputs (or expected properties).
Adversarial tasks: prompt injection attempts, policy-violating requests, and edge cases.
Retrieval sanity checks: ensure the model cites or uses the right sources, not whatever is most semantically similar.
Tool-call checks: verify the agent calls the correct tools with correct arguments, and doesn’t call forbidden tools.

A CI-shaped interface beats a dashboard-shaped interface

Dashboards are fine for exploration. Shipping requires something stricter: a command that returns pass/fail and artifacts you can inspect in a PR.

# Example: run an eval suite in CI (illustrative command structure)
# The point: one command, deterministic dataset, artifact output.

make eval \
  EVAL_SET=golden_and_adversarial \
  MODEL_PROVIDER=bedrock \
  ARTIFACT_DIR=./artifacts/evals

# CI should fail if thresholds/criteria aren't met
# and upload artifacts (traces, diffs, scored outputs) for review.

The exact toolchain varies (OpenAI Evals, Ragas, DeepEval, bespoke). The product requirement doesn’t: evals must be easy to run, hard to ignore, and visible in the same place as the code change.

developer workflow with code review and checks — If AI changes don’t show up as checks in code review, they’ll ship without accountability.

One control plane, many models: plan for churn as a product constraint

In 2026, model churn is normal. Providers ship new models, deprecate old ones, change safety behavior, change pricing, change rate limits. If your app talks directly to one provider everywhere, churn becomes a rewrite.

Routing isn’t a cost trick. It’s a product reliability feature.

Table 2: AI control-plane checklist mapped to real operational questions

Capability	Concrete question it answers	Where it typically lives	Artifact you should be able to produce
Request policy + redaction	“Did any PII/secret leave our boundary?”	Gateway middleware; DLP hooks	Redaction rules; blocked-request logs; retention config
Model routing + fallback	“What happens if Provider A is down or rate-limited?”	Gateway/router service	Routing policy; failover traces; provider error reports
Prompt/workflow versioning	“Which prompt produced this output?”	Repo + release tags; prompt registry	Immutable version IDs tied to deploys
Evaluation gate	“What changed, and did quality regress?”	CI/CD pipeline	Eval report; failing cases; diffs with traces
Audit + incident workflow	“Can we reconstruct the chain of events for a bad output?”	Central log store; ticketing integration	Trace timeline; input/output redacted record; escalation ticket

The product decision most teams avoid: where the boundary sits

You have to choose: is your control plane a shared internal platform (owned by infrastructure) or a product platform (owned by the product org, with infra partnership)? If it’s “everyone’s job,” it’ll be no one’s job. And you’ll keep paying the platform tax in every squad.

The right boundary is usually: infra owns the runtime, identity, and logging substrate; product owns the evaluation contract, safety policy requirements, and user-visible failure modes. That split matches incentives.

Where this goes next: AI controls become a selling point

In 2023–2025, many teams treated safety and governance as a procurement hurdle. In 2026, it’s turning into positioning. Customers are learning the hard way that “we use GPT-4/Claude/Gemini” says nothing about whether your system is controllable.

Expect RFPs to get blunt:

“Show us your audit log for an AI-generated decision.”
“Show us how you prevent prompt injection in retrieval workflows.”
“Show us how you restrict tool use by role.”
“Show us your evaluation suite and how it blocks regressions.”

If you can answer with artifacts, you win deals. If you answer with vibes, you lose them.

security and compliance review meeting — Governance isn’t paperwork; it’s the difference between shipping AI and being forced to turn it off.

A concrete next move: build the “AI bill of materials” for one workflow

Pick one high-usage AI workflow in your product—support summarization, sales email drafting, code review comments, document Q&A—and produce an “AI bill of materials” for it in a single doc:

Every external call (provider, endpoint, model), including embeddings.
Every data source (which tables, which docs, which user fields), with a data-class label.
Every tool/action the workflow can take, with an explicit allowlist.
Your eval set (golden + adversarial), stored in the repo.
Your audit record (what you log, what you redact, retention).

If you can’t write that document for one workflow, you don’t have a control plane problem—you have an ownership problem. Fix that first.

Then ask the question that actually matters for 2026 product teams: Would you trust your current AI stack if a regulator, an enterprise buyer, or a journalist asked you to reproduce a single bad output end-to-end?

Build until the answer is “yes,” with receipts.