The leadership failure I keep seeing isn’t “we didn’t adopt AI fast enough.” It’s worse: companies adopted AI everywhere and kept their old accountability map. The result is a new org chart where nobody is responsible for what the system actually says and does.
If your product uses ChatGPT, Claude, Gemini, or Llama-based services in production—support, sales, onboarding, coding, search, trust & safety—you’ve inserted a decision-maker that doesn’t show up on payroll. Leaders are acting like that’s a tooling change. It’s a governance change.
OpenAI’s November 2023 board crisis made this visible in public: governance and accountability can be the product. If your company depends on foundation models, your leadership job now includes model risk, vendor risk, and traceability. Treating this like “an engineering implementation detail” is how operators get surprised in the worst possible way.
The new org chart: people who ship vs. people who sign
Most orgs have a clean story for ownership: engineering ships, product decides, legal reviews, security blocks, leadership signs. Generative AI breaks that because the system’s output is probabilistic and the supplier can change behavior via model updates, safety layers, or product policy without your sprint even moving.
This is not hypothetical. OpenAI, Anthropic, Google, and Meta iterate model behavior continuously. Even if you pin a model version, your application still depends on prompt templates, retrieval data, tool-calling rules, and policy filters. Those layers evolve, and they can create user-visible changes that look like “product decisions” but arrive through “platform updates.”
Unattributed but true: if you can’t explain who is accountable for an AI decision, you don’t have a system—you have an alibi.
The contrarian position: stop calling it “AI enablement.” Call it “decision infrastructure.” Then staff it like it matters.
Vendor models didn’t kill accountability—leaders did by outsourcing it
“We use OpenAI/Anthropic/Google so it’s their problem” is leadership malpractice. You can outsource infrastructure; you can’t outsource responsibility. If your AI agent refunds a customer, blocks an account, rewrites a contract clause, or generates medical guidance, your company owns that outcome.
The operational reality is that foundation models are now upstream dependencies like AWS, with a twist: they emit text and actions that look like your company speaking. When AWS has an outage, customers blame AWS, and your status page can point upstream. When your model says something wrong, customers blame you.
What ownership actually means
- You own the policy boundary: what tasks the model is allowed to do, not just what it can do.
- You own the data boundary: what the model can see (RAG corpora, tools, connectors) and what it must never touch.
- You own the audit trail: prompts, tool calls, retrieved documents, and outputs tied to user actions.
- You own the rollback story: how you disable features or fall back when behavior drifts or vendors change.
- You own the incident response: an “AI incident” deserves the same rigor as a security incident.
If this sounds like security thinking, good. AI risk is security-adjacent: it’s about unintended behavior at scale.
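The policy and data boundaries above are easiest to own when they are written down as data instead of living in a prompt. Here is a minimal sketch, assuming a hypothetical `Boundary` record per workflow; the tool names, source names, and owner string are illustrative, not taken from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    """Hypothetical policy + data boundary for one AI workflow."""
    allowed_tools: frozenset[str]    # policy boundary: what the model may do
    allowed_sources: frozenset[str]  # data boundary: what the model may read
    owner: str                       # the team that gets paged for this workflow


def check_request(boundary: Boundary, tool: str, sources: list[str]) -> list[str]:
    """Return a list of violations; an empty list means the call is in bounds."""
    violations = []
    if tool not in boundary.allowed_tools:
        violations.append(f"tool not allowed: {tool}")
    for source in sources:
        if source not in boundary.allowed_sources:
            violations.append(f"source not allowed: {source}")
    return violations


# Illustrative values only.
support_bot = Boundary(
    allowed_tools=frozenset({"lookup_order", "draft_reply"}),
    allowed_sources=frozenset({"help_center", "order_db"}),
    owner="support-platform-oncall",
)
```

A check like this would run before every tool call, and anything it returns belongs in the audit trail. The `owner` field is the point: every boundary names someone.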
Table 1: Common LLM platform options and what they imply for leadership accountability
| Platform | Control surface | Operational strengths | Accountability traps |
|---|---|---|---|
| OpenAI API | Hosted models; tool calling; system prompts | Strong ecosystem; broad model availability | Behavior shifts feel like “vendor changes,” but customers read them as your brand voice |
| Anthropic API (Claude) | Hosted models; strong instruction following; tool use | Clear safety posture; strong long-context use cases | Teams over-trust “safe” defaults and skip their own policy + logging |
| Google Gemini API / Vertex AI | Model hosting + enterprise controls in Google Cloud | Enterprise governance hooks; integration with GCP | Cloud org politics can bury model ownership inside platform teams |
| Azure OpenAI Service | OpenAI models via Azure; enterprise procurement patterns | Easier enterprise buying; Azure policy controls | False sense of “Microsoft handles it” while app teams still ship the behavior |
| Self-hosted open models (e.g., Llama) | Full stack control; weights + serving + fine-tuning | Predictable rollouts; deeper customization; data locality | You inherit everything: safety, evals, abuse monitoring, and on-call burden |
Stop debating “AI ethics.” Start running “AI incidents.”
“Ethics” discussions often turn into a safe place where nothing ships and nobody is accountable. Real leadership uses operational muscle: incident response, postmortems, and control limits.
There’s a reason the most durable management inventions in tech are operational: SRE error budgets, blameless postmortems, security severity levels. Apply that thinking to AI, and not as theater: users will trigger edge cases on day one, and model behavior will drift over time.
Key Takeaway
If you can’t page a human for a bad model decision, your company is running an unowned production system.
What an “AI incident” looks like in practice
It’s not just hallucinations. It’s any case where model output materially changes a user’s outcome or the company’s risk: unauthorized actions via tool calls, prompt injection that exfiltrates data, harassment slipping through, compliance language going off-script, or customer support issuing wrong refunds.
You don’t need exotic infrastructure to start. You need clear severity levels, logging that captures the right context, and the authority to shut off automation.
Minimal “AI incident bundle” you should be able to export per request (store securely; redact secrets; tie to trace IDs):

```json
{
  "trace_id": "...",
  "user_id": "...",
  "model": "provider/model-version",
  "system_prompt": "...",
  "messages": ["..."],
  "retrieved_docs": [{"id": "...", "source": "..."}],
  "tool_calls": [{"tool": "...", "args": "...", "result": "..."}],
  "output": "...",
  "policy_flags": ["..."],
  "timestamp": "..."
}
```
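One way to assemble that bundle at request time is sketched below. It assumes tool-call `args` and `result` are already strings, and the redaction pattern is a placeholder, not a complete secret detector; the function name `build_incident_bundle` is hypothetical.

```python
import json
import re
from datetime import datetime, timezone

# Placeholder pattern (an assumption): prefixes like sk-, key_, token- followed
# by 8+ alphanumerics. Real deployments need their own redaction rules.
SECRET_PATTERN = re.compile(r"(sk|key|token)[-_][A-Za-z0-9]{8,}")


def redact(text: str) -> str:
    """Replace anything that looks like a secret with a fixed marker."""
    return SECRET_PATTERN.sub("[REDACTED]", text)


def build_incident_bundle(trace_id, user_id, model, system_prompt, messages,
                          retrieved_docs, tool_calls, output, policy_flags):
    """Build one exportable record per request, redacting free-text fields."""
    return {
        "trace_id": trace_id,
        "user_id": user_id,
        "model": model,
        "system_prompt": redact(system_prompt),
        "messages": [redact(m) for m in messages],
        "retrieved_docs": retrieved_docs,
        "tool_calls": [
            {"tool": c["tool"], "args": redact(c["args"]), "result": redact(c["result"])}
            for c in tool_calls
        ],
        "output": redact(output),
        "policy_flags": policy_flags,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def export(bundle: dict) -> str:
    """Serialize the bundle for storage or handoff during an incident."""
    return json.dumps(bundle, indent=2)
```

Redacting at write time rather than at export time is a design choice: it keeps secrets out of the log store entirely, at the cost of losing them for forensics.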
Evaluation theater is everywhere. Leaders need evals that can block releases.
By 2026, “we ran some evals” is as meaningless as “we ran some tests.” Tests only matter when they gate shipping. Same for model evals.
The leadership move is to insist on an eval suite that maps to your business risks, not generic benchmarks. MMLU and similar academic tests don’t tell you whether your agent will wire money to the wrong vendor or whether your support bot will mishandle a chargeback. Your evals should look like your incident taxonomy.
What to gate on
- Tool safety: can the model call restricted tools, or call allowed tools with unsafe parameters?
- Data boundary adherence: does it reveal sensitive internal docs when prompted?
- Policy compliance: does it follow your “must say / must not say” rules in regulated contexts?
- Retrieval grounding: does it cite retrieved sources and refuse when sources don’t support the claim?
- Behavior under attack: prompt injection, jailbreak attempts, and adversarial user instructions.
Leaders should push a simple standard: if a model touches money, identity, or access control, it doesn’t ship without gating evals and an off-switch.
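A gating eval is just a check whose failure stops the pipeline. The sketch below shows the shape, assuming eval results arrive as pass counts per suite; the tier names and threshold numbers are placeholders you would set from your own incident taxonomy.

```python
# Hypothetical minimum pass rates per risk tier; the numbers are placeholders.
THRESHOLDS = {"high": 1.0, "medium": 0.98, "low": 0.90}


def gate(results: dict[str, tuple[str, int, int]]) -> list[str]:
    """results maps suite name -> (risk tier, passed, total).

    Returns one message per suite that falls below its tier's threshold.
    """
    failures = []
    for suite, (tier, passed, total) in results.items():
        # An empty suite counts as a failure: no evidence is not a pass.
        rate = passed / total if total else 0.0
        if rate < THRESHOLDS[tier]:
            failures.append(f"{suite}: {rate:.1%} < {THRESHOLDS[tier]:.1%} ({tier})")
    return failures


def exit_code(results: dict[str, tuple[str, int, int]]) -> int:
    """Print failures and return a process exit code for CI (0 = ship, 1 = block)."""
    failures = gate(results)
    for line in failures:
        print("EVAL GATE FAILED:", line)
    return 1 if failures else 0
```

In CI this would end with something like `sys.exit(exit_code(results))`, so a single failed tool-safety case on a high-risk tier blocks the release the same way a failing unit test does.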
Table 2: A practical AI decision-gating checklist for leaders
| Gate | What you require | Owner | Risk if missing (treat as a hard stop) |
|---|---|---|---|
| Traceability | Prompts, retrieval context, tool calls, and outputs tied to a trace ID | Eng + Security | No audit trail for harmful output or disputed action |
| Permissioning | Explicit allowlist of tools + scoped credentials + rate limits | Platform + Security | Model can take irreversible actions without human review |
| Evals as gates | Risk-based eval suite runs in CI; thresholds defined per risk tier | Eng + Product | No automated regression detection for policy and safety |
| Fallback mode | Human handoff, deterministic flows, or read-only mode | Product + Support | No safe degradation when model is wrong or unavailable |
| Kill switch | Feature flag that disables automation without redeploy | On-call Eng | Can’t stop damage during an incident |
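The kill switch and fallback rows can be as small as a flag check in front of the agent. This is a sketch under loud assumptions: `FLAGS` is an in-memory stand-in for whatever feature-flag service you actually run, and the flag name and `handle` function are hypothetical.

```python
from typing import Callable

# Stand-in for a real feature-flag service (assumption); flags live in memory here.
FLAGS = {"support_agent_automation": True}


def kill(flag: str) -> None:
    """Flip the switch: disable automation without a redeploy."""
    FLAGS[flag] = False


def handle(request: str, agent: Callable[[str], str],
           flag: str = "support_agent_automation") -> dict:
    """Route a request to the agent, or to a human when automation is off."""
    # Fail closed: an unknown or missing flag means no automation.
    if not FLAGS.get(flag, False):
        return {"mode": "human_handoff", "reply": None}
    try:
        return {"mode": "automated", "reply": agent(request)}
    except Exception:
        # Fallback mode: if the agent call fails, degrade to a human.
        return {"mode": "human_handoff", "reply": None}
```

The detail that matters is the default. If the flag service is down or the flag is missing, this version hands off to a human instead of letting the agent run.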
The leadership shift: from “managing teams” to “managing decision rights”
Classic leadership advice says to delegate. AI tempts leaders to delegate decisions they shouldn’t: pricing exceptions, policy enforcement, hiring screens, security triage. This isn’t about fear. It’s about decision rights: which decisions must stay human, which can be automated with review, and which can be fully automated.
Founders and operators should write this down and treat it like an API contract. Not a vibe. A contract.
A blunt classification that works
- Reversible decisions (low cost to undo): allow more automation, measure outcomes, keep fallbacks.
- Hard-to-reverse decisions (account bans, refunds at scale, contract language): require human review or strong constraints.
- Irreversible decisions (wire transfers, key rotation, deleting data): keep humans in control; AI can draft, never execute.
This sounds obvious until you watch teams quietly let agents “just do the thing” because it demos well. Demos are not governance.
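Writing the classification down "like an API contract" can be taken literally. The sketch below encodes the three tiers as a lookup the agent runtime consults before acting; the action names are illustrative, and the registry itself is an assumption about how you might store it.

```python
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    HARD_TO_REVERSE = "hard_to_reverse"
    IRREVERSIBLE = "irreversible"


class Mode(Enum):
    AUTOMATE = "automate"          # run, measure outcomes, keep fallbacks
    HUMAN_REVIEW = "human_review"  # human approves before execution
    DRAFT_ONLY = "draft_only"      # AI can draft, never execute


# The contract: one automation mode per reversibility tier.
DECISION_RIGHTS = {
    Reversibility.REVERSIBLE: Mode.AUTOMATE,
    Reversibility.HARD_TO_REVERSE: Mode.HUMAN_REVIEW,
    Reversibility.IRREVERSIBLE: Mode.DRAFT_ONLY,
}

# Hypothetical action registry; every action the agent can take is classified here.
ACTIONS = {
    "tag_ticket": Reversibility.REVERSIBLE,
    "ban_account": Reversibility.HARD_TO_REVERSE,
    "wire_transfer": Reversibility.IRREVERSIBLE,
}


def mode_for(action: str) -> Mode:
    """Look up the allowed mode; unclassified actions get the strictest one."""
    return DECISION_RIGHTS[ACTIONS.get(action, Reversibility.IRREVERSIBLE)]
```

Treating unknown actions as irreversible means a new tool can't quietly gain execution rights just because nobody classified it.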
A prediction worth planning around: “AI governance” becomes a product feature customers buy
Security was a back-office concern until cloud made it board-level. AI will follow the same path. Customers will ask: Can you show me how the model made that decision? Can you prove it didn’t train on my data? Can you disable certain behaviors? Can you keep a stable version?
Enterprises already evaluate vendors on SOC 2 reports, SSO support, and data residency. Expect equivalent scrutiny for AI features: audit logs for model actions, retention controls for prompts, and clear statements about what data is used where. The companies that win won’t have the flashiest agent demos; they’ll have the cleanest accountability story.
Here’s the concrete next action: pick one production AI workflow this week and run a tabletop incident. Not a meeting about “AI safety.” A real drill. Who gets paged? Where are the logs? Who can flip the kill switch? If you can’t answer in minutes, your leadership problem isn’t AI. It’s ownership.