Your org has a ghost contributor—shipping work nobody can defend
You can tell when an AI rollout is going off the rails by the phrases that start showing up in reviews: “Copilot wrote it,” “the agent handled it,” “the model approved it.” That’s not a cute workflow update. It’s a new contributor slipping into production without any of the accountability rules you’d demand from a human.
AI didn’t enter through a formal onboarding path. It arrived as browser tabs, IDE extensions, Slack bots, and homegrown scripts that touch specs, tickets, code, and customer comms. GitHub Copilot normalized code generation, and “agentic” tooling pushed teams toward multi-step automation. The debate isn’t whether it speeds things up. The debate is whether you can reconstruct what happened and who owned the decision.
AI increases output volume, but it also increases variance. It can produce a clean-looking patch that sails through a tired review and quietly breaks a hard-earned invariant. It can draft an incident update that sounds precise while skipping the one missing datapoint that changes the diagnosis. Across an org, variance becomes friction: security clamps down, support sounds generic, and docs get confident while drifting away from truth.
“AI leadership” looks a lot like the moment teams stopped treating deploys as artisanal craft. CI/CD didn’t win because everyone got more careful. It won because teams built defaults—gates, logs, budgets, and incident habits—that made the safe path the easy path. Do that again for AI. If you run it like infrastructure—metered, monitored, auditable—you get speed without turning every month into cleanup.
AI can scale judgment only after you standardize how work is proposed, checked, and shipped. Otherwise you get plausible output floating around with no owner.
Make AI usage reconstructable: provenance, traces, and decision history
Cloud got manageable once teams demanded observability: access controls, logs, budgets, SLOs, and incident response. AI needs the same bargain. “Use it responsibly” is not a control surface.
If leadership can’t answer basic questions—where AI runs, what it can touch, which model/version produced an artifact, and what breaks more often after AI assistance—you don’t have governance. You have wishful thinking.
Prompts and outputs aren’t just chat. Treat AI interactions as first-class artifacts. Capture metadata (you often don’t need full prompt text): who invoked it, which tool, model and version, repo or workflow, timestamp, and what happened next—file created, function edited, PR opened, comment posted, ticket created, message drafted, deployment triggered. That’s not “employee surveillance.” It’s chain-of-custody for work that can change customer experience, risk, and revenue.
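As a rough illustration, the metadata listed above fits in a small record that is written as one log line per interaction. This is a minimal sketch, not a standard: the class name, field names, and example values are all assumptions, and the prompt text is deliberately left out.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AIProvenanceRecord:
    """Metadata for one AI interaction. All field names are hypothetical."""
    invoked_by: str         # named human or service account that triggered the call
    tool: str               # e.g. "ide-assistant" or "slack-bot" (illustrative labels)
    model: str
    model_version: str
    workflow: str           # repo or workflow the interaction belongs to
    downstream_action: str  # what happened next: "pr_opened", "ticket_created", ...
    artifact_ref: str       # pointer to the resulting artifact (PR, ticket, message)
    timestamp: str          # ISO 8601, UTC


def make_record(invoked_by, tool, model, model_version, workflow,
                downstream_action, artifact_ref, now=None):
    """Build a record, stamping it with the current UTC time unless one is given."""
    now = now or datetime.now(timezone.utc)
    return AIProvenanceRecord(invoked_by, tool, model, model_version, workflow,
                              downstream_action, artifact_ref, now.isoformat())


def to_log_line(record):
    """Serialize with sorted keys so log lines are easy to diff and query."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    record = make_record("alice", "ide-assistant", "example-model", "2026-01",
                         "payments-api", "pr_opened", "PR-123")
    print(to_log_line(record))
```

Storing a pointer to the artifact rather than the prompt keeps the record useful for chain-of-custody without turning the log into a second copy of sensitive text.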
Large vendors moved here because buyers forced it. GitHub Copilot Business ships with organization policies because big companies refuse unmanaged code generation. Microsoft positions Copilot with tenant-level admin controls across its stack. Agent frameworks pushed hard on tracing and replay for one simple reason: tool-using systems fail in ways you can’t debug from the final output.
The cultural shift is the payoff. You stop arguing about vibes and start arguing about reliability. Instead of “AI makes us faster,” you can say “in this workflow, AI-assisted changes correlate with more rework—so we’re tightening tests and raising review requirements for this class of change.”
Kill the blame vacuum: AI never owns the outcome
The quickest way to rot a culture is to normalize plausible deniability. The moment “the model did it” becomes an acceptable postmortem line, quality collapses. The rule that holds up under stress is boring and effective: AI proposes; a named human owns.
This isn’t anti-AI. It’s how operations, auditors, and regulators think: accountability attaches to a person or role, not a tool.
Put AI in your RACI as tooling, not a coworker
RACI becomes useful again when you stop pretending assistants have agency. Put AI in the matrix as something that can execute steps, never something that can be accountable. Example: during incident response, an agent can be Responsible for pulling logs and drafting a timeline, while the Incident Commander remains Accountable for correctness and decisions. In discovery work, AI can be Responsible for clustering feedback, while the PM stays Accountable for prioritization and the narrative.
The goal is to delete the gray zone where everyone assumes someone else verified the output.
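One way to make the rule mechanical is a check that runs whenever the matrix changes. The sketch below assumes a simple dictionary layout for the RACI data; the function name, the "human"/"tool" labels, and the actor names are invented for illustration.

```python
HUMAN = "human"
TOOL = "tool"


def validate_raci(assignments, actors):
    """Return a list of problems; an empty list means the matrix passes.

    assignments: task -> {"R": [names], "A": [names]}  (hypothetical layout)
    actors:      name -> HUMAN or TOOL
    """
    problems = []
    for task, roles in assignments.items():
        accountable = roles.get("A", [])
        # Exactly one Accountable owner per task removes the gray zone.
        if len(accountable) != 1:
            problems.append(
                f"{task}: needs exactly one Accountable owner, found {len(accountable)}"
            )
        # Tools may be Responsible for steps, but never Accountable.
        for name in accountable:
            if actors.get(name) != HUMAN:
                problems.append(f"{task}: Accountable '{name}' is not a named human")
    return problems


if __name__ == "__main__":
    actors = {"incident_commander": HUMAN, "log_agent": TOOL}
    matrix = {"incident_timeline": {"R": ["log_agent"], "A": ["incident_commander"]}}
    print(validate_raci(matrix, actors))  # prints []
```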
Upgrade “review” so fluency stops tricking you
AI output is often well-written. That’s exactly why shallow eyeballing fails. Match review rigor to risk: generated migrations require test evidence; security-sensitive diffs require static analysis plus explicit human approval; auth and billing paths require tighter maintainership rules. None of this is new. AI just increases the volume and confidence of changes, so leaders have to reassert discipline.
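A risk-matched review rule can be written down as data rather than left to habit. The following is a sketch under assumptions: the change-class names and evidence labels are made up, and "code_owner_approval" is only one possible reading of "tighter maintainership rules".

```python
# Evidence every change needs, regardless of class (assumed baseline).
DEFAULT_REQUIREMENTS = {"human_review"}

# Extra evidence per change class. All labels are hypothetical.
REVIEW_POLICY = {
    "generated_migration": {"test_evidence"},
    "security_sensitive": {"static_analysis", "explicit_human_approval"},
    "auth_or_billing": {"code_owner_approval"},
}


def required_evidence(change_classes):
    """Union of the baseline and every class-specific requirement."""
    required = set(DEFAULT_REQUIREMENTS)
    for change_class in change_classes:
        required |= REVIEW_POLICY.get(change_class, set())
    return required


def missing_evidence(change_classes, evidence):
    """Evidence still needed before the change can merge."""
    return required_evidence(change_classes) - set(evidence)


if __name__ == "__main__":
    gaps = missing_evidence(["security_sensitive"], ["human_review", "static_analysis"])
    print(sorted(gaps))  # prints ['explicit_human_approval']
```

A merge check that blocks while `missing_evidence` is non-empty makes the rigor independent of how polished the diff looks.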
“Trust, but verify.” (a Russian proverb that Ronald Reagan popularized in English during Cold War arms-control talks)
Watch the quieter failure mode: output inflation. More docs, more tickets, more PRs—without movement in retention, reliability, or revenue. Don’t solve that by banning tools. Solve it by tying AI-enabled throughput to the outcomes that matter and treating everything else as exhaust.
Security and compliance: treat every prompt like data leaving the building
Security teams assume AI will touch sensitive material: source code, support tickets, incident notes, contracts, roadmaps. The old advice—“don’t paste secrets into chat”—doesn’t scale. The stance that scales is zero-trust prompting: treat every model interaction as data egress unless you designed it not to be.
This matches where identity and infra already landed: least privilege, explicit boundaries, and centralized enforcement. It also shows up fast in enterprise security reviews.
Operationally, it’s four moves:
1) Central controls. SSO, SCIM, admin policy enforcement, and audit logs. If you can’t centrally shut it off, you can’t govern it.
2) Defensible data boundaries. Get vendor commitments on retention and training usage for business traffic in writing, and map them to your data classification rules.
3) Secret hygiene on AI pathways. Scan prompts (where stored), logs, and generated output for credentials and sensitive tokens. Use the same mental model as code: secret scanning, push protection, and “assume it will leak unless caught.”
4) Sandboxing + least privilege for agents. If an agent can execute code or call APIs, scope tools tightly and default to read-only until you have evidence it behaves under constraints.
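Move 3 is the easiest of the four to prototype. Here is a minimal sketch of scanning and redacting text before it is sent to a model or written to a log. The three patterns are illustrative and far from exhaustive, and the function names are assumptions; a real deployment would lean on a maintained secret scanner.

```python
import re

# Illustrative patterns only. Real scanners ship far larger, maintained rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "generic_assignment": re.compile(
        r"\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*\S+",
        re.IGNORECASE,
    ),
}


def scan_for_secrets(text):
    """Return the names of every pattern that matches the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def redact(text):
    """Replace each match with a labeled placeholder before the text leaves."""
    for name, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text


if __name__ == "__main__":
    prompt = "Debug this config: API_KEY=abc123 fails on startup"
    if scan_for_secrets(prompt):
        prompt = redact(prompt)
    print(prompt)
```

Running the same check on generated output as well as on prompts matches the "assume it will leak unless caught" stance.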
Compliance pressure shows up as procurement pressure. The EU AI Act entered into force in August 2024, with obligations phasing in over the following years, and buyers now ask harder questions about governance, transparency, and risk controls. Even if your product isn’t classified as “high-risk,” internal AI usage still intersects with security controls and data processing obligations.
Teams that treat governance as a sales accelerant win deals faster. Walk into a customer security review with an approved-tools list, retention settings, audit access, PII handling rules, and review gates—and the conversation moves. Show up with ad hoc accounts and unclear data flow—and you’ll be stuck in procurement.
Model choice is operating choice: cost, latency, and quality collide
Once AI becomes default, spend control becomes a weekly habit. You’re managing a three-way trade: unit cost per task, responsiveness (latency and reliability), and output quality (accuracy and consistency). “Best model everywhere” turns into budget creep. “Cheapest tokens everywhere” turns into rework and missed details.
Teams that stay sane segment usage into tiers:
Tier 1: low-risk, high-volume work (summaries, formatting, first drafts) routed to fast, cheaper options.
Tier 2: medium-risk work (internal specs, code suggestions) routed to stronger models behind tests and review requirements.
Tier 3: high-risk work (customer-facing legal language, security-sensitive code paths) routed to the highest-trust setup: retrieval with pinned sources, constrained tools, and mandatory human sign-off.
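The three tiers can be sketched as a small routing table. Everything named here is hypothetical: the task labels, the model-class labels, and the choice to send unknown task types to the strictest tier are assumptions made for the example.

```python
# Controls attached to each tier, following the tier descriptions above.
TIER_ROUTES = {
    1: {"model_class": "fast_low_cost", "controls": []},
    2: {"model_class": "stronger", "controls": ["tests", "review"]},
    3: {
        "model_class": "highest_trust",
        "controls": ["pinned_retrieval", "constrained_tools", "human_signoff"],
    },
}

# Task type -> tier. The labels are illustrative.
TASK_TIERS = {
    "summary": 1,
    "formatting": 1,
    "first_draft": 1,
    "internal_spec": 2,
    "code_suggestion": 2,
    "customer_legal_language": 3,
    "security_sensitive_code": 3,
}


def route(task_type):
    """Return (tier, route). Unknown task types fall to tier 3 (assumed default)."""
    tier = TASK_TIERS.get(task_type, 3)
    return tier, TIER_ROUTES[tier]


if __name__ == "__main__":
    for task in ("summary", "code_suggestion", "unclassified_task"):
        tier, chosen = route(task)
        print(task, tier, chosen["model_class"], chosen["controls"])
```

Failing closed on unclassified work costs more per call, but it keeps a missing label from quietly downgrading a risky task.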
Table 1: Common AI operating patterns for 2026 teams (cost, speed, and control trade-offs)
| Approach | Relative monthly spend (100-person eng org) | Strengths | Failure mode |
|---|---|---|---|
| IDE assistant only | Lower / predictable | Low friction; easy rollout; consistent developer experience | More code lands with uneven validation; limited workflow automation |
| Chat-first knowledge work | Lower / variable | Fast drafting for PM, support, sales, and ops | Data handling drifts; weak provenance; hard to replay decisions |
| RAG over internal docs | Medium | Fewer hallucinations; answers anchored to known sources | Stale content and broken permissions; citations can look credible while being wrong |
| Tool-using agents (workflow automation) | Medium to higher | Automates multi-step work across systems (Git, Jira, CRM, chat) | Permission sprawl; hard-to-debug runs; spend spikes without metering |
Budgeting changes shape, too. AI spend is part subscription, part consumption, part “work shifted from humans to systems.” Manage it by workflow (support triage, incident response, PR assistance, sales enablement) and by outcome metrics (cycle time, MTTR, deflection, renewal risk). If spend can’t be tied to a workflow KPI, it’s a hobby.
AgentOps: if it can write to systems, roll it out like a production service
The moment an agent can open PRs, post in Slack, or file tickets, it’s not a demo anymore. It’s production. That brings the same demands DevOps brought: repeatability, controlled rollouts, and a real incident process.
Here’s a leadership checklist you can push through in a quarter:
- Publish allowed use cases (for example: “draft PR description,” “summarize incident,” “draft support reply”) and block anything that creates external commitments or modifies critical access until it has explicit approval.
- Ship evals early with a small gold dataset per workflow and score for accuracy, completeness, and policy compliance.
- Instrument end-to-end: model/version, tool calls, latency, spend, and downstream acceptance (merged PRs, sent replies, closed tickets).
- Gate high-risk actions: external communication, security-sensitive changes, and any data export require human approval.
- Run “agent incidents” like real incidents: if an agent causes harm or near-harm, write it up and fix the system—not the prompt-of-the-week.
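The first and fourth items in the checklist above reduce to a small authorization check that runs before an agent acts. This is a sketch with invented identifiers: the use-case names mirror the examples in the list, while the action labels and return values are assumptions.

```python
# Published use cases; anything else is blocked until explicitly approved.
ALLOWED_USE_CASES = {"draft_pr_description", "summarize_incident", "draft_support_reply"}

# Actions that always need a named human approver (labels are hypothetical).
HIGH_RISK_ACTIONS = {"external_communication", "security_sensitive_change", "data_export"}


def authorize(use_case, action, approved_by=None):
    """Decide whether an agent action may proceed.

    Returns "blocked", "needs_human_approval", or "allowed".
    """
    if use_case not in ALLOWED_USE_CASES:
        return "blocked"
    if action in HIGH_RISK_ACTIONS and not approved_by:
        return "needs_human_approval"
    return "allowed"


if __name__ == "__main__":
    print(authorize("draft_support_reply", "save_draft"))              # allowed
    print(authorize("draft_support_reply", "external_communication"))  # needs_human_approval
    print(authorize("rotate_credentials", "security_sensitive_change",
                    approved_by="alice"))                              # blocked
```

Logging the decision together with `approved_by` gives an agent incident review a concrete record to start from.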
The technical behavior worth insisting on is reproducibility. If the same input produces wildly different outcomes, you don’t have a system—you have roulette. The fix is deterministic scaffolding: pinned retrieval sources, structured outputs (schemas), and constrained tool invocation. Here’s a simplified example of forcing a structured incident summary so it can be stored, compared, and audited.
```json
{
  "workflow": "incident_summary_v2",
  "inputs": {
    "incident_id": "INC-18427",
    "log_window": "2026-03-10T02:10Z..2026-03-10T03:05Z",
    "sources": ["datadog:service-api", "pagerduty:timeline", "slack:#inc-18427"]
  },
  "required_output_schema": {
    "type": "object",
    "required": ["impact", "root_cause", "timeline", "customer_comms"],
    "properties": {
      "impact": {"type": "string"},
      "root_cause": {"type": "string"},
      "timeline": {"type": "array", "items": {"type": "string"}},
      "customer_comms": {"type": "string"}
    }
  }
}
```
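A schema only helps if something rejects output that violates it. The sketch below checks a model's raw output against the schema from the example, using only the standard library. It handles just the features that schema uses (required fields, strings, arrays of strings) and is not a general JSON Schema validator; the function name is an assumption.

```python
import json

# Mirrors "required_output_schema" from the example request.
SCHEMA = {
    "type": "object",
    "required": ["impact", "root_cause", "timeline", "customer_comms"],
    "properties": {
        "impact": {"type": "string"},
        "root_cause": {"type": "string"},
        "timeline": {"type": "array", "items": {"type": "string"}},
        "customer_comms": {"type": "string"},
    },
}


def validate_summary(raw_output, schema=SCHEMA):
    """Return a list of errors; an empty list means the output can be stored."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors = [f"missing required field: {key}"
              for key in schema["required"] if key not in data]
    for key, spec in schema["properties"].items():
        if key not in data:
            continue
        value = data[key]
        if spec["type"] == "string" and not isinstance(value, str):
            errors.append(f"{key}: expected a string")
        elif spec["type"] == "array" and not (
            isinstance(value, list) and all(isinstance(item, str) for item in value)
        ):
            errors.append(f"{key}: expected an array of strings")
    return errors


if __name__ == "__main__":
    print(validate_summary('{"impact": "checkout errors", "timeline": "02:10 alert"}'))
```

Rejected runs are worth keeping alongside accepted ones, since the rejection rate per model version is itself a comparable signal.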
Leaders don’t need to write schemas. They do need to require auditable outputs, storable formats, and comparable runs. Treat the agent like a service, and you get control back.
Metrics that catch activity theater
Once AI is everywhere, “we use it” becomes meaningless. Measure whether it changes outcomes or just multiplies artifacts. Use three buckets: leading indicators (adoption), lagging indicators (business impact), and guardrails (quality and risk).
Table 2: A practical scorecard for AI-enabled teams (quick to stand up, hard to game)
| Metric | Target range | How to measure | Why it matters |
|---|---|---|---|
| AI-assisted merge rate | Rises, then stabilizes | Tag PRs created/edited with AI via IDE/plugin metadata | Shows real workflow change without guessing |
| Rollback share of AI PRs | At or below baseline | Link deployments → PRs → rollback events | Guardrail against confident wrong changes |
| Support deflection | Up, with stable CSAT | Track self-serve resolutions vs human-handled tickets | Direct cost and experience signal |
| MTTR change with AI | Down over time | Compare MTTR before/after incident tooling changes | Tests whether summaries and triage help during real incidents |
| AI cost per resolved unit | Down over time | Per workflow: AI spend / units resolved in that workflow (tickets resolved, PRs merged, or incidents supported) | Prevents spend from outrunning value |
What not to celebrate: “tokens consumed” and “messages sent.” Keep them as denominators, not trophies. The question is whether speed improves while quality holds. If activity rises and rollback share rises too, you didn’t go faster—you just moved the work into a future queue.
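Two of the scorecard rows are simple enough to compute from tagged PR and spend data. The sketch below assumes each PR is a dictionary with `ai_assisted` and `rolled_back` flags, and it treats non-AI PRs from the same period as the baseline. Both are assumptions; a historical baseline would work equally well.

```python
def rollback_shares(prs):
    """Return (ai_share, baseline_share) of rolled-back PRs.

    Each PR is a dict with boolean "ai_assisted" and "rolled_back" keys
    (hypothetical field names). A share is None when its group is empty.
    """
    def share(group):
        return sum(1 for pr in group if pr["rolled_back"]) / len(group) if group else None

    ai = [pr for pr in prs if pr["ai_assisted"]]
    baseline = [pr for pr in prs if not pr["ai_assisted"]]
    return share(ai), share(baseline)


def guardrail_breached(ai_share, baseline_share):
    """True when AI-assisted PRs roll back more often than the baseline."""
    if ai_share is None or baseline_share is None:
        return False
    return ai_share > baseline_share


def cost_per_resolved_unit(ai_spend, resolved_units):
    """AI spend divided by resolved units for one workflow; None if nothing resolved."""
    return ai_spend / resolved_units if resolved_units else None


if __name__ == "__main__":
    prs = [
        {"ai_assisted": True, "rolled_back": True},
        {"ai_assisted": True, "rolled_back": False},
        {"ai_assisted": False, "rolled_back": False},
        {"ai_assisted": False, "rolled_back": False},
    ]
    ai_share, baseline_share = rollback_shares(prs)
    print(ai_share, baseline_share, guardrail_breached(ai_share, baseline_share))
```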
Track cognitive load on senior engineers. If the most experienced people become full-time cleanup crews for generated code, you’ve taxed your highest-leverage role. The healthy pattern is redistribution: juniors move faster with guardrails; seniors spend more time on design and structured review; review is time-boxed and evidence-based. The unhealthy pattern turns seniors into human lint tools.
Key Takeaway
AI doesn’t eliminate management work. It forces you to turn it into operations: ownership, audit trails, permissioning, evaluation, and spend guardrails. If you can’t show those on demand, you’re not running AI—you’re letting it wander through production.
The advantage is operational maturity, not model access
Strong models aren’t scarce anymore. Between frontier providers, enterprise platforms, and open-weight options, “we have AI” isn’t defensible.
The advantage shifts to orgs that can answer—cleanly and quickly—where AI is allowed to act, what it can access, how outputs are evaluated, how failures are handled, and who owns decisions that touch production. That’s leadership work, not ML heroics.
Do one thing next: write a one-page policy for a single workflow you care about (support replies, incident summaries, PR drafting). Include the owner, the allowed actions, the required logs, the review gate, and the metric that would shut the workflow down. Then ask your leads a question that forces clarity: What evidence would make us pause or roll back this automation?