Copilots didn’t just speed up coding — they made output metrics lie
If your team still celebrates commit counts, PR volume, or tickets closed, you’re reading a dashboard that AI can spoof. Copilots can produce a week of diffs in a morning. That doesn’t mean you shipped value. It means you generated artifacts.
Public signals already point in the same direction. Shopify has pushed an “AI-first” posture: assume AI can draft the first pass. GitHub’s own guidance around Copilot keeps returning to the same guardrails: reviews, tests, and policy. OpenAI has repeated one caution across releases: model output is a suggestion, not an approval. Those aren’t hot takes about tooling; they’re instructions about accountability.
The failure mode rarely starts with “we couldn’t write the code.” It starts with “we never pinned down the decision.” Once drafting is cheap, the expensive part moves upstream: priorities, interfaces, data boundaries, rollout strategy, and what you’re willing to undo. If leadership doesn’t force decisions to be explicit, the org will fill the gaps with plausible code and confident explanations.
Stop trying to “inspire.” Build a decision machine.
Old management pain was energy: keeping people moving in the same direction. The new pain is altitude: making sure the right decisions happen at the right level, with the right proof, before the copilot cranks out ten clean implementations of a bad idea.
Speed turns small ambiguity into expensive work. A fuzzy requirement becomes a spray of PRs. A sloppy boundary propagates across services. A questionable library spreads everywhere because “it worked once.” Teams that stay fast aren’t more intense; they’re stricter about what work is allowed to start.
This is why mechanisms that look old-school suddenly work again: clear ownership, written artifacts, and decisions that survive the meeting. Amazon’s emphasis on ownership and written narratives exists for a reason. Stripe’s culture of RFCs and internal memos exists for a reason. You don’t need to cosplay any one company. You do need a place where intent is durable and searchable, so you don’t run production systems on vibes and AI-generated diffs.
Sort decisions by altitude (and stop letting PRs smuggle architecture)
High-functioning orgs separate decisions by altitude—strategy, product, architecture, implementation, operations—and they make the boundary visible.
Architecture decisions (data stores, eventing patterns, identity boundaries, cross-service contracts) should not be “whatever got merged.” Put them in an RFC with cross-functional review, a threat model, and a clear statement of reversibility. Implementation decisions (refactors, helpers, tests, small optimizations) can live in the PR flow—if CI gates and review checklists are real and enforced.
Decision latency becomes your bottleneck (treat it like uptime)
Once drafting is fast, the wait shifts to approvals, unresolved ambiguity, and cross-team dependencies. If your security review takes longer than building the feature, you didn’t speed up delivery—you taught the org to bypass guardrails.
Run decision latency like an ops metric: track it, set expectations, staff it. If a review lane is constantly blocked, fix the system: office hours, better templates, explicit ownership, and a turnaround target leadership protects.
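A minimal sketch of what tracking decision latency could look like, assuming you can export review requests with a lane name and two timestamps. The `ReviewRequest` shape, the lane names, and the turnaround target are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class ReviewRequest:
    lane: str               # hypothetical lane name, e.g. "security"
    requested_at: datetime  # when the decision was asked for
    decided_at: datetime    # when an approve/reject decision landed


def lane_latency_report(requests, target: timedelta):
    """Map each lane to (median latency in hours, True if over target)."""
    target_hours = target.total_seconds() / 3600
    hours_by_lane = {}
    for r in requests:
        waited = (r.decided_at - r.requested_at).total_seconds() / 3600
        hours_by_lane.setdefault(r.lane, []).append(waited)
    return {
        lane: (median(hours), median(hours) > target_hours)
        for lane, hours in hours_by_lane.items()
    }
```

A lane that shows `True` week after week is a staffing or process problem, not an individual one. The median is used here so one stuck request doesn't hide a healthy lane; a percentile would work too.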
Table 1: Execution patterns that show up in AI-heavy engineering teams (2026)
| Model | Best for | Core mechanism | Typical failure mode |
|---|---|---|---|
| PR Factory (Copilot-heavy) | Repeatable features with stable conventions | AI-generated diffs plus hard CI/review gates | Reviewer burnout; slow architectural drift |
| RFC-First (Write, decide, then build) | Platform and high-blast-radius changes | Short written proposals and a decision log | Process sprawl; needless friction for small work |
| Boundary Teams (API/domain ownership) | Many services and many internal consumers | Contracts, versioning rules, and on-call ownership | Local optimization; weak end-to-end coherence |
| Quality SLO Teams (Reliability-led) | High-availability and regulated systems | SLOs, error budgets, and release gates | Shipping stalls if targets aren’t realistic |
| Customer-Outcome Squads | Funnels, activation, retention, UX iteration | Metric ownership tied to releases | Debt accumulates behind experiments |
Stop arguing about quality. Demand proof.
Copilots write convincing code. That’s exactly why they’re dangerous: the code looks tidy, reads well, and fails where your intuition won’t catch it—edge cases, weird data, concurrency, permission boundaries, and operational behavior under load.
So “looks good” can’t be your standard. Your standard is evidence: tests, scanners, policy checks, and a small set of metrics that stay honest even when everyone is excited.
Use the obvious rule: if AI can generate the correct version quickly, it can generate the incorrect version just as quickly. Your engineering system exists to reject the incorrect version early. That means CI that blocks merges, contract tests where boundaries matter, dependency and secret scanning, and observability you actually trust. SRE discipline still matters because it forces explicit tradeoffs; SLOs and error budgets turn “quality” into a constraint instead of a debate.
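To make the error-budget idea concrete, here is a rough sketch, assuming a request-based availability SLO measured over a fixed window. The function names and the release rule are hypothetical; real SLO tooling usually adds burn-rate alerts on top.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget left in the window.

    1.0 means untouched, 0.0 means spent, negative means overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        return 1.0  # no traffic, nothing spent
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def release_allowed(slo_target, total_requests, failed_requests,
                    min_budget=0.0):
    """Illustrative gate: ship only while some budget is left."""
    remaining = error_budget_remaining(slo_target, total_requests,
                                       failed_requests)
    return remaining > min_budget
```

With a 99% target over 10,000 requests, 100 failures are allowed. Fifty failures leaves about half the budget; 150 failures overspends it and the gate closes.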
Redefine “done” while you’re here. “Merged” is a developer milestone. It’s not a customer outcome. “Done” should mean deployed, observable, and tied to a success signal you can monitor. If your AI-assisted speed increases incidents, support load, or unit cost, you didn’t get faster—you relocated the cost to operations and customers.
“If you can’t measure it, you can’t improve it.” (commonly attributed to Peter Drucker)
Keep a weekly scorecard small and ruthless. Pick indicators that punish self-deception: change failure rate, MTTR, escaped defects, unit cost, and security findings by severity. The first two are already DORA metrics; deployment frequency and lead time can earn a seat too, but only as part of the set. Shipping more often while breaking more often is just failure at higher frequency.
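Two of those scorecard numbers can be computed from a plain list of deployments. This is a sketch under assumptions: the `Deployment` record and its fields are hypothetical, and MTTR is taken as the mean restore time of the deployments that caused incidents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    id: str
    caused_incident: bool
    minutes_to_restore: Optional[float] = None  # set only for failures


def weekly_scorecard(deployments):
    """Compute deploy count, change failure rate, and MTTR for one week."""
    total = len(deployments)
    failures = [d for d in deployments if d.caused_incident]
    restore = [d.minutes_to_restore for d in failures
               if d.minutes_to_restore is not None]
    return {
        "deploys": total,
        "change_failure_rate": len(failures) / total if total else 0.0,
        "mttr_minutes": sum(restore) / len(restore) if restore else None,
    }


def faster_but_worse(previous, current):
    """True when a week shipped more often and also broke more often."""
    return (current["deploys"] > previous["deploys"]
            and current["change_failure_rate"] > previous["change_failure_rate"])
```

`faster_but_worse` is the point of reading the metrics as a set: a rising deploy count only counts as progress if the failure rate did not rise with it.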
Fix incentives or you’ll ship beautiful garbage
Once output is cheap, reward systems that pay for visible artifacts become corrosive. You’ll get giant PRs, “helpful” refactors nobody asked for, and automated motion that reads great in status updates. If you keep the same incentives, the org will optimize for what’s easiest to display: more code.
Switch to outcome incentives: customer impact, reliability improvement, and reusable foundations that make other teams faster. Attribution gets messy. Good. Messy attribution beats clean metrics that push the org toward the wrong behavior.
This is the concrete version of “context, not control.” Netflix popularized the phrase; the AI-era translation is: state the constraints, then judge outcomes and risk. If someone ships fast with a copilot, the questions aren’t about speed. Did a metric move? Did operational load go down? Did we reduce the probability of a known failure class? If you can’t tie work to an outcome, tie it to risk reduction and maintainability.
Value the quiet work that makes AI safe: paved roads, templates, policy-as-code, review heuristics, and internal platforms that prevent a zoo of one-off services with surprise security and ops behavior.
One practical move: rewrite your career ladder examples. “Built feature X” is weak. “Made feature X safe to operate and easy to change, with a decision record and clear ownership” is strong.
Governance that doesn’t metastasize into meetings
“Governance” gets hated because it often means approvals without standards. In AI-assisted engineering, governance is how you keep speed without stepping on predictable landmines: data exposure, licensing mistakes, insecure defaults, and cost surprises. The target isn’t a committee. The target is enforced constraints.
Security makes the point cleanly. Many teams already run dependency scanning (Snyk, GitHub Advanced Security, GitLab scanners), secret detection, and SBOM tooling. The leadership decision is whether these checks are optional. If you claim “no critical vulnerabilities,” then CI must block merges that violate it. If engineers can paste sensitive data into an unapproved model endpoint, that’s not a “policy” failure. It’s a tooling, access-control, and workflow failure. Fix it with approved tools, DLP controls where appropriate, and rules that are easy to follow and hard to bypass.
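If “no critical vulnerabilities” is the claim, the gate itself can be very small. The sketch below assumes scanner findings have already been normalized into dictionaries with a `severity` field; that shape is an assumption for illustration, not the output format of Snyk, GitHub Advanced Security, or GitLab.

```python
# Ordered from least to most severe.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def blocking_findings(findings, threshold="critical"):
    """Return findings at or above the threshold severity."""
    limit = SEVERITY_ORDER.index(threshold)
    return [f for f in findings
            if SEVERITY_ORDER.index(f["severity"]) >= limit]


def gate_exit_code(findings, threshold="critical"):
    """Exit code for a CI step: 1 blocks the merge, 0 lets it through."""
    return 1 if blocking_findings(findings, threshold) else 0
```

A CI job would pass `gate_exit_code(...)` to `sys.exit` so a violating finding fails the required check. The leadership decision is the `threshold` value and whether the check is required at all.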
A 2026 baseline: four automated guardrails worth enforcing
- Access: least-privilege defaults (SSO, short-lived credentials) plus recurring access review.
- Code safety: CODEOWNERS on critical paths and required approvals for auth, billing, and sensitive data modules.
- Data handling: classification labels (public/internal/confidential/restricted) with enforcement where restricted data can flow.
- Cost controls: budget alerts and unit-cost dashboards for core actions, including inference where applicable.
Cost governance is now product and finance territory, not just an infra detail. AI features can rewrite unit economics. Leaders should require teams to explain, in plain language, what drives cost and what happens under a usage spike: caching, model choice, fallback behavior, and hard limits where needed.
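A plain-language cost explanation usually reduces to a few numbers. Here is a back-of-envelope sketch, assuming cost is dominated by model calls and that cache hits are free; the parameter names, prices, and cap are made up for illustration.

```python
def projected_daily_cost(requests: int, cache_hit_rate: float,
                         cost_per_model_call: float) -> float:
    """Daily spend when only cache misses reach the model."""
    model_calls = requests * (1 - cache_hit_rate)
    return model_calls * cost_per_model_call


def stop_loss_check(requests, cache_hit_rate, cost_per_model_call, daily_cap):
    """Summarize spend, unit cost, and whether the hard limit would trip."""
    cost = projected_daily_cost(requests, cache_hit_rate, cost_per_model_call)
    return {
        "projected_cost": cost,
        "unit_cost": cost / requests if requests else 0.0,
        "over_cap": cost > daily_cap,
    }
```

Running it once for a normal day and once for a 5x spike answers the question leaders should be asking: does the cap trip, and what does the product do when it does (fallback model, degraded feature, or a hard stop)?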
Table 2: A decision checklist for AI-assisted engineering work (use in planning and review)
| Decision area | Ask | Evidence required | Owner |
|---|---|---|---|
| Customer outcome | What changes for the user, and what signal proves it? | Baseline plus target metric; measurement plan | PM + Eng lead |
| Reliability | Which SLO might this hit, and how do we back out? | SLO impact note; runbook and rollback steps | Service owner |
| Security & data | Does this touch restricted data, auth, or billing paths? | Threat model; scanner output; data classification | Security partner |
| Cost | What drives unit cost, and where is the stop-loss? | Unit-cost estimate; scaling assumptions; caps and alerts | Eng + Finance |
| Reversibility | How hard is this to undo, and what’s the path back? | Migration plan; feature flag or backout plan | Tech lead |
The manager’s new job: debug the workflow
AI shifts the manager’s center of gravity. You’re not unblocking syntax. You’re debugging the system: review capacity, unclear specs, fuzzy ownership, brittle releases, and incentives that reward the wrong behavior. The managers who win treat execution like an ops pipeline: clear inputs, hard gates, and continuous tightening.
Start with review. If copilots increase PR volume, the naive answer is “review more.” That collapses. You need review architecture: smaller PRs, stronger automation, and crisp expectations for what humans do (correctness, security, interface design) versus what automation does (formatting, linting, baseline tests). CODEOWNERS isn’t optional in sensitive areas. A rotating “review captain” can keep flow moving without burning out the same two people.
Planning needs the same reset. AI makes tasks look small because diffs are easy to generate. Plan around risk, not effort. Anything touching auth, money, or restricted data is high risk even if the diff is tiny. In 1:1s, ask questions that flush risk early: What assumption is doing the most work? What decision is blocked? What failure mode are you not writing down? The goal is to surface constraints while you can still change course.
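Planning around risk rather than effort can be partly automated. This sketch labels a change by the paths it touches before it looks at diff size. The path patterns and the 400-line threshold are hypothetical placeholders for your own repo layout.

```python
from fnmatch import fnmatchcase

# Hypothetical repo layout; replace with your own sensitive paths.
HIGH_RISK_PATTERNS = [
    "services/auth/*",
    "services/billing/*",
    "*/restricted_data/*",
]
LARGE_DIFF_LINES = 400  # arbitrary illustrative threshold


def risk_level(changed_paths, lines_changed):
    """Sensitive paths are high risk regardless of how small the diff is."""
    touches_sensitive = any(
        fnmatchcase(path, pattern)
        for path in changed_paths
        for pattern in HIGH_RISK_PATTERNS
    )
    if touches_sensitive:
        return "high"
    return "medium" if lines_changed > LARGE_DIFF_LINES else "low"
```

A three-line change under `services/auth/` comes back `"high"`, while a 900-line UI change comes back `"medium"`. That ordering is the whole point: size is the tiebreaker, not the signal.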
# Example: a lightweight PR template that forces “proof” over persuasion
# .github/pull_request_template.md
## Outcome
- What user/customer metric does this aim to move?
- Link to spec/RFC:
## Risk
- Security/data touched? (Y/N) Details:
- Reliability impact / SLO considerations:
- Rollback plan:
## Evidence
- Tests added/updated:
- Screenshots/recordings (if UI):
- Observability: dashboard or log query link:
This isn’t bureaucracy cosplay. It makes review faster, makes decisions readable later, and trains the team to think in outcomes and evidence—even while the copilot offers endless alternative implementations.
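A template only changes behavior if blank fields get caught. The sketch below checks a PR description against three fields from the template above. Which fields count as required is an assumption here, and how you fetch the PR body depends on your CI system.

```python
import re

# Fields from the template that must not be left blank (illustrative choice).
REQUIRED_FIELDS = ["Link to spec/RFC", "Rollback plan", "Tests added/updated"]


def missing_fields(pr_body: str):
    """Return required template fields that are absent or left empty."""
    missing = []
    for field in REQUIRED_FIELDS:
        # Match "- <field>:" at the start of a line and capture the rest.
        pattern = rf"^- {re.escape(field)}:[ \t]*(.*)$"
        match = re.search(pattern, pr_body, flags=re.MULTILINE)
        if match is None or not match.group(1).strip():
            missing.append(field)
    return missing
```

Wired into CI, a non-empty result fails the check and names the gap, which is a cheaper conversation than a reviewer asking for the rollback plan by hand.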
A 30–60–90 cadence that changes behavior (not just tool access)
The most common failure is treating AI like procurement: buy seats, post guidelines, announce “go use it.” That increases output and reduces coherence. If you want speed without chaos, change measurement, review, and trust at the same time.
Days 1–30: make reality visible. Pick a small set of delivery and quality metrics, then add two AI-era signals: review load (PRs opened vs. PRs reviewed) and unit cost for core actions. If reliability is shaky, stop fantasizing about speed. Fix the basics: alerts, runbooks, ownership, rollback paths.
Days 31–60: add constraints that keep velocity safe. Use a PR template. Put CODEOWNERS on critical modules. Enforce CI gates for severe findings and secret detection. Require lightweight RFCs for irreversible changes, and start a decision log people can actually search and reuse.
Days 61–90: scale autonomy with boundaries. Build paved roads: starter repos, standard observability, deployment templates, approved model usage patterns. Then update incentives so outcomes and operational quality win promotions—not PR volume.
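The review-load signal from the first 30 days needs nothing more than two weekly counts. A rough sketch, assuming you can pull PRs opened and PRs reviewed per week from your code host; the output field names are hypothetical.

```python
def review_load(weeks):
    """weeks: list of (opened, reviewed) PR counts, oldest first.

    Returns one row per week with the opened-to-reviewed ratio and the
    running backlog of unreviewed PRs.
    """
    backlog = 0
    rows = []
    for opened, reviewed in weeks:
        backlog = max(0, backlog + opened - reviewed)
        ratio = round(opened / reviewed, 2) if reviewed else None
        rows.append({"opened_per_review": ratio, "backlog": backlog})
    return rows
```

A ratio that stays above 1.0 means the backlog grows every week, which is the early warning that copilot output is outrunning human review capacity.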
A question worth sitting with: if your team doubled its code output next month, would customers notice improvement—or would you just arrive at the same incidents sooner?
Key Takeaway
Copilots made execution cheap. Leadership is now decision design: explicit ownership, enforced guardrails, and proof-based “done” so speed doesn’t turn into hidden risk.
- Measure what bites back: delivery, reliability, review load, unit cost, and security findings.
- Make “done” operational: deployed, observable, and tied to a success signal.
- Enforce constraints in CI: scanners, secret detection, and required reviews that block bad merges, backed by cost alerts.
- Separate decision altitudes: RFCs for irreversible architecture; PR flow for implementation detail.
- Reward outcomes: customer impact, reliability gains, and reusable foundations—not artifact volume.