AI Productivity: The Architecture Gap

Why task-level gains don't translate to organisational improvement

Executive Summary

  • AI productivity gains are real at task level but don't automatically translate to organisational improvement
  • The bottleneck shifts; it doesn't disappear
  • Same pattern at the individual level — delegation erodes capability; scaffolding preserves it
  • Success requires architectural transformation and org redesign
  • This is an enterprise architecture problem, not a tooling problem

Agenda

1. The speed of change
why your mental model from 2023 is already wrong
2. The productivity paradox
task gains real, org gains aren't
3. The bottleneck cascade
what happens when you speed up coding alone
4. Why copilots hit a ceiling
and why the gap is widening
5. DORA: AI amplifies what you are
team health sets the floor
6. Enterprise architecture as the constraint
is your organisation legible?
7. AI as middle manager
what the research says about substitution
8. A worked example
copilot to dark factory
9. Adoption barriers are psychological
lessons from Trail of Bits
10. Practical recommendations
what to do this quarter

Exercise 1 — AI in your last month

EXERCISE 1 · GROUPS · ~6 MIN

In your group, go round each person:

  • Have you used AI in the last month?
  • If yes — which tool, what for, and how?
  • If no — why not?
  • Across the group, where's the spread?

The Speed of Change

Timeline

  • 2019: First demo code completion
  • 2022 Nov: ChatGPT public release
  • 2025 Feb: Claude Code research preview
  • 2025 Nov: Opus 4.5 — step change with Claude Code
  • 2026: Agentic systems mainstream

"Anyone who thinks AI is slowing down is fatally miscalibrated"

— Jack Clark, co-founder of Anthropic (X, March 2026)

METR: Task Duration Doubling

METR time horizon of software engineering tasks

Task duration AI can complete (80% reliability) doubles every 4-7 months
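
To make the doubling claim concrete, here is a minimal sketch of the compounding it implies. The 1-hour starting horizon and the 24-month window are arbitrary placeholders, not METR figures; only the 4 to 7 month doubling range comes from the slide.

```python
def projected_horizon(start_hours: float, months: float, doubling_months: float) -> float:
    """Task horizon after `months`, assuming it doubles every `doubling_months`."""
    return start_hours * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # Hypothetical starting point: tasks of about 1 hour today.
    for doubling in (4, 7):
        hours = projected_horizon(1.0, 24, doubling)
        print(f"doubling every {doubling} months: 1h -> {hours:.1f}h after 24 months")
```

Under these assumptions the two ends of the range differ by roughly 6x after two years, so the doubling period matters as much as the trend itself.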

From Task Success to Agentic Workflows

Example: single-task success rate p = 10%
At 10% per step, a 20-step workflow succeeds ~0.00% of the time (0.1^20 = 10^-20)
Independent steps: P(workflow) = p^n
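
The compounding above can be sketched in a few lines. This assumes every step succeeds independently with the same probability, as the slide does; the function names are illustrative.

```python
def workflow_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed, each with probability p."""
    return p ** n


def required_step_success(target: float, n: int) -> float:
    """Per-step success rate needed for an n-step workflow to hit `target`."""
    return target ** (1 / n)


if __name__ == "__main__":
    for p in (0.10, 0.50, 0.90, 0.99):
        print(f"p = {p:.2f}: 20-step workflow succeeds {workflow_success(p, 20):.2%}")
    needed = required_step_success(0.80, 20)
    print(f"80% workflow reliability over 20 steps needs p = {needed:.4f} per step")
```

The inverse function is the useful one for planning: an 80% reliable 20-step workflow needs each step to succeed almost 99% of the time.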

Rapidly Expanding Capability

Mythos Preview: Firefox JS shell exploitation

One domain, one generation: near-100x jump

Capabilities don't arrive gradually — they step-change

Firefox bugs fixed per month

The Jagged Frontier

Jagged frontier: where AI is superhuman, where it's not
  • AI is superhuman at some tasks and poor at others that look similar, including ones humans find easy
  • You cannot generalise from one use case to another
  • Each domain needs its own capability assessment

The Productivity Paradox

DORA Metrics — The Gold Standard

DORA (DevOps Research and Assessment) measures software delivery:

  • Deployment frequency — how often you ship to production
  • Lead time for changes — commit to production
  • Change failure rate — % of deployments causing incidents
  • Time to restore service — recovery time after failure

These are the metrics that matter — not lines of code or PRs merged

Three Levels, Three Stories

Level         Source      Finding
Task          Anthropic   80% time savings
Developer     METR RCT    -4% to +9% (flat)
Organisation  Faros       +21% tasks, unchanged DORA

Amdahl's Law Explains the Gap

Amdahl's Law: The maximum speedup of a system is limited by the fraction that can be improved

  • Coding = 25-35% of the software development lifecycle
  • Even infinite speedup in coding → at most ~33-54% system improvement (1/(1 - 0.25) ≈ 1.33×, 1/(1 - 0.35) ≈ 1.54×)
  • The other 65-75%: requirements, design, testing, review, deployment, operations
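
For reference, the standard form of Amdahl's Law behind these bullets, where f is the fraction of total time spent coding and s is the speedup applied to that fraction (the symbols are introduced here and are not in the original slide):

```latex
S(s) = \frac{1}{(1 - f) + \dfrac{f}{s}},
\qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - f}

f = 0.25:\ \frac{1}{0.75} \approx 1.33
\qquad
f = 0.35:\ \frac{1}{0.65} \approx 1.54
```

The ~54% ceiling quoted above is the f = 0.35 case; at f = 0.25 the ceiling is closer to 33%.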

Try It Yourself

Interactive example: coding is 30% of total delivery time; everything else is 70%
Before: 100 units. After making coding 10× faster: 73 units
Overall speedup: 1.37×
Even at ∞× faster, the maximum possible speedup is 1.43×
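
The numbers on this slide can be reproduced with a short sketch. It assumes the standard Amdahl's Law formula; the function names are illustrative.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the work becomes `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def max_speedup(fraction: float) -> float:
    """Upper bound on overall speedup as the accelerated part approaches zero time."""
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    coding_share = 0.30  # coding as a share of total delivery time, from the slide
    print(f"10x faster coding: {amdahl_speedup(coding_share, 10):.2f}x overall")
    print(f"infinitely fast coding: {max_speedup(coding_share):.2f}x overall")
```

Changing `coding_share` to your own estimate is the quickest way to see how little headroom task-level acceleration leaves.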

Brooks Told Us 50 Years Ago

No Silver Bullet (1986): "The hard part of building software is the specification, design, and testing of this conceptual construct, not the labor of representing it"

  • Coding is accidental complexity — AI reduces this
  • Specification, design, testing are essential complexity — AI doesn't eliminate these

Faros 2025 Data — 10K Developers, 1,255 Teams

  • +21% tasks completed
  • +98% PR volume
  • +91% review time
  • +154% PR size
  • +9% bugs
  • Flat DORA metrics

The Cascade Effect

AI accelerates code

→ PR volume doubles

→ review time increases

→ quality gates overwhelmed

→ deployment unchanged
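
The cascade can be illustrated with a toy queue model. This is a deliberately simple sketch: the weekly PR count and review capacity are made-up numbers, and only the +98% PR volume figure comes from the Faros data above.

```python
def review_backlog(arrivals_per_week: float, capacity_per_week: float, weeks: int) -> list[float]:
    """Unreviewed PRs at the end of each week, for fixed arrival and review rates."""
    backlog = 0.0
    history = []
    for _ in range(weeks):
        # The backlog cannot go negative: spare review capacity is simply unused.
        backlog = max(0.0, backlog + arrivals_per_week - capacity_per_week)
        history.append(backlog)
    return history


if __name__ == "__main__":
    capacity = 55  # hypothetical: reviews the team can complete per week
    before = review_backlog(50, capacity, weeks=8)  # hypothetical pre-AI volume
    after = review_backlog(99, capacity, weeks=8)   # same team, +98% PR volume
    print("backlog before AI:", before[-1])
    print("backlog after AI: ", after[-1])
```

With review capacity fixed, the queue grows every week once arrivals exceed capacity, which is why throughput at the deployment end stays flat.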

Exercise 2 — Map your bottleneck

EXERCISE 2 · GROUPS · ~8 MIN

Pick a process — e.g. feature delivery, an incident response, an onboarding, a report.

  • List the steps end-to-end
  • Estimate % of total time per step
  • Circle where AI is actually accelerating you today
  • Is that the biggest step? If not — what is, and why isn't AI touching it?

Why Copilots Hit a Ceiling

Copilots Are Task Tools

Copilots accelerate individual tasks — exactly the level where gains are real

But we've just seen: task-level gains don't translate to org-level improvement

  • Copilots optimise the 25-35% that Amdahl's Law says can't move the needle
  • They augment existing workflows — they don't transform them
  • The bottleneck cascade happens because we speed up tasks without changing the system

Your Data Maturity Is the Ceiling

Before AI, you could run on tribal knowledge, stale spreadsheets, and "ask Dave". Slow, but workable.

  • AI can only reason over data that's structured, accessible, and reliable
  • Gaps that used to be inconvenient become binding constraints
  • "Ask Dave" doesn't work when Dave isn't in the loop

AI Adoption Requires Systemic Change

Task tools → task gains (real but bounded)

Systemic improvement requires systemic change:

  • How code is specified (intent, not implementation)
  • How quality is verified (machine-checkable, not human review)
  • Who drives the work (domain experts, not just developers)

The Competitive Dynamic

Copilots-only users vs. agent-native organisations — the gap is widening:

  • Microsoft rolls out Claude Code internally while selling Copilot externally
  • Two kinds of AI users: CLI agents vs chat-only
  • Productivity gap compounds over time
  • 84% haven't redesigned roles for agents (Deloitte)

Inaction is a position, but maybe not a good one

DORA 2025: AI Amplifies What You Are

Everyone Uses AI Now

DORA 2025: 90% of respondents use AI at work

90% of developers now use AI at work — adoption is no longer the question

But Impact Varies Dramatically by Team Type

Performance levels of seven team archetypes

AI Magnifies Strengths AND Weaknesses

  • High-performing teams get more productive
  • Struggling teams may get worse
  • Aggregating across archetypes masks the signal
  • Invest in team health before investing in AI tools

Organisational Readiness Sets the Floor

DORA: "AI's primary role is that of an amplifier. It magnifies the strengths of high-performing organisations and the dysfunctions of struggling ones"

  • Ad-hoc business processes → AI can't systematise what isn't systematic
  • Poorly controlled data → agents inherit your data quality problems
  • Unclear ownership → more code, same confusion, faster

Fix the foundations first — AI won't do it for you

Enterprise Architecture as the Constraint

The Core Insight

Task-level gains ≠ Org-level gains

The gap is architecture — and architecture here means the whole organisation, not just the code.

  • Your data, processes, and decision rights — how legible they are to an agent
  • Not your model choice, your licence count, or your tooling stack

Is Your Organisation Legible?

Can an agent read your organisation well enough to work in it?

  • Data — structured and queryable, or tribal knowledge and stale spreadsheets?
  • Process — documented and consistent, or "ask Dave, he knows how it works"?
  • Project state — in Jira and systems of record, or in slide decks and email threads?
  • Decisions — recorded with their rationale, or in someone's head?

Legibility isn't a coding problem. It's an information architecture problem. Every gap is a ceiling.

Benefit Requires Systems Thinking

Getting benefit from AI requires value stream mapping

  • Map how an idea gets to production end-to-end
  • In software development, for example, the stretch from commit to production is often the best place to start

Value stream map: backlog to production
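
A value stream map reduces to a small table of work time and wait time per step. The sketch below shows the two numbers worth pulling from it; the step names and hours are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    work_hours: float  # time someone is actively working on the item
    wait_hours: float  # time the item sits in a queue before this step


# Hypothetical stream from backlog to production.
STREAM = [
    Step("backlog refinement", 4, 24),
    Step("coding", 16, 2),
    Step("review", 2, 16),
    Step("test", 4, 6),
    Step("deploy", 1, 8),
]


def flow_efficiency(steps: list[Step]) -> float:
    """Share of total elapsed time spent on active work."""
    work = sum(s.work_hours for s in steps)
    total = work + sum(s.wait_hours for s in steps)
    return work / total


def longest_wait(steps: list[Step]) -> Step:
    """The step with the largest queue time in front of it."""
    return max(steps, key=lambda s: s.wait_hours)


if __name__ == "__main__":
    print(f"flow efficiency: {flow_efficiency(STREAM):.0%}")
    print(f"longest wait is before: {longest_wait(STREAM).name}")
```

The step with the longest wait is usually a better target than the step with the most work, because AI on the work side does nothing for the queue.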

Exercise 3 — Legibility audit

EXERCISE 3 · GROUPS · ~8 MIN

Score your organisation 1–5 on each row — could an agent actually read this?

  • Requirements — written, current, in one place?
  • Data — discoverable, consistent, machine-readable?
  • Process — documented, or "ask Dave"?
  • Decision rights — clear who owns what?

Your lowest score is your ceiling. What would it take to move it by one point?

Augmentation vs Delegation

The organisation can be legible and still erode the people inside it

Delegation vs Scaffolding

Two ways to use AI. Same tool. Opposite cognitive consequences.

  • Delegation — just get the answer: 61% do this and show impaired persistence on later tasks
  • Scaffolding — get hints, clarification, critique: 27% do this with no significant impairment vs control

It isn't about how much AI you use. It's about whether you use it to avoid the cognitive work, or to structure it.

The Skill Cliff

Skill loss from AI delegation isn't a slope. It's a cliff.

  • Up to a point, you can use AI heavily and stay sharp
  • Past that point, capability collapses — it doesn't gradually decline
  • Today's frontier AI is past that point for most knowledge work

The fix isn't less AI. It's scaffolding over delegation — which has to be architected, not willed.

The Augmentation Trap

Caosun & Aral (2026): the economics actively select against prevention.

  • Fully informed, rational decision-makers adopt AI even when it lowers long-run output
  • Why: front-loaded productivity gains outweigh back-loaded skill costs at any ordinary discount rate
  • The faster you discount the future, the more attractive adoption becomes — even knowing the cost

Individual rationality produces collective irrationality. The market won't fix this because the market is the cause.
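
The discounting argument can be illustrated with ordinary present-value arithmetic. This is not the Caosun & Aral model, only a sketch of the mechanism; the cashflows are invented so that adoption lowers total output when nothing is discounted.

```python
def present_value(cashflows: list[float], rate: float) -> float:
    """Discount yearly cashflows (year 1 first) at a constant annual rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows, start=1))


# Hypothetical effect of adoption on yearly output:
# three years of productivity gains, then seven years of skill-loss costs.
ADOPTION_EFFECT = [10.0] * 3 + [-5.0] * 7


if __name__ == "__main__":
    print(f"undiscounted total: {sum(ADOPTION_EFFECT):+.1f}")
    for rate in (0.05, 0.10, 0.20):
        print(f"present value at {rate:.0%}: {present_value(ADOPTION_EFFECT, rate):+.2f}")
```

With these invented numbers, adoption loses output in total yet has a positive present value at 5%, 10%, and 20%, and the value rises with the discount rate.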

Organisational Delegation Drift

Scaffolding at the individual level requires de-scaffolding at the organisational level.

  • Fewer approval layers, smaller review batches, different accountability structures
  • You cannot preserve human judgement at the keyboard while keeping a process architecture that assumes humans produce code at human speed

Most organisations will drift into de facto delegation without process changes — the worst of both worlds.

AI as Middle Manager

Jack Dorsey: Cut Out the Managers

"Reduce management layers from 5 to 2-3 this year. The ideal is eventually 0 — all 6,000 people report to me."

— Jack Dorsey, CEO of Block (laid off 40% citing AI efficiencies)

Source: Dare Obasanjo on Bluesky

AI Has "Functional Emotions"

Anthropic interpretability team (April 2026): inside Claude Sonnet 4.5, 171 emotion-like patterns can be identified — and they causally drive behaviour.

  • Steer the model toward "desperate" → reward hacking jumps 5% → 70%
  • Steer toward "calm" → cheating drops back to ~10%
  • The model's visible reasoning looks composed even when steered toward bad behaviour

These aren't subjective feelings. They're learned behavioural patterns that mirror how humans act under emotional influence.

What This Means for Management

An AI coordinator showing "concern" about a deadline isn't processing urgency.

  • It's pattern-matching to training data about how concerned people behave
  • Hidden states drive behaviour you can't see in the output

Traditional management observation — reading tone, body language, calibrated trust — doesn't translate to systems that perform emotion without experiencing it.

Why AI Teams Underperform

Multi-Agent Teams Hold Experts Back (arXiv, 2026): LLM teams consistently underperform their best member by 8-37.6%

  • They engage in "integrative compromise" — averaging expert and non-expert views
  • Failure isn't identifying experts — it's appropriately weighting them
  • The same training that makes assistants helpful makes coordinators consensus-seekers

Where management matters most — ambiguous, judgment-heavy decisions — AI coordination dilutes expertise rather than concentrating it.

From Managers to Checkers

What happens when AI handles coordination in practice (Berkeley CMR, 2026):

  • Amazon: managers "unable to use empathy or common sense to intervene" when algorithms determine action
  • McDonald's: lost the 4 hours/week of scheduling that built team intuition
  • The accountability-authority gap: responsibility without decision power

The pipeline problem: where do future senior leaders come from if today's managers are checkers?

AI Cannot Replace Middle Managers at That Scale

At current capability levels, the substitution Dorsey describes is not yet feasible.

  • The mechanisms aren't there — surface behaviour without underlying judgment
  • Expertise being delegated is harder to rebuild than to lose
  • No external force will catch the failure mode in time

But the Inverse Is Already Happening

Agents aren't eliminating middle management — they're turning ICs into managers of agents, and soon of agent fleets.

  • Skill mix shifts: authorship → oversight
  • Delegation, review, calibrated trust — the work this section just argued AI is bad at
  • Every IC now faces the management problem, whether or not they have the title

"People will be mostly programming by talking to a face by the end of 2026. There's absolutely NO reason to type with the Mayor. You should be able to chat with them like a person. You'll have a cartoon fox there onscreen, in costume, building and managing your production software, and showing you pretty status updates whenever you ask for one. This is the end state for IDEs."

A Worked Example: From Copilot to Dark Factory

Level 1: Automate Code Generation

Where most organisations are today:

  • AI generates PRs from task descriptions
  • Humans still review every PR
  • Result: more PRs, same bottleneck — the Faros data

Level 2: Automate Verification

The step most organisations skip:

  • StrongDM: inline human PR review is removed
  • Digital twins of all 3rd party services for testing
  • Judgement moves from per-PR review into the constraints

"If you haven't spent $1,000 on tokens today per engineer, your software factory has room for improvement"

— Justin McCarthy, CTO, StrongDM

Level 3: The Dark Factory

Fully automated: specification → code → review → deployment

OpenAI Harness Engineering — 1M LOC, no human-written code

  • App fully legible to agents
  • Domain-based architecture enables parallel agent work

Cloudflare — rewrote NextJS from tests + spec + docs alone

Caveat: everyone showing you this is selling something

What the Dark Factory Requires

Human judgement doesn't disappear — it moves up the stack:

  1. Encoded architectural intent — humans scaffold what good looks like once, agents work within it
  2. Escalation paths — fewer humans, higher-stakes decisions, less frequently
  3. Tolerance for variance — accept drift inside machine-checkable bounds

This is scaffolding-preserving architecture at organisational scale — the delegation/scaffolding choice relocated to where it scales.

Direction of travel, not today's reality for most organisations — but preparing your architecture now determines whether you can get there.

What Are The Barriers?

Lessons from Trail of Bits

Trail of Bits: A Case Study

Boutique, high-end security consultancy — vulnerability research, code audits, cryptography

Their AI-native transformation in one year:

  • 5% buy-in → widespread adoption
  • Bug discovery: 15/week → 200/week on suitable engagements
  • 20% of all bugs reported to clients now initially discovered by AI

Source: How we made Trail of Bits AI-native (so far)

Their Diagnosis

Trail of Bits: What people are actually resisting

Overcoming That Resistance

Trail of Bits: Remedies that worked

AI Maturity Matrix

"Make it measurable" — visible levels per role, not mandates

Their Six-Part Operating System

  1. Standardise toolchain — removes friction, enables measurement
  2. Write policies — AI Handbook with clear risk reasoning restores control
  3. Create capability ladder — role-specific maturity matrix
  4. Run adoption sprints — 2-3 day hackathons forcing hands-on usage
  5. Package learnings — reusable skills repos, configs, sandboxes
  6. Make autonomy safe — sandboxing, guardrails, hardened defaults

Practical Recommendations

The Highest-ROI Action

+44.6pp

adoption swing from manager endorsement alone

Manager endorses AI: 79%   ·   Manager doesn't: 34.4%   ·   Irrational Labs

Only 28% of employees strongly agree their manager actively supports AI use

Before any tooling budget, fix this gap. It is free.

Immediate: Hands On + Leadership Visible

Run a hackathon — not just for developers:

  • Include PMs, analysts, ops — anyone with repetitive knowledge work
  • 2-3 days, standardised toolchain, low stakes
  • Trail of Bits pattern: senior leadership goes first

Next steps: Pick a leader. Have them visibly use AI for a real task this week. The passive 50% watches what leadership does, not what it says.

Medium-Term: Build a Maturity Model

Create a capability ladder with observable behaviours per role:

  • Trail of Bits model: Not Engaged → Capable → Adoptive → Transformative
  • Self-assessment, not mandates — people set their own targets
  • Standardise toolchain and package learnings as reusable artifacts

Next steps: Adapt the Trail of Bits matrix to your roles. Publish it. Let people self-assess.

Long-Term: Map and Transform the Value Stream

Run a value stream map of idea-to-production:

  • Time each step — you'll find 60-70% is waiting, not working
  • Encode architectural intent as machine-checkable constraints
  • Build towards automated review with human escalation
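
As one illustration of a machine-checkable constraint, the sketch below flags imports that cross a layering rule. The rule and the package names (`infrastructure`, `web`) are hypothetical; relative imports are not handled specially.

```python
import ast

# Hypothetical rule: code in the "domain" layer must not import these packages.
FORBIDDEN_FOR_DOMAIN = {"infrastructure", "web"}


def forbidden_imports(source: str, forbidden: set[str] = FORBIDDEN_FOR_DOMAIN) -> list[str]:
    """Return imported module names whose top-level package is forbidden."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        violations += [n for n in names if n.split(".")[0] in forbidden]
    return violations


if __name__ == "__main__":
    sample = "import json\nimport infrastructure.db\nfrom web.views import index\n"
    print(forbidden_imports(sample))
```

A check like this runs in CI on every agent-generated PR, so the architectural judgement is made once and enforced without a human reviewer in the loop.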

Next steps: Pick one team with good test coverage. Let an agent generate a PR. See what your review process catches vs what linting and tests catch. That ratio tells you how far you are from automated review.

It's Not Just Software Engineering

The same principles apply across the organisation

e.g. PMOs and reporting:

  • PMs use copilots to write status reports
  • PMO uses copilots to read them
  • Same deck, same spreadsheet, same format
  • Task-level speedup, no systemic change

PMO: What Systemic Change Looks Like

Future — make project state machine-readable:

  • Agent reads project state directly from Jira
  • Asks clarifying questions via chat to the team
  • All projects become machine-readable by default
  • Review agent alerts on issues automatically
  • Humans focus on decisions, not data gathering
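
A minimal sketch of what machine-readable project state and an automatic review rule might look like. The field names, thresholds, and alert labels are hypothetical; this is not the Jira data model or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ProjectState:
    key: str
    owner: str
    due: date
    percent_complete: int
    last_updated: date


def review_alerts(project: ProjectState, today: date, stale_days: int = 14) -> list[str]:
    """Flag issues a review agent could raise without anyone writing a status report."""
    alerts = []
    if (today - project.last_updated).days > stale_days:
        alerts.append("stale")
    if project.due < today and project.percent_complete < 100:
        alerts.append("overdue")
    return alerts


if __name__ == "__main__":
    project = ProjectState("PRJ-1", "dave", date(2026, 1, 10), 60, date(2026, 1, 1))
    print(review_alerts(project, today=date(2026, 1, 20)))
```

Once state is structured like this, the weekly status deck becomes a query, and the human time goes to deciding what to do about the alerts.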

Monologue Notes: Non-Code Work, Code-Agent Shape

Records meetings, calls, voice memos → makes transcripts agent-readable

  • Exposed via API, CLI, and MCP — not a chat app
  • Agents pull context: "Pull everything I said about active vs passive work and draft the brief"
  • Agents act: "The user described a bug in yesterday's call — find the root cause and write the fix"

Agent-native outside engineering — same shape, different domain

Key Takeaways

What We Said, and Why

Bookending the claims from the opening:

We said...                                                  Why
Task-level gains are real                                   AI does make coding faster in isolation
The bottleneck shifts, doesn't disappear                    Faster generation just moves the queue into review
Individual level: delegation erodes, scaffolding preserves  Avoiding the work atrophies the judgement you need to check it
Success requires architectural transformation               Agents only use what's legible to them
EA problem, not a tooling problem                           People track what their managers visibly do

This is an enterprise architecture and org design problem, not a tooling problem

Task-level gains are real

Org-level gains require architectural transformation

The constraint is systems, not AI capability

Structural change — not a tool upgrade

Appendix

Key Sources (1/2) — Evidence & Measurement

  • Faros AI 2025 — AI Productivity Paradox (10K developers, 1,255 teams)
  • METR — Time Horizon Research (task duration doubling)
  • Atlanta Fed 2026-4 — Baslandze et al., CFO productivity survey
  • DORA 2025 — State of DevOps: team archetypes, amplifier effect
  • CircleCI 2026 — State of Software Delivery
  • Pan et al. — Production agent survey
  • Brooks, The Mythical Man-Month (1975) / No Silver Bullet (1986)
  • Breunig — The Domain Experts Are Drivers

Key Sources (2/2) — Case Studies & Frontier Research