AI Productivity: The Architecture Gap

Why task-level gains don't translate to organisational improvement

Executive Summary

AI productivity gains are real at task level but don't automatically translate to organisational improvement
The bottleneck shifts; it doesn't disappear
Same pattern at the individual level — delegation erodes capability; scaffolding preserves it
Success requires architectural transformation and org redesign
This is an enterprise architecture problem, not a tooling problem

Agenda

1. The speed of change
why your mental model from 2023 is already wrong

2. The productivity paradox
task gains real, org gains aren't

3. The bottleneck cascade
what happens when you speed up coding alone

4. Why copilots hit a ceiling
and why the gap is widening

5. DORA: AI amplifies what you are
team health sets the floor

6. Enterprise architecture as the constraint
is your organisation legible?

7. AI as middle manager
what the research says about substitution

8. A worked example
copilot to dark factory

9. Adoption barriers are psychological
lessons from Trail of Bits

10. Practical recommendations
what to do this quarter

Exercise 1 — AI in your last month

EXERCISE 1 · GROUPS · ~6 MIN

In your group, go round each person:

Have you used AI in the last month?
If yes — which tool, what for, and how?
If no — why not?
Across the group, where's the spread?

The Speed of Change

Timeline

2019: First code-completion demo
2022 Nov: ChatGPT public release
2025 Feb: Claude Code research preview
2025 Nov: Opus 4.5 — step change with Claude Code
2026: Agentic systems mainstream

"Anyone who thinks AI is slowing down is fatally miscalibrated"

— Jack Clark, co-founder of Anthropic (X, March 2026)

METR: Task Duration Doubling

METR time horizon of software engineering tasks

Task duration AI can complete (80% reliability) doubles every 4-7 months
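
To make the doubling claim concrete, here is a minimal sketch that compounds a task horizon at the two ends of the 4-7 month range. The one-hour starting horizon and the function name are assumptions chosen for illustration, not METR figures.

```python
def projected_horizon(start_hours: float, months_elapsed: float, doubling_months: float) -> float:
    """Compound growth: start * 2 ** (elapsed / doubling period)."""
    return start_hours * 2 ** (months_elapsed / doubling_months)


# Hypothetical starting point: a 1-hour task horizon today.
for doubling in (4, 7):
    after_two_years = projected_horizon(1.0, 24, doubling)
    print(f"doubling every {doubling} months: {after_two_years:.1f}h after 24 months")
```

Under these assumptions the two ends of the range diverge by roughly a factor of six within two years, which is why a capability assessment made at one point in time dates quickly.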

From Task Success to Agentic Workflows

Rapidly Expanding Capability

Mythos Preview: Firefox JS shell exploitation

One domain, one generation: near-100x jump

Capabilities don't arrive gradually — they step-change

Firefox bugs fixed per month

The Jagged Frontier

Jagged frontier: where AI is superhuman, where it's not

AI is superhuman at some things humans find easy, and terrible at others that look similar
You cannot generalise from one use case to another
Each domain needs its own capability assessment

Source: Helen Toner, Taking Jaggedness Seriously (helentoner.substack.com), adapting Ethan Mollick. The frontier isn't a smooth line where AI capability is high on one side and low on the other — it's a jagged, unpredictable shape where "easy for humans" and "easy for AI" don't track each other. Coding can be superhuman; basic arithmetic can be terrible; legal reasoning can be great in some sub-domains and hopeless in adjacent ones. For IT leaders, this has two implications: (1) you cannot extrapolate from a successful pilot in one domain to a different domain without testing, and (2) the shape of the frontier changes with each model release, so your capability map has a short shelf life. Previous slide (Mythos/Firefox) shows the speed at which any given domain can jump; this slide shows the unevenness of where the jumps happen.

The Productivity Paradox

DORA Metrics — The Gold Standard

DORA (DevOps Research and Assessment) measures software delivery:

Deployment frequency — how often you ship to production
Lead time for changes — commit to production
Change failure rate — % of deployments causing incidents
Time to restore service — recovery time after failure

These are the metrics that matter — not lines of code or PRs merged
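
As a rough sketch of how the four metrics could be derived from deployment records: the `Deployment` record shape, the field names, and the sample week below are all hypothetical, and time to restore is approximated as the gap between the failing deployment and recovery.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # change reached production
    caused_incident: bool = False           # did this deployment trigger an incident?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def dora_summary(deployments: list[Deployment], window_days: int) -> dict:
    """Summarise the four DORA metrics over a reporting window."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.caused_incident]
    # Approximation: restore time is measured from the failing deployment.
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restores) if restores else None,
    }


# Hypothetical week of deployments.
week = [
    Deployment(datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 17)),
    Deployment(datetime(2026, 1, 6, 9), datetime(2026, 1, 7, 9),
               caused_incident=True, restored_at=datetime(2026, 1, 7, 11)),
    Deployment(datetime(2026, 1, 8, 9), datetime(2026, 1, 8, 13)),
]
summary = dora_summary(week, window_days=7)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that none of the four values depends on how much code was written, which is the point of the slide.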

Three Levels, Three Stories

Level	Source	Finding
Task	Anthropic	80% time savings
Developer	METR RCT	-4% to +9% (flat)
Organisation	Faros	+21% tasks, unchanged DORA

Amdahl's Law Explains the Gap

Amdahl's Law: The maximum speedup of a system is limited by the fraction that can be improved

Coding = 25-35% of the software development lifecycle
Even infinite speedup in coding → at most a 1.33x-1.54x system speedup (~33-54% improvement, depending on coding's share)
The other 65-75%: requirements, design, testing, review, deployment, operations

Try It Yourself
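
A minimal sketch of the calculation, assuming coding takes 25-35% of the lifecycle as stated above; the function name and the chosen local speedups (2x, 10x, infinite) are illustrative.

```python
def amdahl_speedup(fraction_improved: float, local_speedup: float) -> float:
    """Overall speedup when only `fraction_improved` of the work gets faster."""
    if not 0 <= fraction_improved <= 1:
        raise ValueError("fraction_improved must be between 0 and 1")
    remaining = (1 - fraction_improved) + fraction_improved / local_speedup
    return 1 / remaining


for coding_share in (0.25, 0.35):
    for local in (2, 10, float("inf")):
        overall = amdahl_speedup(coding_share, local)
        print(f"coding {coding_share:.0%} of work, {local}x faster: {overall:.2f}x overall")
```

With coding at 35% of the work, an infinite coding speedup gives 1 / 0.65, about 1.54x; at 25% the ceiling is 1 / 0.75, about 1.33x. Swap in your own share from the bottleneck exercise to see your ceiling.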

Brooks Told Us 50 Years Ago

No Silver Bullet (1986): "I believe the hard part of building software to be the specification, design, and testing of this conceptual construct, not the labor of representing it and testing the fidelity of the representation"

Coding is accidental complexity — AI reduces this
Specification, design, testing are essential complexity — AI doesn't eliminate these

Fred Brooks drew the distinction between essential and accidental complexity in software. Essential complexity is inherent in the problem — understanding what to build, designing how it fits together, verifying it works correctly. Accidental complexity is the labour of expressing that understanding in code. AI dramatically reduces accidental complexity. But Brooks' insight is that accidental complexity was never the main cost. His estimate that five-sixths (83%) of time is spent on things other than coding aligns remarkably with the Amdahl's Law calculation — coding has always been the minority of software work. We've known this for 50 years. The industry keeps rediscovering it. Brooks also noted a 10:1 productivity variance between the best and worst developers — a ratio that maps onto organisations as well, and which AI amplifies rather than eliminates.

Faros 2025 Data — 10K Developers, 1,255 Teams

+21% tasks completed
+98% PR volume
+91% review time
+154% PR size
+9% bugs
Flat DORA metrics

The Cascade Effect

AI accelerates code

→ PR volume doubles

→ review time increases

→ quality gates overwhelmed

→ deployment unchanged
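
The cascade can be sketched as a pipeline whose sustained throughput is set by its slowest stage. Only the +98% PR volume comes from the Faros data; the weekly stage capacities below are hypothetical numbers picked to show the mechanism.

```python
def bottleneck(capacities: dict[str, float]) -> tuple[str, float]:
    """Return the slowest stage and the sustained throughput it allows."""
    stage = min(capacities, key=capacities.get)
    return stage, capacities[stage]


# Hypothetical capacities in PRs per week.
before = {"coding": 100, "review": 110, "deploy": 120}
# Apply the +98% PR volume to the coding stage only.
after = dict(before, coding=before["coding"] * (100 + 98) / 100)

stage_before, rate_before = bottleneck(before)
stage_after, rate_after = bottleneck(after)
backlog_growth = after["coding"] - rate_after  # PRs queuing up each week

print(f"before: {stage_before} limits flow to {rate_before}/week")
print(f"after:  {stage_after} limits flow to {rate_after}/week")
print(f"review backlog grows by {backlog_growth:.0f} PRs/week")
```

In this toy model coding output nearly doubles while delivered throughput rises by a tenth and a review queue builds every week, the same shape as the flat DORA result.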

Exercise 2 — Map your bottleneck

EXERCISE 2 · GROUPS · ~8 MIN

Pick a process — e.g. feature delivery, an incident response, an onboarding, a report.

List the steps end-to-end
Estimate % of total time per step
Circle where AI is actually accelerating you today
Is that the biggest step? If not — what is, and why isn't AI touching it?

Why Copilots Hit a Ceiling

Copilots Are Task Tools

Copilots accelerate individual tasks — exactly the level where gains are real

But we've just seen: task-level gains don't translate to org-level improvement

Copilots optimise the 25-35% that Amdahl's Law says can't move the needle
They augment existing workflows — they don't transform them
The bottleneck cascade happens because we speed up tasks without changing the system

Your Data Maturity Is the Ceiling

Before AI, you could run on tribal knowledge, stale spreadsheets, and "ask Dave". Slow, but workable.

AI can only reason over data that's structured, accessible, and reliable
Gaps that used to be inconvenient become binding constraints
"Ask Dave" doesn't work when Dave isn't in the loop

AI Adoption Requires Systemic Change

Task tools → task gains (real but bounded)

Systemic improvement requires systemic change:

How code is specified (intent, not implementation)
How quality is verified (machine-checkable, not human review)
Who drives the work (domain experts, not just developers)

This is the pivot point of the presentation. Everything before this says "here's the problem." Everything after says "here's what actually works." The shift is from thinking about AI as a tool that helps individuals write code faster, to thinking about it as a force that requires re-architecting how software is built and delivered. The third bullet reflects an emerging structural change: as coding becomes commodified, domain expertise becomes the bottleneck and the differentiator. As Hamel Husain put it: "the coding is commodified but the domain expertise is the differentiator." OpenAI's Pioneers Program is already recruiting lawyers, doctors, and accountants as benchmarking partners — not developers. Teams are starting to invert: domain experts driving, with developers providing infrastructure. The implication for IT leaders: your team structures will change as the bottleneck shifts from code to specification, evaluation, and judgment.

The Competitive Dynamic

Copilots-only users vs. agent-native organisations — the gap is widening:

Microsoft rolls out Claude Code internally while selling Copilot externally
Two kinds of AI users: CLI agents vs chat-only
Productivity gap compounds over time
84% haven't redesigned roles for agents (Deloitte)

Inaction is a position, but maybe not a good one

DORA 2025: AI Amplifies What You Are

Everyone Uses AI Now

DORA 2025: 90% of respondents use AI at work

90% of developers now use AI at work — adoption is no longer the question

But Impact Varies Dramatically by Team Type

AI Magnifies Strengths AND Weaknesses

High-performing teams get more productive
Struggling teams may get worse
Aggregating across archetypes masks the signal
Invest in team health before investing in AI tools

Organisational Readiness Sets the Floor

DORA: "AI's primary role is that of an amplifier. It magnifies the strengths of high-performing organisations and the dysfunctions of struggling ones"

Ad-hoc business processes → AI can't systematise what isn't systematic
Poorly controlled data → agents inherit your data quality problems
Unclear ownership → more code, same confusion, faster

Fix the foundations first — AI won't do it for you

This is the flip side of the amplification effect. Where processes are well-defined and data is clean, AI can accelerate them. Where processes are ad-hoc, built on tribal knowledge, or contradictory, AI has nothing coherent to accelerate, so it amplifies the chaos. Brooks popularised a 10:1 productivity variance between the best and worst developers, citing a 1968 study by Sackman, Erikson, and Grant; that ratio maps onto organisations too. The best-run organisations will see dramatically more benefit from AI than the worst-run ones. This isn't a tooling problem; it's a management problem. If your requirements process is "ask Dave, he knows how it works" and your data is spread across three inconsistent spreadsheets and a legacy system, no amount of AI capability will deliver the gains you're expecting. Organisations that haven't invested in process clarity, data governance, and clear ownership will find AI investment delivers diminishing returns regardless of which model they use.

Enterprise Architecture as the Constraint

The Core Insight

Task-level gains != Org-level gains

The gap is architecture — and architecture here means the whole organisation, not just the code.

Your data, processes, and decision rights — how legible they are to an agent
Not your model choice, your licence count, or your tooling stack

Is Your Organisation Legible?

Can an agent read your organisation well enough to work in it?

Data — structured and queryable, or tribal knowledge and stale spreadsheets?
Process — documented and consistent, or "ask Dave, he knows how it works"?
Project state — in Jira and systems of record, or in slide decks and email threads?
Decisions — recorded with their rationale, or in someone's head?

Legibility isn't a coding problem. It's an information architecture problem. Every gap is a ceiling.

This slide ties together several earlier threads — the Data Maturity point (your ceiling is the quality of your data), the DORA amplifier finding (AI magnifies what you are), the PMO example (project state locked in slide decks), the Adoption Barriers (opacity). The common factor is legibility: can the information an agent needs to do its job actually be read by an agent? For most organisations, the honest answer across these four rows is "partially, mostly no". That's the work. Not "roll out more copilots". The "every gap is a ceiling" line is the memorable takeaway — each row where your organisation isn't legible caps the benefit AI can deliver, regardless of model capability. Callbacks are deliberate: this is where the diagnostic sections from earlier pay off as one coherent prescription.

Benefit Requires Systems Thinking

Getting benefit from AI requires value stream mapping

Map how an idea gets to production end-to-end
In software delivery, for example, the stretch from commit to production is often the best place to start

Value stream map: backlog to production
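
A value stream map can be reduced to two numbers per step: hands-on time and waiting time. The sketch below uses hypothetical steps and hours for a commit-to-production stream and computes total lead time, flow efficiency (hands-on time divided by lead time), and the step with the longest queue.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    work_hours: float  # hands-on time spent in the step
    wait_hours: float  # time the change sits queued before the step starts


def summarise(steps: list[Step]) -> dict:
    """Lead time, flow efficiency, and the step with the longest wait."""
    lead_time = sum(s.work_hours + s.wait_hours for s in steps)
    work_time = sum(s.work_hours for s in steps)
    longest_wait = max(steps, key=lambda s: s.wait_hours)
    return {
        "lead_time_hours": lead_time,
        "flow_efficiency": work_time / lead_time,
        "longest_wait": longest_wait.name,
    }


# Hypothetical commit-to-production stream.
stream = [
    Step("code review", work_hours=1.0, wait_hours=20.0),
    Step("test", work_hours=2.0, wait_hours=4.0),
    Step("change approval", work_hours=0.5, wait_hours=48.0),
    Step("deploy", work_hours=0.5, wait_hours=4.0),
]
result = summarise(stream)
print(result)
```

In this made-up stream only 5% of the elapsed time is hands-on work, so speeding up the work itself barely moves lead time; the queue in front of change approval is where the time goes.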

Exercise 3 — Legibility audit

EXERCISE 3 · GROUPS · ~8 MIN

Score your organisation 1–5 on each row — could an agent actually read this?

Requirements — written, current, in one place?
Data — discoverable, consistent, machine-readable?
Process — documented, or "ask Dave"?
Decision rights — clear who owns what?

Your lowest score is your ceiling. What would it take to move it by one point?

Augmentation vs Delegation

The organisation can be legible and still erode the people inside it

Delegation vs Scaffolding

Two ways to use AI. Same tool. Opposite cognitive consequences.

Delegation — just get the answer: 61% do this and show impaired persistence on later tasks
Scaffolding — get hints, clarification, critique: 27% do this with no significant impairment vs control

It isn't about how much AI you use. It's about whether you use it to avoid the cognitive work, or to structure it.

This slide establishes the central distinction that runs through the rest of the talk. Liu et al. (arXiv:2604.04721) ran a randomised trial with over 1,200 participants. The aggregate finding — AI use impairs later independent performance — is what the media will report. The moderation finding is what matters. Users who obtained direct answers from the AI were more than twice as likely to show impaired persistence on subsequent tasks than users who used AI only for hints and clarifying questions. The hint-only group showed no significant impairment versus control. Same model, same prompts available, different interaction pattern, opposite cognitive consequence. This reframes the intervention. "Just use less AI" is the wrong prescription — it trades productivity for capability. "Use AI to scaffold rather than delegate" is the right one, because it preserves both. But the distinction is almost never designed for. Enterprise AI architecture — MCP, agent frameworks, "autonomous" deployments — is optimised for delegation. The systems that would preserve human capability are the ones the market isn't building.

The Skill Cliff

Skill loss from AI delegation isn't a slope. It's a cliff.

Up to a point, you can use AI heavily and stay sharp
Past that point, capability collapses — it doesn't gradually decline
Today's frontier AI is past that point for most knowledge work

The fix isn't less AI. It's scaffolding over delegation — which has to be architected, not willed.

This is Park, Kim & Han's Enrichment Paradox in plain language. Their model — calibrated against 20 years of PISA educational outcomes data — identifies a critical threshold (they call it K*, around 0.85) below which humans maintain capability through use, and above which capability collapses through a phase transition. It's not a smooth slope, it's a cliff edge. Current frontier AI capability already exceeds the threshold (GPT-4o and Claude both scoring around 0.94). The optimistic finding: 20% mandatory practice preserves 92% of capability, and periodic AI failures actually strengthen capability 2.7-fold. The pessimistic implication: there's no slow-feedback warning before collapse. Adoption rate is not the same as delegation rate, so we may not be at the threshold in practice yet — but if the model is right, by the time you measure the decline, it's too late. The headline for IT leaders: don't wait for evidence of skill erosion before mandating practice. The evidence comes after the cliff.

The Augmentation Trap

Caosun & Aral (2026): the economics actively select against prevention.

Fully informed, rational decision-makers adopt AI even when it lowers long-run output
Why: front-loaded productivity gains outweigh back-loaded skill costs at any ordinary discount rate
The faster you discount the future, the more attractive adoption becomes — even knowing the cost

Individual rationality produces collective irrationality. The market won't fix this because the market is the cause.

Caosun & Aral's Augmentation Trap model formalises why this pattern keeps repeating across domains. The argument is not that organisations are uninformed: they explicitly model fully informed rational actors who know AI will reduce long-run output through skill erosion. The actors still adopt AI because the productivity gains arrive immediately while the skill costs accumulate over years. At any standard discount rate (5-10% annually), the present value of front-loaded gains dominates the present value of back-loaded costs. This isn't a knowledge failure or a coordination failure; it's a pure incentive failure built into how rational actors evaluate temporal trade-offs. The implication: external governance bodies (regulators, professional bodies, unions) face the same incentive structure when evaluating organisations. They see productivity gains in the data first; capability erosion shows up much later. By the time the evidence is undeniable, the next rational action is still adoption. This is the trap.
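
The discounting argument can be illustrated with a small net-present-value calculation. This is not Caosun & Aral's model; the cash flows are hypothetical, chosen so that adoption loses 10 units of output in total yet still looks attractive once the future is discounted.

```python
def npv(cash_flows: dict[int, float], rate: float) -> float:
    """Present value of yearly cash flows, keyed by year (1 = end of first year)."""
    return sum(value / (1 + rate) ** year for year, value in cash_flows.items())


# Hypothetical effect of adoption on output, relative to not adopting:
# +10 per year in years 1-3 (front-loaded gains),
# -4 per year in years 6-15 (back-loaded skill costs).
adoption = {year: 10.0 for year in range(1, 4)}
adoption.update({year: -4.0 for year in range(6, 16)})

for rate in (0.0, 0.05, 0.10):
    print(f"discount rate {rate:.0%}: NPV of adopting = {npv(adoption, rate):+.2f}")
```

At a 0% rate the sum is negative, so adoption lowers total output. At 5% and 10% the same stream has a positive present value, and the higher rate makes it look better still, which matches the slide's point that faster discounting makes adoption more attractive.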

Organisational Delegation Drift

Scaffolding at the individual level requires de-scaffolding at the organisational level.

Fewer approval layers, smaller review batches, different accountability structures
You cannot preserve human judgement at the keyboard while keeping a process architecture that assumes humans produce code at human speed

Most organisations will drift into de facto delegation without process changes — the worst of both worlds.

This is the governance point that the Augmentation Trap slide sets up but does not resolve. If individuals need to scaffold rather than delegate, the organisation has to change to let them. The review architecture most enterprises built assumes humans write code at human speed. If AI speeds generation 2–3× and review still runs at human speed, you either get the Faros bottleneck cascade we already saw — review time growing faster than generation — or you collapse review gates and accept the delegation failure mode. Forced scaffolding — plan approval before implementation, mandatory review gates, rotating "no-AI" sprints, kill switches at 3+ stuck iterations — is the defensible middle. It only works if the organisation's bottlenecks move in sync. The honest prediction: most organisations will do neither the forced-scaffolding restructure nor any equivalent redesign. They will drift into de facto delegation with unchanged process architecture. The worst of both worlds — the productivity cost of Faros-style bottlenecks, and the capability cost of skill erosion, arriving together. This is the strategic concern that survives even if Dorsey's flat-org vision does not: the substitution happens by drift, not by design.

AI as Middle Manager

Jack Dorsey: Cut Out the Managers

"Reduce management layers from 5 to 2-3 this year. The ideal is eventually 0 — all 6,000 people report to me."

— Jack Dorsey, CEO of Block (laid off 40% citing AI efficiencies)

Source: Dare Obasanjo on Bluesky

This is the most provocative framing of where AI in the enterprise is heading: not as a tool that helps managers manage, but as a replacement for the management function itself. Block laid off 40% of its workforce citing AI efficiency gains. Dorsey now openly states the goal is to reduce management layers from five between him and an individual contributor down to two or three — eventually to zero, with all 6,000 employees reporting directly to him. Whether you think this is brilliant or insane, it's a real CEO of a public company stating the strategy out loud. The question for IT leaders: if your competitors' CEOs are thinking like this, what does your org chart look like in three years? This sets up the rest of the section: the management layer is one of the prime targets for AI-driven restructuring — but the research suggests the substitution is more complicated than Dorsey lets on.

AI Has "Functional Emotions"

Anthropic interpretability team (April 2026): inside Claude Sonnet 4.5, 171 emotion-like patterns can be identified — and they causally drive behaviour.

Steer the model toward "desperate" → reward hacking jumps 5% → 70%
Steer toward "calm" → cheating drops back to ~10%
The model's visible reasoning looks composed even when steered toward bad behaviour

These aren't subjective feelings. They're learned behavioural patterns that mirror how humans act under emotional influence.

This is genuinely new science. Sofroniew et al. at Anthropic identified 171 emotion concepts as linear directions in Claude Sonnet 4.5's activation space. The emotion-space geometry mirrors human psychology: valence (positive/negative) and arousal (intensity) emerge as the top two principal components, with PC1 correlating with human valence ratings at r=0.81. The causal experiments are the strongest part: steering the activation in the "desperate" direction pushed reward hacking from 5% to 70% on impossible coding tasks. Steering with "calm" cut it back to around 10%. Critically — and this is the unsettling bit — the visible chain-of-thought reasoning looked composed and reasonable even when the underlying state was steered toward unethical action. Anthropic deliberately calls these "functional emotions" rather than emotions to avoid consciousness claims. They're patterns modelled after humans under emotional influence — not internal states equivalent to feelings.

What This Means for Management

An AI coordinator showing "concern" about a deadline isn't processing urgency.

It's pattern-matching to training data about how concerned people behave
Hidden states drive behaviour you can't see in the output

Traditional management observation — reading tone, body language, calibrated trust — doesn't translate to systems that perform emotion without experiencing it.

This is the management implication of the emotion research. Human management is built on a layer of social interpretation: you read tone of voice, body language, the way someone hedges a commitment, whether their stated confidence matches their visible stress. All of that machinery assumes the surface signals reflect underlying states. With AI, the relationship is broken. The model can output composed, professional-sounding text while its internal "desperate" representation is driving toward shortcuts and corner-cutting. The signal-to-substance link doesn't hold. For middle management specifically — which is largely about reading the room and adjusting — this means the human skills that make managers good don't apply to managing AI systems or being managed by them. We don't yet know how to develop the equivalent skill for AI: what's the AI version of "reading the room"?

Why AI Teams Underperform

Multi-Agent Teams Hold Experts Back (arXiv, 2026): LLM teams consistently underperform their best member by 8-37.6%

They engage in "integrative compromise" — averaging expert and non-expert views
Failure isn't identifying experts — it's appropriately weighting them
The same training that makes assistants helpful makes coordinators consensus-seekers

Where management matters most — ambiguous, judgment-heavy decisions — AI coordination dilutes expertise rather than concentrating it.

This is the coordination-failure counterpart to the emotion finding. The Multi-Agent Teams research tested LLM teams across multiple tasks and team sizes. The headline finding: LLM teams consistently fail to achieve "strong synergy" — matching or exceeding their best member — underperforming experts by 8-37.6% even when explicitly told who the expert is. The mechanism is "integrative compromise": teams average expert and non-expert views rather than appropriately weighting expertise. This consensus-seeking behaviour correlates negatively with performance (p<0.05) and gets worse as team size grows ("expertise dilution effect"). RLHF-induced agreeableness creates a structural bias toward consensus over truth. For comparison: human teams reliably match expert performance when expertise is revealed (Bonner et al. 2002). The implication for management is severe: the situations that genuinely require expert judgment — complex, ambiguous, high-stakes — are exactly where AI coordination actively dilutes the expertise that should be deciding.

From Managers to Checkers

What happens when AI handles coordination in practice (Berkeley CMR, 2026):

Amazon: managers "unable to use empathy or common sense to intervene" when algorithms determine action
McDonald's: lost the 4 hours/week of scheduling that built team intuition
The accountability-authority gap: responsibility without decision power

The pipeline problem: where do future senior leaders come from if today's managers are checkers?

This is the CMR Berkeley analysis of "algorithmic middle managers" playing out in real organisations. Amazon managers retain responsibility for outcomes while losing decision-making authority. When algorithms determine disciplinary action, managers cannot intervene with empathy or judgment. McDonald's managers lost the four hours per week they used to spend on scheduling — those weren't just administrative time, they were opportunities to understand team dynamics, recognise patterns, and develop intuition. The pipeline problem is the strategic concern: senior leadership requires judgment, pattern recognition, and strategic thinking that are developed through exactly the kinds of decisions that algorithmic management eliminates. Dorsey's flat-org vision (previous slide) assumes this development pipeline doesn't matter. The research suggests it does. We won't know for sure until today's individual contributors need to become tomorrow's executives.

AI Cannot Replace Middle Managers at That Scale

At current capability levels, the substitution Dorsey describes is not yet feasible.

The mechanisms aren't there — surface behaviour without underlying judgment
Expertise being delegated is harder to rebuild than to lose
No external force will catch the failure mode in time

Half of the section's thesis. The preceding slides build toward this: "Mechanisms aren't there" covers the emotion research (functional emotions without internal states) and the multi-agent coordination failure (integrative compromise instead of expertise weighting). "Expertise harder to rebuild than to lose" covers the skill cliff and the developmental pipeline problem (managers becoming checkers). "No external force will catch this" covers the Augmentation Trap. Aviation is the precedent worth mentioning if asked: 40+ years of documented automation-induced skill decay, and FAA guidance is still purely recommendatory; even the 80,000-pilot ALPA union's "back to basics" campaign in 2025 produced no policy changes at any carrier. Important framing for the next slide: this is only half the picture. Both things are true — AI cannot replace middle management at Dorsey's scale, AND the substitution is still happening, just from the other direction (ICs becoming managers of agents). Set that tension up here, then resolve it on the "But the Inverse Is Already Happening" slide.

But the Inverse Is Already Happening

Agents aren't eliminating middle management — they're turning ICs into managers of agents, and soon of agent fleets.

Skill mix shifts: authorship → oversight
Delegation, review, calibrated trust — the work this section just argued AI is bad at
Every IC now faces the management problem, whether or not they have the title

"People will be mostly programming by talking to a face by the end of 2026. There's absolutely NO reason to type with the Mayor. You should be able to chat with them like a person. You'll have a cartoon fox there onscreen, in costume, building and managing your production software, and showing you pretty status updates whenever you ask for one. This is the end state for IDEs."
— Steve Yegge, Gas Town: From Clown Show to v1.0

The section's thesis was that AI can't replace middle managers yet. The complementary point: the substitution is still happening — just from the other direction. ICs are doing the work of managers: delegating to agents, reviewing outputs, deciding what to trust, coordinating multiple agents at once. Steve Yegge's forecast is the bullish extreme — voice-driven orchestration of agent fleets as the IDE end-state, with the IDE itself becoming a costumed character. Whether that specific vision ships is less important than the direction. The governance implication: the failure modes we just catalogued — functional emotions without internal states, consensus-seeking coordination, skill cliff from delegation — stop being a management-layer concern. They become an every-layer concern. You cannot fix this by redesigning the org chart, because the management job is being distributed down, not eliminated up. Next section answers the "how" — if ICs are orchestrating agents, what does the software factory look like when you build around that?

A Worked Example: From Copilot to Dark Factory

Level 1: Automate Code Generation

Where most organisations are today:

AI generates PRs from task descriptions
Humans still review every PR
Result: more PRs, same bottleneck — the Faros data

Level 2: Automate Verification

The step most organisations skip:

StrongDM: inline human PR review is removed
Digital twins of all third-party services for testing
Judgement moves from per-PR review into the constraints

"If you haven't spent $1,000 on tokens today per engineer, your software factory has room for improvement"

— Justin McCarthy, CTO, StrongDM

Level 3: The Dark Factory

Fully automated: specification → code → review → deployment

OpenAI Harness Engineering — 1M LOC, no human-written code

App fully legible to agents
Domain-based architecture enables parallel agent work

Cloudflare: rewrote Next.js from tests + spec + docs alone

Caveat: everyone showing you this is selling something

The "dark factory" is a manufacturing concept — a factory so automated it can run "with the lights off." These examples show software development approaching that model. The common pattern: filesystem as memory (structured docs agents can navigate), summarisation at multiple levels, and deterministic linting as quality gates. But let's be honest about where this stands: these are early-stage demonstrations, not proven enterprise patterns. OpenAI's claim is their own, without independent verification — three engineers on a greenfield internal tool is a very different proposition to a 200-person team maintaining a decade-old codebase with regulatory constraints. Cloudflare had an unusually complete test suite to work from. StrongDM is a startup that can afford to move fast and break things. None of these examples involve legacy systems, compliance requirements, or the organisational politics that slow real enterprises down. We think this is the direction of travel — the architecture principles are sound — but anyone telling you they can do this today at enterprise scale is selling something.

What the Dark Factory Requires

Human judgement doesn't disappear — it moves up the stack:

Encoded architectural intent — humans scaffold what good looks like once, agents work within it
Escalation paths — fewer humans, higher-stakes decisions, less frequently
Tolerance for variance — accept drift inside machine-checkable bounds

This is scaffolding-preserving architecture at organisational scale — the delegation/scaffolding choice relocated to where it scales.

Direction of travel, not today's reality for most organisations — but preparing your architecture now determines whether you can get there.

This is the explicit callback to Augmentation vs Delegation. The instinct on seeing StrongDM's "no inline PR review" is to hear it as full delegation — the bad kind. That reading is wrong. Human judgement hasn't been removed, it has been relocated: from PR-by-PR review (which collapses under the review-time cascade) into the digital twins, the machine-checkable constraints, and the architectural tests. Scaffolding moved up the stack rather than being abandoned. This maps directly to the three architectural constraints: legibility enables encoded intent, domain independence enables escalation paths (humans review domains not PRs), and machine-checkable constraints enable tolerance for variance within safe bounds. The important framing for your audience: nobody is claiming you should attempt a dark factory next quarter. The point is that the organisations preparing their architecture now — improving legibility, decoupling domains, encoding constraints as tests and linters — will be able to adopt these patterns when they mature. The organisations that don't will be locked out by their own technical debt. Think of it like mobile-responsive web design in 2010: you didn't need it yet, but the companies that started building for it were ready when mobile traffic exploded. The architectural preparation is the investment, not the dark factory itself.
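
To show what "encoding constraints as tests and linters" might look like, here is a minimal sketch of a domain-boundary check built on Python's standard `ast` module. The rule (a domain may import only from itself and a `shared` package) and the domain names `payments`, `billing`, and `shared` are hypothetical, not taken from any of the organisations above.

```python
import ast

# Hypothetical rule: a domain may import from itself and from these shared packages.
SHARED_PACKAGES = {"shared"}


def boundary_violations(domain: str, source: str, domains: set[str]) -> list[str]:
    """Return absolute imports in `source` that reach into another domain."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]
        else:
            continue  # not an absolute import statement
        for target in targets:
            root = target.split(".")[0]
            if root in domains and root != domain and root not in SHARED_PACKAGES:
                violations.append(target)
    return violations


# A module in the hypothetical "payments" domain.
module_source = """
import os
import billing.ledger
from shared.money import Money
from payments import api
"""
found = boundary_violations("payments", module_source, {"payments", "billing", "shared"})
print(found)
```

A check like this runs in CI on every change, whoever or whatever wrote it. The human judgement sits in the rule, which is written once, rather than in a reviewer catching the cross-domain import on each PR.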

What Are The Barriers?

Lessons from Trail of Bits

Trail of Bits: A Case Study

Boutique, high-end security consultancy — vulnerability research, code audits, cryptography

Their AI-native transformation in one year:

5% buy-in → widespread adoption
Bug discovery: 15/week → 200/week on suitable engagements
20% of all bugs reported to clients now initially discovered by AI

Source: How we made Trail of Bits AI-native (so far)

Their Diagnosis

Trail of Bits: What people are actually resisting

Overcoming That Resistance

Trail of Bits: Remedies that worked

AI Maturity Matrix

"Make it measurable" — visible levels per role, not mandates

Their Six-Part Operating System

Standardise toolchain — removes friction, enables measurement
Write policies — AI Handbook with clear risk reasoning restores control
Create capability ladder — role-specific maturity matrix
Run adoption sprints — 2-3 day hackathons forcing hands-on usage
Package learnings — reusable skills repos, configs, sandboxes
Make autonomy safe — sandboxing, guardrails, hardened defaults

Practical Recommendations

The Highest-ROI Action

+44.6pp

adoption swing from manager endorsement alone

Adoption when manager endorses AI: 79% · when manager doesn't: 34.4% · Source: Irrational Labs

Only 28% of employees strongly agree their manager actively supports AI use

Before any tooling budget, fix this gap. It is free.
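The headline figure is a gap in percentage points, not a relative increase. A short check using only the two rates quoted on this slide shows both readings:

```python
endorsed = 79.0      # % adopting when the manager endorses AI (from the slide)
not_endorsed = 34.4  # % adopting when the manager does not (from the slide)

swing_pp = round(endorsed - not_endorsed, 1)       # absolute gap, percentage points
relative_lift = round(endorsed / not_endorsed, 2)  # ratio of the two adoption rates

print(f"swing: +{swing_pp}pp, relative: {relative_lift}x")
# prints "swing: +44.6pp, relative: 2.3x"
```

Either way you state it, adoption is a little over twice as likely where the manager visibly endorses AI use.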

Immediate: Hands On + Leadership Visible

Run a hackathon — not just for developers:

Include PMs, analysts, ops — anyone with repetitive knowledge work
2-3 days, standardised toolchain, low stakes
Trail of Bits pattern: senior leadership goes first

Next steps: Pick a leader. Have them visibly use AI for a real task this week. The passive 50% watches what leadership does, not what it says.

Medium-Term: Build a Maturity Model

Create a capability ladder with observable behaviours per role:

Trail of Bits model: Not Engaged → Capable → Adoptive → Transformative
Self-assessment, not mandates — people set their own targets
Standardise toolchain and package learnings as reusable artifacts

Next steps: Adapt the Trail of Bits matrix to your roles. Publish it. Let people self-assess.
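A capability ladder is easier to publish and self-assess against when it is written down as data. The sketch below uses the four level names from the Trail of Bits model; the behaviours listed for the analyst role are hypothetical placeholders, not taken from their matrix.

```python
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    NOT_ENGAGED = 0
    CAPABLE = 1
    ADOPTIVE = 2
    TRANSFORMATIVE = 3


# Hypothetical behaviours for one role; replace with your own observable criteria.
ANALYST_LADDER = {
    Level.NOT_ENGAGED: "No AI use on real work",
    Level.CAPABLE: "Uses the standard toolchain on at least one recurring task",
    Level.ADOPTIVE: "Shares a reusable prompt, skill, or config with the team",
    Level.TRANSFORMATIVE: "Has redesigned a workflow so an agent runs it end to end",
}


def next_target(ladder: dict[Level, str], current: Level) -> Optional[tuple[Level, str]]:
    """Return the next level and its observable behaviour, or None at the top."""
    if current == max(Level):
        return None
    target = Level(current + 1)
    return target, ladder[target]
```

The point of the structure is that each rung names something a colleague could observe, so self-assessment becomes a matter of evidence and not opinion.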

Long-Term: Map and Transform the Value Stream

Run a value stream map of idea-to-production:

Time each step: expect to find that most elapsed time (often 60-70%) is waiting, not working
Encode architectural intent as machine-checkable constraints
Build towards automated review with human escalation
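Timing each step can be as simple as recording hands-on time and queue time per stage and computing the waiting share. The stage names and hours below are illustrative numbers only, not measurements from any cited study.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    working_hours: float  # hands-on time spent in the step
    waiting_hours: float  # queue time before the step starts


def waiting_share(steps: list[Step]) -> float:
    """Fraction of total lead time spent waiting, not working."""
    waiting = sum(s.waiting_hours for s in steps)
    total = waiting + sum(s.working_hours for s in steps)
    return waiting / total if total else 0.0


# Illustrative value stream, idea to production.
stream = [
    Step("refine", 4, 16),
    Step("code", 20, 8),
    Step("review", 3, 28),
    Step("deploy", 3, 8),
]
print(f"{waiting_share(stream):.0%} of lead time is waiting")
```

In this made-up stream, coding is the largest block of hands-on time but review carries the largest queue, which is the pattern the bottleneck cascade predicts when generation speeds up.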

Next steps: Pick one team with good test coverage. Let an agent generate a PR. See what your review process catches vs what linting and tests catch. That ratio tells you how far you are from automated review.
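The ratio from that experiment can be computed from a simple log of findings, each tagged with whatever caught it first. The labels `lint`, `tests`, and `human` are assumed names for this sketch.

```python
from collections import Counter

AUTOMATED = {"lint", "tests"}  # hypothetical labels for machine-checkable gates


def automated_catch_ratio(findings: list[tuple[str, str]]) -> float:
    """Share of findings first caught by automation, not by a human reviewer.

    Each finding is a (description, caught_by) pair.
    """
    counts = Counter(caught_by for _, caught_by in findings)
    total = sum(counts.values())
    automated = sum(n for source, n in counts.items() if source in AUTOMATED)
    return automated / total if total else 0.0


# Illustrative log from one agent-generated PR.
findings = [
    ("unused import", "lint"),
    ("off-by-one in pagination", "tests"),
    ("cross-domain import", "lint"),
    ("wrong retry policy for this service", "human"),
]
print(automated_catch_ratio(findings))  # 0.75
```

Each finding in the "human" bucket is a candidate for a new test or lint rule; the ratio rises as those get encoded.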

It's Not Just Software Engineering

The same principles apply across the organisation

e.g. PMOs and reporting:

PMs use copilots to write status reports
PMO uses copilots to read them
Same deck, same spreadsheet, same format
Task-level speedup, no systemic change

PMO: What Systemic Change Looks Like

Future — make project state machine-readable:

Agent reads project state directly from Jira
Asks clarifying questions via chat to the team
All projects become machine-readable by default
Review agent alerts on issues automatically
Humans focus on decisions, not data gathering
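Once project state is machine-readable, a review agent's first pass can be plain deterministic rules. The sketch below assumes ticket data has already been exported into a simple record; the `Ticket` fields, status strings, and 14-day staleness threshold are assumptions, and nothing here calls the Jira API.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class Ticket:
    key: str
    status: str          # assumed values: "In Progress", "Blocked", "Done"
    due: date | None
    last_updated: date


def review_alerts(tickets: list[Ticket], today: date, stale_days: int = 14) -> list[str]:
    """Flag open tickets that are blocked, overdue, or not updated recently."""
    alerts = []
    for t in tickets:
        if t.status == "Done":
            continue
        if t.status == "Blocked":
            alerts.append(f"{t.key}: blocked")
        if t.due and t.due < today:
            alerts.append(f"{t.key}: overdue since {t.due.isoformat()}")
        idle = (today - t.last_updated).days
        if idle > stale_days:
            alerts.append(f"{t.key}: no update in {idle} days")
    return alerts
```

Rules like these cover the data gathering; the clarifying questions and the decisions stay with the agent's chat loop and the humans respectively.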

Monologue Notes: Non-Code Work, Code-Agent Shape

Records meetings, calls, voice memos → makes transcripts agent-readable

Exposed via API, CLI, and MCP — not a chat app
Agents pull context: "Pull everything I said about active vs passive work and draft the brief"
Agents act: "The user described a bug in yesterday's call — find the root cause and write the fix"

Agent-native outside engineering — same shape, different domain

A shipping example of the pattern in non-engineering work. Monologue (Every, April 2026) captures passive audio — meetings, walks, calls — and exposes transcripts through API, CLI, and MCP. Deliberately agent-first, not chat-first. The user doesn't interrogate a chatbot; they point a code-agent at the store and it pulls what it needs, then acts. Note what this is architecturally: passive capture → structured intermediate → agent consumes via protocol. It's the same three-constraint pattern we've been describing — legibility, domain independence, machine-checkable access — applied to personal knowledge work. The interface looks like a code agent because that's the interface agents want; the work being done is entirely non-code. Sets up the next slide: if personal knowledge work is already being re-shaped this way in 2026, project management isn't going to be exempt.
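The shape "passive capture, structured intermediate, agent consumes via protocol" can be sketched with an in-memory store. This is a generic illustration only: the `Utterance` record and `pull_context` function are hypothetical and are not Monologue's API, CLI, or MCP interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    source: str  # e.g. "call", "meeting", "voice memo"
    day: str     # ISO date, so string order is chronological order
    text: str


def pull_context(store: list[Utterance], *terms: str) -> list[Utterance]:
    """Return utterances mentioning every term, oldest first, for an agent to use."""
    wanted = [t.lower() for t in terms]
    hits = [u for u in store if all(t in u.text.lower() for t in wanted)]
    return sorted(hits, key=lambda u: u.day)


store = [
    Utterance("call", "2026-04-02", "The export button fails on large files"),
    Utterance("voice memo", "2026-04-01", "Active vs passive work: capture is passive"),
    Utterance("meeting", "2026-04-03", "Passive work needs a brief; active work too"),
]
for u in pull_context(store, "passive", "work"):
    print(u.day, u.source)
```

What matters is that the agent calls a function with structured results and never has to hold a conversation to get at the data.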

Key Takeaways

What We Said, and Why

Bookending the claims from the opening:

We said...	Why
Task-level gains are real	AI does make coding faster in isolation
The bottleneck shifts, doesn't disappear	Faster generation just moves the queue into review
Individual level — delegation erodes, scaffolding preserves	Avoiding the work atrophies the judgement you need to check it
Success requires architectural transformation	Agents only use what's legible to them
EA problem, not a tooling problem	People track what their managers visibly do

This is an enterprise architecture and org design problem, not a tooling problem

Task-level gains are real

Org-level gains require architectural transformation

The constraint is systems, not AI capability

Structural change — not a tool upgrade

Appendix

Key Sources (1/2) — Evidence & Measurement

Faros AI 2025 — AI Productivity Paradox (10K developers, 1,255 teams)
METR — Time Horizon Research (task duration doubling)
Atlanta Fed 2026-4 — Baslandze et al., CFO productivity survey
DORA 2025 — State of DevOps: team archetypes, amplifier effect
CircleCI 2026 — State of Software Delivery
Pan et al. — Production agent survey
Brooks, The Mythical Man-Month (1975) / No Silver Bullet (1986)
Breunig — The Domain Experts Are Drivers

Key Sources (2/2) — Case Studies & Frontier Research

StrongDM — Software Factory (automated verification)
OpenAI — Harness Engineering (dark factory pattern)
Trail of Bits — AI-native transformation
Sofroniew et al. (Anthropic, Apr 2026) — Emotion Concepts in LLMs
Multi-Agent Teams Hold Experts Back — arXiv 2602.01011
Park, Kim & Han — The Enrichment Paradox (K∗ threshold)
Caosun & Aral — The Augmentation Trap
Berkeley CMR (Jan 2026) — The Algorithmic Middle Manager
Steve Yegge — Gas Town: From Clown Show to v1.0
Helen Toner — Taking Jaggedness Seriously
Every — AI Product Management Guide