DEV Community

Matthew Hou


Developers Think AI Makes Them 24% Faster. The Data Says 19% Slower.

Last month, METR published a study that should make every developer uncomfortable.

They took 16 experienced open-source developers — people who knew their codebases inside out — and randomly assigned tasks to be done with or without AI tools.

|              | Predicted   | Measured    | Post-study belief |
| ------------ | ----------- | ----------- | ----------------- |
| Speed impact | +24% faster | -19% slower | "It helped me"    |

I've been using AI coding tools daily for the better part of a year. When I read that study, my first reaction was "well, those developers must have been doing it wrong." My second reaction was: that's exactly the kind of thinking the study warns about.

The Perception Gap Is the Real Finding

The speed numbers get all the attention, but I think the important finding is the perception gap. We feel faster because AI handles the boring parts — boilerplate, syntax, the stuff that feels like work but isn't where the actual difficulty lives. Meanwhile, the hard parts get harder: understanding what AI changed, verifying it's correct, keeping a mental model of code you didn't write.

Simon Willison — the guy behind Datasette and one of the most prolific AI-assisted developers I know of — wrote something that stuck with me:

"I no longer have a solid mental model of what my projects can do and how they work."

This is a developer who's built 80+ tools with AI assistance. If he's struggling with mental models, maybe the issue isn't experience level.

Why AI Coding Tools Don't Save Time (Yet)

Here's how I think about it now:

```
Before AI:  Think → Write → Test → Debug
With AI:    Describe → Review → Verify → Debug AI → Debug your understanding
```

The writing step got cheaper. Everything else got more expensive. And "reviewing code you didn't write" is cognitively harder than "writing code you understand" — anyone who's done code review knows this.

"AI turned us all into Jeff Bezos — automated the easy work, left all the hard decisions." — Steve Yegge

The METR study essentially confirmed what a lot of us have been feeling but didn't want to admit: AI coding tools don't save time. At best, they redistribute where your attention goes. At worst, they create an illusion of productivity while the cognitive load actually increases.

How to Use AI Coding Tools Effectively (What I Changed)

I stopped optimizing for speed. Instead, I started asking: "where is my attention going?"


1. I front-load the thinking, not the prompting.

Before I touch any AI tool, I write down — in plain text — what I want, why I want it, and what "done" looks like. Not for the AI. For me. This takes 5-10 minutes and it's the most impactful thing I do all day, because it forces me to think before generating.

Kent Beck calls this the distinction between "augmented coding" and "vibe coding." The latter is hoping the AI gives you working code. The former is knowing what working code looks like before the AI writes it.
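For concreteness, here's the shape one of those plain-text briefs can take. Everything in it is an illustrative example, not a fixed template:

```
WHAT:  Add rate limiting to the export endpoint
WHY:   One client can currently saturate the worker pool
DONE:  Requests over the limit get a 429; existing clients are unaffected;
       a test covering the limit boundary passes
```

The DONE line is usually the hardest one to write, which is exactly the point: if I can't describe what finished looks like, I'm not ready to generate anything.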


2. I treat verification as the actual job.

I used to think of code review as a chore you do after the real work. Now it IS the real work. StrongDM's team took this to the extreme — their "Dark Factory" setup has zero human code review. All investment goes into tests, tools, and simulations. The humans define what correct looks like. The machines do everything else.

I'm not there yet, but the direction is clear: my value isn't in writing code. It's in defining what "correct" means for my specific context.
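A small, hypothetical sketch of what "defining correct first" can look like in practice. The function, its name, and its requirements are invented for this example; the point is that the tests exist before any implementation does, and whoever writes the code (human or AI) has to satisfy them:

```python
# Hypothetical example: the tests below are the spec, written before any
# implementation exists. The AI (or you) then writes normalize_email
# until every test passes.

def normalize_email(raw: str) -> str:
    # Implementation under test -- in this workflow, the part you'd hand
    # to the AI. Shown here so the example is self-contained.
    return raw.strip().lower()

def test_strips_surrounding_whitespace():
    assert normalize_email("  user@example.com ") == "user@example.com"

def test_lowercases_the_address():
    assert normalize_email("User@Example.COM") == "user@example.com"

if __name__ == "__main__":
    test_strips_surrounding_whitespace()
    test_lowercases_the_address()
    print("spec satisfied")
```

The function itself is trivial on purpose. What matters is that reviewing AI output collapses into "do the tests pass?", which costs far less attention than reading generated code line by line.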


3. I stopped measuring productivity in output.

More lines of code is not more productivity. More PRs is not more productivity. The Harness 2025 survey found that 67% of developers spend more time debugging AI-generated code than they would have spent writing it themselves. If that's you, generating more code faster is making things worse, not better.

The metric I care about now: how much of my attention went to decisions only I can make? Architecture choices, user-facing trade-offs, "should we even build this" — that's the stuff AI can't do. Everything else, I want to automate not because it's faster, but because it frees up mental bandwidth for the hard problems.

The Uncomfortable Truth About AI Coding Productivity

If the METR study is right — if AI tools don't actually save time for experienced developers on familiar codebases — then the value proposition of AI coding isn't "10x productivity." It's something more subtle:

The ability to spend your attention on higher-impact work, if you're disciplined enough to actually do it.

That's a much harder sell than "write code faster." It requires you to know what high-impact work looks like, and to resist the dopamine hit of watching AI generate 200 lines in 3 seconds.

I don't have this figured out. Some days I still catch myself vibe coding and pretending the output is good because it compiled. The METR study's perception gap isn't just about their participants — it's about all of us.

But at least now, when I feel productive with AI, I stop and ask: am I actually productive, or does it just feel that way?



Top comments (60)

leob • Edited

Maybe we should move away a bit from the idea of using AI tools for "coding" only, and use them more in an 'advisory' role instead, as virtual brainstorming buddies to sound ideas off of - to generate ideas ...

Coding, yes, but only for the "boring" stuff, setting up the nitty gritty of a project (tooling etc), pure boilerplate etc - not the parts where writing the code actually feels like a worthwhile thing to do!

Matthew Hou

Yeah, the "advisory role" framing resonates. I've actually been shifting toward that myself — using AI more as a thinking partner than a code generator. The best sessions I have are when I describe a problem and go back and forth on approaches before writing anything.

And you're right about the boilerplate distinction. There's a real difference between "code I need to exist" and "code I need to understand." AI is great at the first category. For the second, I'd rather write it myself and have AI poke holes in it afterward.

leob

Wholly agree! This also reminds me of another recent article on dev.to, where the author argues that actually coding, even before AI arrived, was never more than 20-25% of the work anyway (I think he mentioned an even lower percentage) - the rest is thinking, planning, testing, debugging, deploying etc ... so we're now using AI to automate part of those 20% - maybe we should see how it can help more with those 80% !

Matthew Hou

That's the reframe I keep coming back to. We're optimizing the part of the job that was already the smallest slice. The 80% — understanding requirements, debugging across system boundaries, figuring out what to build in the first place — that's where the real leverage is. I've been getting more value from using AI as a thinking partner during design than as a code generator during implementation. The code part is almost the easy part.

leob • Edited

Totally, and the advantage is also that it's low risk - you ask for advice or ideas, and then you use them or you don't - but when AI spits out a few hundred lines of code, the onus is on you to check/review it, and make sure there are no bugs or security holes in it ... I do think the whole "AI for coding" debate might need a bit of a rethink as to what 'strategies' are in fact most productive (smallest pains, biggest gains) ... keep an open mind!

Matthew Hou

Exactly — and the low-risk part is what makes it a no-brainer starting point for people who are still hesitant. You ask for advice, you evaluate it, you use it or you don't. No one's committing AI-generated code to main in that workflow. It's actually the safest possible way to get value from AI while building intuition for where it's reliable and where it falls apart. I've started calling it "advisory mode first, generation mode later" — and honestly, some tasks never graduate from advisory mode, and that's fine.

leob

Totally agree, that's also the way I see it - safety first, unless you're "vibe coding" some sort of funny hobby project and it doesn't really matter ...

Hilton Fernandes

I think AI is useful for developing code in codebases one is not acquainted with. Because it learns from existing code, it usually brings in fragments that are up to date with new and updated versions of APIs and techniques. It's useful too for routine tasks that are already very well established -- that is, boilerplate code. It doesn't particularly excel at new tasks. In that case the generated code should be seen as prototype code: it exposes problems and possible solutions, but it's not ripe and should be used to inspire the writing of useful code.

Ingo Steinke, web developer

Adapting boilerplate code is fine and valid, like create-react-app, only more generic. Our industry shouldn't have needed expensive LLMs to do that, though. Debugging? AI can understand Tailwind and TypeScript, but a legacy web project from 2016? No chance, unless it's just boilerplate from ten years ago.

Matthew Hou

"Prototype coding" is a great way to put it. That's pretty much how I treat AI output now — it's a first draft that shows me the shape of a solution, not the solution itself. Especially useful when you're working with an unfamiliar API and need to see what the integration surface looks like before committing to an approach.

The key shift for me was to stop expecting production-ready code and start expecting "good enough to learn from." Once you adjust that expectation, the frustration drops significantly.

signalstack

The 'attention redistribution' framing is the right diagnosis. Generation got cheap. Verification didn't.

I run a few AI models in production — parallel workloads, different models handling different tasks. The pull is always toward more: more agents, more parallelism, more throughput. But the real constraint doesn't change: how much cognitive load does it take a human to audit what came out?

A setup with three models producing clean, auditable outputs beats ten models producing plausible-but-questionable ones. Every time. The overhead compounds.

The point about expertise interfering with AI output is underappreciated. When you already have a strong mental model, a confident-but-wrong suggestion doesn't just waste time — it has to be actively rejected. That rejection costs more than silence would have. For a junior dev with weak priors, AI fills gaps. For someone who already knows the answer, it often adds noise you have to fight through.

The Dark Factory direction is the honest conclusion. You don't eliminate the human verification cost. You push it earlier, into test design and spec writing. Which is basically just the old TDD argument wearing new clothes.

Matthew Hou

The cognitive load point is the one most people skip over. "Just add more agents" sounds great until you're spending more time reviewing outputs than you saved generating them. I've hit that wall — at some point you realize the bottleneck was never typing speed.

And yeah, the TDD parallel is real. Writing good specs and test cases upfront is basically the same discipline, just reframed for a world where the machine writes the first draft. The skill shifts from "can I write this" to "can I define what correct looks like before anything runs."

Ingo Steinke, web developer

Where's the "dopamine hit" when AI generates 200 lines of code that should have been 20, hides at least one subtle bug in them, and adds five paragraphs of text and a desperate call to action? And when you pinpoint the error, it utters verbose excuses, fixes it, and adds two more. This is just bullshit making me even more disappointed and angry when fellow coworkers insist that AI makes them "more productive". Hope this study will open their eyes!

Matthew Hou

Ha, you're describing a very real pattern. The verbosity is genuinely one of the most annoying things — you ask for a 5-line fix and get 80 lines of refactored code plus an essay explaining why.

I think the frustration your coworkers cause is actually a separate problem from the tool itself. The tool has real limitations. But "AI makes me more productive" and "AI makes me feel more productive" can both be true for different tasks and different people. The METR data just makes it harder to hand-wave away the gap between perception and measurement.

Matthew Hou

The context switching tax is underrated. I tracked my own workflow for a week and realized I was spending ~15 minutes per session just re-establishing context after switching between tools — which model knows about this codebase, which one I already gave the architecture doc to, where did I leave off. That's not AI being slow, that's me managing AI being slow. Consolidating the interface helps, but honestly the bigger win was just picking one tool per task type and sticking with it instead of shopping around mid-flow.

Gass

Don't get trapped in the weeds. Use AI as an assistant, not for writing code. Every issue with skills degrading comes down to letting AI code for you. If you are a programmer, program, you lazy bastard. It will give you everything you need: understanding of the project, context, practice, typing speed, mental gymnastics. In every discipline, professionals need to practice to improve or maintain their skills, so don't hand that practice to the machine. It's simple, really.

Matthew Hou

The skills degradation angle is underrated. I've caught myself reaching for AI on things I used to just... do. And every time I did, the understanding got a little shallower.

That said, I don't think it's all-or-nothing. There are parts of coding where the practice builds understanding (architecture decisions, debugging, core logic) and parts where it's just mechanical repetition (config files, boilerplate wiring). I'm trying to be more deliberate about which category something falls into before deciding whether to hand it off.

david duymelinck

I read the Next.js rebuild post from Cloudflare yesterday. The part that struck me is their way of working: they define small tasks and let AI work on those.
It's a concrete example of the "AI is good at doing small things" line I'm hearing in presentations.

So I guess spec-driven AI is out and issue-driven AI is in. Like you would do if you had a team of developers.

Matthew Hou

That Cloudflare post is a great example. "Small well-defined tasks" is exactly where AI shines — it's basically the same conclusion the METR study points to, just from the other direction.

"Spec driven AI is out, issue driven AI is in" — I like that framing. Treat AI like a junior dev who's great at executing clearly scoped tickets but terrible at interpreting a vague spec. The better your issue description, the better the output. Which is, like you said, the same workflow you'd use with a human team.

Vic Chen

The perception gap finding is genuinely unsettling, and your reframing of the workflow shift is the most honest take I've seen on it.

The Before/With AI comparison hits right. Building AI-powered financial data tools, I've seen the same dynamic — the bottleneck was never code generation speed, it's always been "did we correctly specify what we wanted." AI just makes the cost of a wrong spec hit faster.

Developers who are genuinely more productive with AI are the ones who write rigorous specs and tests before touching a prompt, not the ones who iterate fastest on output. The METR data suggests the industry is confusing velocity with throughput.

One wrinkle worth adding: the 19% slowdown might be underestimating the effect for domain-specific work. When the codebase has non-obvious invariants (financial regulations, edge cases in settlement calculations, etc.), AI-generated code fails in subtle ways that take longer to debug than the time saved writing. That's the real trap.

cognix-dev

The "redistribution" framing is exactly the right diagnosis. But I'd argue it's a symptom of a design problem: most AI coding tools are optimized for generation speed, not for reducing the human verification cost that follows.
That's what we tried to address with Cognix. Instead of asking "how fast can we generate code?", the design question was "how much human attention does verifying this output require?" Multi-stage validation, quality gates before the code reaches you — the goal is minimizing the attention tax, not just moving it somewhere else.
If the bottleneck is always human verification, the tool should be designed around that bottleneck.

Matthew Hou

"How much human attention does verifying this output require" is a better design question than most AI tool companies are asking. The generation speed race feels like it's hitting diminishing returns — the bottleneck moved downstream months ago. I haven't tried Cognix yet but the framing is right. The tools that win long-term will be the ones that make review faster, not generation faster.

cognix-dev

Thanks for your reply. Your feedback has given me courage. I'll implement the approach to improve human review speed more carefully!

Waqas Rahman

Lacking a "mental model" of your code/project really slows down debugging, fixing, and especially the possibility of adding new features. AI will keep adding more files/functions for a feature where you could have guided it to reuse one already defined, because you yourself don't have a clear idea of your code.

Matthew Hou

This is one of my biggest frustrations. AI doesn't know your codebase has a perfectly good utility for exactly the thing it's about to reimplement from scratch. I've started including a "reuse these existing modules" section in my prompts, basically a mini architecture guide for the AI. It helps, but it's another thing you have to maintain. The dream is a tool that understands your codebase well enough to do this automatically.
