DEV Community

I Stopped Reviewing Code: A Backend Dev’s Experiment with Google Gemini

Ashley Childress on March 04, 2026

This is a submission for the Built with Google Gemini: Writing Challenge 🦄 I’ve been officially obsessed with AI for nearly a year now. Not from ...
 
Alois Sečkár

Whenever I skip the verification phase of AI-generated code, it backfires almost immediately. For me it is an invaluable tool for moving forward when stuck, or for starting on something you don't know how to do or don't want to do, but I am less and less confident in trusting any output. If nothing else, the proposed code is almost always unnecessarily bloated.

 
markdown

I **agree** with you!

 
Ashley Childress

I agree on this, too. In prod-level code I'm almost always asking myself how it's possible it figured that was a good idea! It definitely was not.

 
theScottyJam • Edited

For static pages, I think it's probably fine to use an LLM without doing a personal review of the code, as long as you do thorough manual testing, make sure it loads at a good speed, follows good accessibility standards, etc.

For anything else, I would never let LLMs run loose - I'd be too scared of them introducing security vulnerabilities or disastrous bugs (such as dropping database data), and I would be responsible for any damage they caused.

 
Ashley Childress

One important thing to note is that this is not a production system, which changes the game entirely! This project is a personal playground designed to test these sorts of limits. In a real prod environment, I completely agree with you!

That being said, this becomes more and more possible going forward. This particular problem is already being addressed today by things like CodeQL and Sonar scans. Thorough tests beyond the standard unit/integration suites are also fast becoming a baseline requirement.

The question is not whether AI can handle the job, but what we as engineers need to do to teach it how to do so properly.

 
ReRoutd Admin

Appreciate the honest write-up. This mirrors what a lot of backend/platform teams in the US are seeing: AI can accelerate review prep, but not replace human accountability.

The line about tests validating implementation instead of behavior is the key risk. We’ve had better outcomes when teams require:

  • contract tests against real integration boundaries
  • mutation testing for critical paths
  • and “human sign-off” gates for auth, billing, and data-deletion code
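To make the first bullet concrete, a contract test can be tiny. This is a hedged Python sketch; `get_user` and the `USER_CONTRACT` fields are hypothetical stand-ins, not anything from the post:

```python
# Hedged sketch of a contract test at an integration boundary.
# get_user() and USER_CONTRACT are hypothetical stand-ins.
USER_CONTRACT = {"id": int, "email": str, "active": bool}

def get_user(user_id: int) -> dict:
    # Stand-in for a real client call across the service boundary.
    return {"id": user_id, "email": "a@example.com", "active": True}

def check_contract(payload: dict, contract: dict) -> bool:
    """True only if every contracted field is present with the expected type."""
    return all(
        field in payload and isinstance(payload[field], expected)
        for field, expected in contract.items()
    )

def test_user_contract():
    # Pins the shape of data crossing the boundary, so a generated
    # refactor can't silently rename or retype a field.
    assert check_contract(get_user(42), USER_CONTRACT)
```

The point is that the contract is declared separately from the implementation, so an AI rewrite that passes its own unit tests still has to satisfy the boundary check.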

Curious if you tracked defect escape rate before/after this experiment? That metric usually makes the business case clear fast.

 
Ashley Childress

I didn’t explicitly track it this time around, which in hindsight feels like an obvious thing I should probably start doing. Noted for the next build—thanks!

For v2, I leaned pretty heavily on Claude and the Sonar MCP. After a rather aggressive cleanup pass (and an extra GHA scan for Sonar), most of the bigger issues are now caught ahead of time. I'm still working out the best way to make the review pass more reliable and automatic though.

The higher-reasoning models are doing a lot of the heavy lifting when it comes to getting anything close to quality output. One thing I’m pretty convinced of now is that running multiple adversarial reviews, each with different LLMs, should help a lot. That’s next on my list to experiment with.

 
Christie Cosky

"AI-generated tests can pass because they were written to satisfy the implementation, not challenge it."

I discovered the same thing earlier this year when using AI to write unit tests: the tests mirrored the code instead of validating it. Everything passed, even when the implementation was actually wrong.

I wonder if using TDD would result in better outcomes, but I haven't tried it yet myself. It's a concept I've read about, but have had a hard time figuring out how to put into practice.

 
Ashley Childress

I tested this some early on, but the AI ended up writing tests that were either incomplete, or code that was written to satisfy the tests rather than the functionality. You need an adversarial component of some kind to challenge the "quick solution". Most LLMs are trained to find the quickest path that looks correct, which is rarely the accurate one.

 
Christie Cosky

We have some job security for now then :D

In addition to manual verification of tests, I also have a Claude skill that checks each unit test's correctness. When I generate them, they all follow a specific method name pattern:

`<methodName>_when<Conditions>_<expectedBehavior>`

Then I have a Claude skill check that the method under test matches the first part of the pattern, that the condition setup matches the second part of the pattern, and that the assertions match the expected behavior from the last part of the pattern. It actually does find problems this way.
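For illustration, a minimal sketch of how such a naming-convention check could start (the regex and `parse_test_name` helper are hypothetical, not the actual Claude skill):

```python
import re

# Hypothetical sketch: split test names that follow the
# <methodName>_when<Conditions>_<expectedBehavior> convention into parts,
# so each part can be compared against the test body in a review step.
NAME_PATTERN = re.compile(
    r"^(?P<method>[a-z][A-Za-z0-9]*)"
    r"_when(?P<condition>[A-Z][A-Za-z0-9]*)"
    r"_(?P<behavior>[a-z][A-Za-z0-9]*)$"
)

def parse_test_name(name: str):
    """Split a conforming test name into its three parts, else return None."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None
```

For example, `parse_test_name("withdraw_whenBalanceTooLow_raisesError")` yields the method, condition, and behavior separately; a reviewer (or another LLM pass) can then check each piece against the test's setup and assertions.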

 
Ashley Childress

Thanks—that’s a helpful observation. I may adopt a similar pattern for a gap in my own testing while I work on improving my local implementation flow for Claude.

You might also want to take a look at Verdent. It’s a higher-cost option, but one of the more complete implementations I’ve tested so far. It automates several of the manual setup steps involved in this process. 😃

That said, most tools still require a decent amount of customization before this type of automation becomes practical in a production environment. It’s likely feasible long term, though I expect it to shift development responsibilities away from traditional implementation work toward configuring and guiding AI systems. Other factors—particularly operating cost—will likely influence how quickly broader AI implementation progresses.

In the meantime, it's definitely fun to experiment with!

 
Robert Cizmas

Hi, Ashley! Great post! I invite you to test etiq.ai; I think you'll love it. It's an integrity layer for AI-generated code: you can visualise your pipeline, debug fast, and test different lines of code.