DORA Report: AI Code Review Time Jumps 441% | NeuralWired

By the NeuralWired Engineering Desk · Updated July 2026 · 11 min read

Your team ships AI-generated code faster than ever. Your review queue is where that speed goes to die. A 22,000-developer telemetry study from Faros AI puts an uncomfortable number on it: median time spent in code review rose 441.5% as AI adoption climbed. Google’s DORA research points in the same direction. If you’re an engineering leader who assumed AI code review would fix the bottleneck AI code generation created, the 2026 numbers say otherwise, and you need to see them before your next tooling decision.

In this article

The Real Numbers Behind AI Code Review
Why Review Time Is Exploding, Not Shrinking
The Benchmark Problem: Nobody Agrees What “Accurate” Means
What This Means If You Run an Engineering Org
The Contrarian Case: 19% Slower, Not 20% Faster
GitHub Code Quality’s July Launch: A Real Test Case
What to Actually Do About It
FAQ

The Real Numbers Behind AI Code Review

Every AI coding tool vendor is currently selling some version of the same promise: write code faster, review it faster, ship it faster. The generation half of that promise is real. The review half is where the story falls apart.

Faros AI’s “AI Engineering Report 2026: The Acceleration Whiplash” is the most current dataset available on this question. It draws on two years of telemetry from 22,000 developers across more than 4,000 teams, comparing each organization’s lowest AI adoption periods to its highest. The headline findings:

Median time to first PR review is up 156.6%
Average time spent in code review is up 199.6%
Median time in review overall is up 441.5%
AI code acceptance rate rose from 20% to 60%
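
A quick way to read those percentages: “up 441.5%” means the new figure is the baseline plus 441.5% of it, or roughly 5.4 times the original, not 4.4 times. Here is a minimal Python sketch of that conversion using only the figures above; the function and variable names are ours, not Faros AI’s.

```python
def increase_to_multiplier(percent_increase: float) -> float:
    """Convert 'up X%' into a 'times the baseline' multiplier."""
    return 1 + percent_increase / 100


# Headline increases as quoted in the list above.
headline_increases = {
    "median time to first PR review": 156.6,
    "average time spent in code review": 199.6,
    "median time in review overall": 441.5,
}

for metric, pct in headline_increases.items():
    print(f"{metric}: up {pct}% = {increase_to_multiplier(pct):.3f}x baseline")

# Acceptance rate is itself a share, so report both the change in
# percentage points and the ratio between the two values.
acceptance_before, acceptance_after = 20.0, 60.0
point_change = acceptance_after - acceptance_before
ratio = acceptance_after / acceptance_before
print(f"acceptance rate: +{point_change:.0f} percentage points, {ratio:.1f}x")
```

The same distinction applies to the acceptance rate: going from 20% to 60% is a 40-point rise, but a tripling in relative terms.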

Faros AI sells engineering analytics software built on DORA metrics, so treat this as vendor research with a stake in the outcome, not a neutral academic study. Still, the direction of the finding lines up with Google’s own 2025 DORA State of AI-assisted Software Development report, produced with GitHub and IT Revolution. DORA’s framing is that AI acts as an amplifier: it strengthens teams that already have solid engineering practices, and it exposes the weaknesses of teams that don’t. Roughly 90% of developers now use AI daily, according to the report, but about 30% say they have little to no trust in AI-generated code.

That distrust has a name in the DORA report: the “verification tax.” Time saved writing code gets spent auditing it instead, and that tax lands squarely on reviewers.

A note on the “4.2 hours vs 90 seconds” claim you might have seen elsewhere. That comparison doesn’t hold up against any primary source we checked. Real human review times range from roughly 4 hours at Google internally to 3 to 5 days at typical enterprise teams, and AI review tools themselves range from about 30 seconds (GitHub Copilot) to several minutes for deep-index tools like Greptile. We’re using the sourced numbers above instead.

Why Review Time Is Exploding, Not Shrinking

Kent Beck, the creator of Extreme Programming and a co-author of the Agile Manifesto, put it about as bluntly as anyone in the industry has:

“We’re accumulating code faster than we are accumulating trust.”
Kent Beck, “Trust Factory” newsletter, newsletter.kentbeck.com

That’s the whole problem in one sentence. AI-generated code is, by multiple accounts, superficially convincing. It’s idiomatic. It’s well named. It reads like something a competent engineer wrote. Which is exactly why surface-level review, the thing AI review tools are best at, becomes less useful over time: the bugs living in that code tend to be structural, not stylistic. Faros AI’s analysis makes the point directly, arguing that the engineers with the deepest system knowledge are the ones spending their most valuable hours unraveling plausible-looking code that should never have reached them in that state.

An independent academic study published on arXiv in December 2024, still the most cited empirical study of its kind as of mid-2026, tested an LLM-based automated review tool in real production repositories. Average PR closure time rose from 5 hours 52 minutes before the bot to 8 hours 20 minutes after, a statistically significant increase. Results varied by project. One project’s closure time dropped from 6 hours 6 minutes to 3 hours 7 minutes. Another rose from 20 hours 22 minutes to 30 hours 51 minutes. Roughly 73.8% of the tool’s comments were acted on, and developers reported a modest quality improvement, but the bot also introduced faulty reviews and irrelevant comments that added friction of its own.
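
To put those closure times on a common scale, the sketch below converts the durations quoted above to minutes and computes the percentage change for each case. It is plain arithmetic on the reported figures; the labels and helper names are our own, and the study’s significance testing is not reproduced here.

```python
import re


def to_minutes(text: str) -> int:
    """Parse a duration such as '5 hours 52 minutes' into total minutes."""
    match = re.fullmatch(r"(\d+) hours? (\d+) minutes?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def percent_change(before: str, after: str) -> float:
    """Percentage change from `before` to `after`; negative means faster."""
    b, a = to_minutes(before), to_minutes(after)
    return (a - b) / b * 100


# (before the bot, after the bot), as quoted in the paragraph above.
closure_times = {
    "average across projects": ("5 hours 52 minutes", "8 hours 20 minutes"),
    "project that improved": ("6 hours 6 minutes", "3 hours 7 minutes"),
    "project that slowed": ("20 hours 22 minutes", "30 hours 51 minutes"),
}

for label, (before, after) in closure_times.items():
    print(f"{label}: {percent_change(before, after):+.1f}%")
```

On these figures the average rose by about 42%, while the two individual projects moved by roughly 49% down and 51% up, which is why the per-project spread matters as much as the headline average.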

Stack Overflow’s 2025 Developer Survey backs this up from the sentiment side: 84% of developers use or plan to use AI tools, yet 66% say their biggest pain point is AI output that’s “almost right,” and 45% say debugging AI-generated code takes longer than debugging their own. Sonar’s State of Code 2026 survey of 1,149 developers found 96% don’t fully trust that AI-generated code is functionally correct, but only 48% say they always review it before committing. That gap between distrust and actual review discipline is worth sitting with.

The Benchmark Problem: Nobody Agrees What “Accurate” Means

Here’s the part that should worry anyone about to sign a contract with an AI code review vendor: the bug catch rate numbers those vendors publish don’t agree with each other, and they don’t agree with independent testing either.

Tool	Vendor-reported catch rate	Catch rate in another published benchmark
Greptile	82% (own 50-PR benchmark)	24% (Martian benchmark)
GitHub Copilot	54% (Greptile’s benchmark)	Not independently ranked in same test
CodeRabbit	44 to 51% (varies by benchmark)	46% (Macroscope’s ranking)
Cursor BugBot	Not separately vendor-reported	42% (Macroscope’s ranking)
Macroscope	Self-reported top performer	48% (its own ranking)

Source: Augment Code’s tool comparison, which flags the discrepancy directly, and buildmvpfast.com’s 2026 tool roundup. There is currently no independent, consensus benchmark for AI code review accuracy. Every number circulating in vendor decks was either run by the vendor or selected by the vendor. Treat any single “catch rate” claim as a marketing input, not a procurement fact, and run a short pilot against known bugs in your own codebase before you buy anything.
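
A pilot like that can be scored with very little code. The sketch below assumes you have a list of known bugs (file and line) and an export of the tool’s review comments in the same shape; the `Finding` type, the file paths, and the three-line matching tolerance are all hypothetical choices for illustration, not any vendor’s format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A location in the codebase; used for known bugs and tool comments alike."""
    file: str
    line: int


def matches(bug: Finding, comment: Finding, tolerance: int) -> bool:
    """A comment counts as a hit if it is in the same file, near the bug's line."""
    return bug.file == comment.file and abs(bug.line - comment.line) <= tolerance


def score_pilot(
    known_bugs: list[Finding], comments: list[Finding], tolerance: int = 3
) -> dict[str, float]:
    """Catch rate = known bugs flagged; precision = comments that hit a known bug."""
    caught = [b for b in known_bugs if any(matches(b, c, tolerance) for c in comments)]
    on_target = [c for c in comments if any(matches(b, c, tolerance) for b in known_bugs)]
    return {
        "catch_rate": len(caught) / len(known_bugs) if known_bugs else 0.0,
        "precision": len(on_target) / len(comments) if comments else 0.0,
    }


# Made-up pilot data: four bugs you already know about, five tool comments.
known_bugs = [
    Finding("billing/invoice.py", 120),
    Finding("billing/tax.py", 45),
    Finding("auth/session.py", 210),
    Finding("api/routes.py", 88),
]
tool_comments = [
    Finding("billing/invoice.py", 121),
    Finding("auth/session.py", 208),
    Finding("api/routes.py", 12),
    Finding("ui/form.py", 5),
    Finding("billing/tax.py", 300),
]

print(score_pilot(known_bugs, tool_comments))
# {'catch_rate': 0.5, 'precision': 0.4}
```

Tracking precision alongside catch rate matters here because a tool can raise its catch rate simply by commenting more, and the noise lands on the same reviewers who are already the bottleneck.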

CodeRabbit’s own analysis of 470 pull requests is vendor data too, though it describes the output of AI code generators rather than CodeRabbit’s own review tool. It found reviewers spend 91% more time reviewing AI-generated code than human-written code, with three times more readability problems and 75% more logic errors.

What This Means If You Run an Engineering Org

If you’re a VP of Engineering, a Director, or a staff engineer sitting on a tooling decision right now, here’s the shift that matters. Review is the bottleneck now, not code generation. If you’ve been measuring success by PRs merged or deployment frequency alone, you’re getting a misleading picture, because DORA’s and Faros’s data both show throughput metrics improving at the exact same time that stability metrics, change failure rate and rework rate especially, get worse.

DORA’s response to this was structural: the framework expanded from four metrics to five in 2024, adding “rework rate” specifically because AI-driven throughput gains were making the old four-metric picture insufficient. That’s the metric to start tracking alongside deployment frequency, not instead of it.
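
As a rough illustration of why the two numbers belong side by side, here is a sketch that computes deployment frequency and rework rate for two periods. It assumes rework rate is the share of deployments that were unplanned fixes for user-facing bugs, which is our reading of DORA’s description and worth checking against the report; the `Deployment` type and the sample counts are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    # Assumption: True when the deployment was unplanned and shipped
    # to fix a user-facing bug.
    unplanned_fix: bool


def summarize(deployments: list[Deployment], weeks: int) -> dict[str, float]:
    """Return deployment frequency and rework rate for one period."""
    total = len(deployments)
    fixes = sum(d.unplanned_fix for d in deployments)
    return {
        "deploys_per_week": total / weeks,
        "rework_rate": fixes / total if total else 0.0,
    }


# Hypothetical 13-week quarters: 40 deployments with 4 fixes,
# then 60 deployments with 15 fixes.
before = [Deployment(unplanned_fix=i < 4) for i in range(40)]
after = [Deployment(unplanned_fix=i < 15) for i in range(60)]

for label, period in (("before", before), ("after", after)):
    stats = summarize(period, weeks=13)
    print(
        f"{label}: {stats['deploys_per_week']:.2f} deploys/week, "
        f"rework rate {stats['rework_rate']:.0%}"
    )
```

In this invented example deployment frequency rises by half while rework rate goes from 10% to 25%, the pattern a throughput-only dashboard would report as a clean win.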

GitHub’s own product team has landed on a position that’s becoming the de facto industry norm: a human always owns the merge button. From GitHub’s official blog:

“AI augments developer judgment; it can’t replace it.”
GitHub Copilot code review product team, github.blog

The team’s interviews with developers found something specific worth stealing for your own workflow: running a Copilot self-review before opening a PR eliminated roughly a third of trivial back-and-forth comments. That’s the actual win available right now, catching the small stuff before a human ever sees the diff, not replacing the human’s judgment call on whether the change should exist at all.

Jon Wiggins, a machine learning engineer at Respondology, put the accountability question in plain terms:

“If an AI agent writes code, it’s on me to clean it up before my name shows up in git blame.”
Jon Wiggins, ML Engineer, Respondology · via github.blog

Any team that’s dropped the human merge gate entirely should be treated as an outlier taking on real production risk, not a leading indicator of where the industry is headed.

The Contrarian Case: 19% Slower, Not 20% Faster

The single strongest piece of contrarian evidence in this entire dataset comes from METR, the nonprofit Model Evaluation and Threat Research group. Its randomized controlled trial, reported in MIT Technology Review, found experienced developers believed AI made them 20% faster. Objective measurement of the same developers found they were actually 19% slower.

That’s not a survey. It’s a controlled study, which makes it much harder to wave away than the productivity claims coming out of vendor marketing. Mike Judge, a principal developer at the software consultancy Substantial, described the gap between perception and reality from the inside:

“I was complaining to people because I was like, ‘It’s helping me but I can’t figure out how to make it really help me a lot.’”
Mike Judge, Principal Developer, Substantial · via MIT Technology Review

Is the “AI review saves time” story realistic on a 2026 timeline? Not straightforwardly. The one controlled academic production study we found (the arXiv paper above) showed AI review increasing PR closure time. Any claim that AI review is a simple time saver needs that caveat attached, because in the best documented empirical test available, it wasn’t one.

GitHub Code Quality’s July Launch: A Real Test Case

There’s a genuinely useful stress test coming. GitHub Code Quality, the governance and quality gate product bundling CodeQL analysis with Copilot code review, moves from public preview to a paid, generally available product on July 20, 2026. More than 10,000 enterprises used the preview. Pricing lands at $10 per active committer per month on enabled repositories, plus usage-based consumption for AI-powered features like Copilot code review and Copilot Autofix.

Watch what happens to review time metrics at organizations adopting this over the next two quarters. If GitHub’s human-gated model actually closes the gap the data above describes, that’s the strongest real-world signal we’re likely to get all year.

What to Actually Do About It

Track rework rate and time-in-review, not just deployment frequency. A team that ships faster while rework climbs isn’t actually faster.
Keep a mandatory human merge gate. This is GitHub’s own stated product philosophy, not just an internal best practice.
Run AI self-review before human review, not instead of it. GitHub’s data shows this cuts trivial back-and-forth by roughly a third.
Pilot any review tool against your own codebase’s known bugs before trusting a vendor’s published catch rate.
Cap PR size. Multiple sources point to growing PR size, not tooling choice, as the actual driver of review slowdown.
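
Two of the items above, tracking time in review and capping PR size, are easy to automate once you can export PR records. A minimal sketch, assuming each record carries a line count plus review-requested and merge timestamps; the `PullRequest` type, the 400-line cap, and the sample data are hypothetical, and your Git host’s real export will look different.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass(frozen=True)
class PullRequest:
    number: int
    lines_changed: int
    review_requested: datetime
    merged: datetime


MAX_LINES_CHANGED = 400  # hypothetical cap; pick one that fits your team


def hours_in_review(pr: PullRequest) -> float:
    """Elapsed hours between the review request and the merge."""
    return (pr.merged - pr.review_requested).total_seconds() / 3600


def review_report(prs: list[PullRequest], max_lines: int = MAX_LINES_CHANGED) -> dict:
    """Median time in review plus the PR numbers that exceed the size cap."""
    return {
        "median_hours_in_review": median(hours_in_review(pr) for pr in prs),
        "oversized_prs": [pr.number for pr in prs if pr.lines_changed > max_lines],
    }


# Made-up sample: two small PRs and one large one.
sample = [
    PullRequest(1, 120, datetime(2026, 7, 1, 9), datetime(2026, 7, 1, 13)),
    PullRequest(2, 950, datetime(2026, 7, 1, 9), datetime(2026, 7, 2, 15)),
    PullRequest(3, 60, datetime(2026, 7, 2, 10), datetime(2026, 7, 2, 12)),
]

print(review_report(sample))
# {'median_hours_in_review': 4.0, 'oversized_prs': [2]}
```

Using the median rather than the mean keeps one 30-hour outlier from dominating the trend line, while the oversized list points at the PRs most likely to be causing it.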

FAQ

Does AI code review replace human code review?

No. Every credible source, including GitHub’s own product team, treats AI review as a first-pass filter for mechanical issues like typos and unused imports, while reserving architecture decisions and merge accountability for human reviewers. GitHub’s official position is that developers will always own the merge button.

How much time does AI code review actually save?

Results are mixed and contested. Vendors report time savings, but the broadest telemetry available, Faros AI’s 2026 data from 22,000 developers, found median time in review up 441.5% as AI-generated code volume outpaced human review capacity, and DORA’s 2025 report points in the same direction.

What is the most accurate AI code review tool?

There is no independent consensus benchmark. Vendor-run tests and independent benchmarks disagree sharply. Greptile scores an 82% bug catch rate on its own benchmark but 24% on the independent Martian benchmark. Pilot tools against your own codebase rather than trusting a published leaderboard.

Do AI code reviews catch more bugs than human reviewers?

AI reviewers are strong at mechanical pattern matching, things like missing awaits or unused variables, but consistently weaker at architectural and cross-file reasoning unless built around full-codebase indexing. No tool currently matches an experienced human reviewer’s judgment on whether a feature should exist at all.

Is AI-generated code more likely to have bugs than human-written code?

Yes, per multiple 2026 sources. CodeRabbit’s analysis of 470 pull requests found AI-generated code produced 75% more logic errors and three times more readability problems than human-written code, alongside rising bug rates as AI-code acceptance climbed from 20% to 60%.

Where This Goes Next

Here’s what the 2026 data actually tells you, stripped of the vendor gloss: AI hasn’t made code review faster. It’s made code review the bottleneck the rest of the pipeline is now waiting on, and the tools built to fix that problem haven’t closed the gap yet. Median time-in-review is up 441.5% at the same moment AI review tooling has proliferated across the industry. Those two facts sitting next to each other are the story.

Over the next 6 to 18 months, watch three things. First, whether GitHub Code Quality’s July 20 general availability launch actually moves review-time metrics at scale, since it’s the first major product to bundle static analysis and AI review under one governance umbrella with real enterprise adoption behind it. Second, whether an independent, non-vendor benchmark for AI review accuracy finally emerges, because right now buyers are flying blind. Third, whether DORA’s rework rate metric becomes standard practice at more organizations, since it’s currently the best early warning signal available for exactly the kind of quality debt this article describes.

Our read: this signals a market correction is coming for AI code review vendors who’ve been selling speed as the headline benefit. The winners over the next year will be the tools that reduce rework, not the ones with the flashiest catch-rate slide.
