
Are bugs and incidents inevitable with AI coding agents?

What specific kind of bugs is AI more likely to generate? Do some categories of bugs show up more often? How severe are they? How is this impacting production environments?


SPONSORED BY CODERABBIT

Companies are looking to harness agentic code generators to get software built faster. But for every story of increased developer productivity or greater code base understanding, there’s a story about creating more bugs and the increased likelihood of production outages.

Here at CodeRabbit, we wanted to know if the problems people have been seeing are real and, if so, how bad they are. We’ve seen data and studies about this same question, but many of them are just qualitative surveys sharing vibes about vibe coding. This doesn’t show us a path to a solution, only a perception.

We wanted something a little more actionable with actual data. What specific kind of bugs is AI more likely to generate? Do some categories of bugs show up more often? How severe are they? How is this impacting production environments?

In this article, we’ll talk about the research we did, what it means for you as a developer, and how you can mitigate the mistakes that LLMs make.

What our research says

To find answers to our questions, we scanned 470 open-access GitHub repos to create our State of AI vs. Human Code Generation Report. We looked for signals indicating whether pull requests were AI-co-authored or human-created, such as commit messages or agentic IDE files.

What we found is that there are some bugs that humans create more often and some that AI creates more often. For example, humans create more typos and difficult-to-test code than AI. But overall, AI created 1.7 times as many bugs as humans. Code generation tools promise speed but get tripped up by the errors they introduce. It’s not just little bugs: AI created 1.3-1.7 times more critical and major issues.

The biggest issues lay in logic and correctness. AI-created PRs had 75% more of these errors, adding up to 194 instances per hundred PRs. This includes logic mistakes, dependency and configuration errors, and errors in control flow. Errors like these are the easiest to overlook in a code review, as they can look like reasonable code unless you walk through it to understand what it actually does.
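
To make that concrete, here is a hypothetical illustration, not drawn from the report's data, of the kind of logic error that reads as perfectly reasonable code in a diff: a boundary check with a subtly wrong operator.

```python
# Hypothetical illustration (not from the report): a boundary check that
# reads plausibly in a diff but is wrong for almost every input.
from datetime import date

def is_discount_active(today: date, start: date, end: date) -> bool:
    # Intended: active only between start and end, inclusive.
    # Bug: `or` makes this true for nearly every date; it should be
    # `start <= today <= end`.
    return today >= start or today <= end

def apply_discount(price: float, today: date, start: date, end: date) -> float:
    if is_discount_active(today, start, end):
        return round(price * 0.9, 2)
    return price
```

Nothing here would trip a linter or a type checker; only a reviewer (or a test) who reasons through the condition will notice that the discount never turns off.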

Logic and correctness issues can cause serious problems in production: the kinds of outages that you have to report to shareholders. We found that 2025 had a higher rate of outages and other incidents, even beyond what we’ve heard about in the news. While we can’t tie all of those outages to AI on a one-to-one basis, this was the year that AI coding went mainstream.

We also found a number of other issues that, while they may not disable your app, were alarming:

  • Security issues: AI included bugs like improper password handling and insecure object references at a 1.5-2x greater rate than human coders.
  • Performance issues: We didn’t see a lot of these, but those that we found were heavily AI-created. Excessive I/O operations were ~8x higher in AI code.
  • Concurrency and dependency correctness: AI was twice as likely to make these mistakes, which include misuse of concurrency primitives, incorrect ordering, and dependency flow errors.
  • Error handling: AI-generated PRs were almost twice as likely to skip checks for errors and exceptions, missing null-pointer checks, early returns, and other proactive defensive coding practices (see the sketch after this list).
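
As a concrete illustration of that last point, here is a hedged sketch with invented names (User, send_welcome_email, onboard_user_*) showing the difference between code that skips the guards and code that includes them.

```python
# Hypothetical sketch of the error-handling gap described above; the names
# are illustrative, not taken from the report.
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    id: int
    email: str

def send_welcome_email(address: str) -> None:
    print(f"sending welcome email to {address}")  # stand-in for a real mailer

def onboard_user_unsafe(user: Optional[User]) -> int:
    # No guard: if the lookup returned None, this raises AttributeError at runtime.
    send_welcome_email(user.email)
    return user.id

def onboard_user_defensive(user: Optional[User]) -> int:
    if user is None:                  # early return guards the missing-user case
        raise ValueError("user not found")
    if not user.email:                # defensive check before the side effect
        raise ValueError(f"user {user.id} has no email on file")
    send_welcome_email(user.email)
    return user.id
```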

The single biggest difference between AI and human code was readability: AI code had three times as many readability issues as human code, including 2.66x more formatting problems and 2x more naming inconsistencies. While these aren’t the issues that will take your software offline, they will make it harder to debug the ones that can.
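
For a sense of what those readability findings look like in practice, here is a small invented example (not from the report) of the naming and formatting drift that tends to accumulate when different prompts generate different parts of a module.

```python
# Invented example: three functions generated in three sessions, each fine in
# isolation, but the module ends up mixing naming and formatting conventions.

def getUserRecord(user_id):           # camelCase function in a Python codebase
    return {"id": user_id, "active": True}

def fetch_user_prefs(userId):         # snake_case function, camelCase argument
    return {"theme": "dark", "userId": userId}

def Load_User_History( user_id ):     # ad-hoc capitalization and spacing
    return []
```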

Why errors happen with coding agents

Major errors happen largely because these coding agents are primarily trained on next-token prediction over large swaths of training data. That training data includes large numbers of open-source or otherwise insecure code repositories, but it doesn’t include your code base. That is, any given LLM that you use is going to lack the necessary context to write the correct code.

When you try to provide that context as a system prompt or `agents.md` file, that may work depending on the LLM or agentic harness you’re using. But eventually, the AI tool will need to compact the context or use a sliding window strategy to manage it efficiently. At the end of the day, though, you’re dropping information. If you have a task list where the agent is supposed to create code, review it, and check it off when it's done, eventually it forgets. It starts forgetting more and more along the way until the point where you have to stop it and start over.
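
To make the "dropping information" point concrete, here is a minimal sketch of the sliding-window idea. It is not any particular tool's implementation, and the word-count "tokenizer" is a deliberate simplification.

```python
# Minimal sliding-window sketch: keep the most recent messages that fit the
# budget; everything older is silently dropped.
def trim_to_window(messages: list[str], max_tokens: int = 2000) -> list[str]:
    kept: list[str] = []
    used = 0
    for message in reversed(messages):      # walk newest -> oldest
        cost = len(message.split())         # crude stand-in for a real tokenizer
        if used + cost > max_tokens:
            break                           # older messages fall out of context
        kept.append(message)
        used += cost
    return list(reversed(kept))

# If the task checklist was the very first message, a long enough session
# pushes it out of the window, and the agent "forgets" to check items off.
```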

We’re past the days of code completion and cut and pasting from chat windows. People are using AI agents and running them autonomously now, sometimes for very long periods of time. Any mistakes—hallucinations, errors in context, even slight missteps—compound over the running time of the agent. By the end, those mistakes are baked into the code.

Agentic coding tools make generating code incredibly easy. To a certain degree, it's fun to be able to magically drop 500 lines of code in a minute. You’ve got five windows going, five different things being implemented at the same time. No idea what any of them are building, but they're all being built right now.

Eventually, though, someone will need to make sure that code works, to ensure that only quality code hits the production servers.

Why AI code is so hard to review

There’s a joke that if you want a lot of comments, make a PR with 10 lines of code; if you want it approved immediately, commit 500 lines. This is the law of triviality: small changes get more scrutiny than big ones. With agentic code generators, it becomes very easy to produce these huge commits with massive diffs.

Massive commits combined with hard-to-read code make it very easy for serious logic and correctness errors to slip through. This is where the readability problem compounds: AI creates more surrounding harness code and little inline comments, so there's just a lot more to read. Unless someone (preferably multiple someones) is combing through every single line of code on these commits, you could be creating tech debt at a scale not previously imagined.

Think of a code base over the lifetime of a company. Early-stage companies have a mentality of moving fast and getting the software out there, but maintainability, complexity, and readability issues compound over time. That debt may not cause the outage, but it will make the outage harder to fix. Eventually, the tech debt has to be paid off. Either the company dies or somebody has to rewrite everything because nobody can follow what any of the code is doing.

What you can do to stop errors

People want to use agentic coding tools and get the productivity gains. But it’s important to use them in a way that mitigates some of the potential downstream effects and prevents AI-generated errors from affecting your uptime. At every stage in the process, there are things you can do to make the end result better.

Pre-plan

Before starting out, do as much pre-planning as you can, and read up on the best practices for these tools. Personally, I like the trend of spec-driven development. It forces you to have a clearly laid out plan and thoroughly consider the requirements, design, and functionality of the end software that you want. This crystallizes the context that you have about the code into something the code generation agent can use. Add other pieces of context: style guidelines, documentation about the code base, and more.

Use the best LLMs for each task

While everyone wants to jump to the latest and greatest language models, at CodeRabbit we don’t believe you should let your users choose their own LLMs. Models are diverging, and when you switch between them, your prompts may not behave the same way: the focus of the model may shift, it may generate more of certain types of errors, or it may interpret existing prompts differently. Just because you know how to prompt one model doesn’t mean you know how to prompt another. We recommend using a coding tool that benchmarks the models and assigns the best one to the task at hand, or reading benchmarks yourself to understand which model to use for each task and how to prompt it.

Focus on small tasks

Once you start running the agent, smaller is better. Break tasks into the smallest possible chunks. Actively engage with the agent and ask questions; don’t just let it burn tokens for hours. On the flip side, create small commits that can be easily digested by your reviewers. People should be able to understand the scope of a given PR. The hype of long-running agents is a sales tactic, and engineers using these tools need to be clear-eyed and pragmatic.

Review AI-assisted PRs differently

When you approach a PR that AI assisted with, go in knowing that there will be more issues there. Know the types of errors that AI produces. You still need to review and understand the code like you would with any human-produced commit. It’s a hard problem because people don’t scale that well, so consider some tooling that catches problems in commits or provides summaries.

Leverage tools and systems to help

Your post-commit tools—build, test, and deploy—are going to be more important. If you have QA checklists, follow them closely. If you don’t have a checklist, make one. Sometimes just adding potential issues to the checklist will keep them top-of-mind. Review your code standards and enforce them in reviews. Instrument unit tests, use static analysis tools, and ensure you have solid observability in place. Or better yet, fight AI with AI by leveraging AI in reviews and testing. These are all good software engineering practices, but companies often neglect these tools in the name of speed. If you’re using AI-generated code, you can’t do that anymore.
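
As one small, hedged example of what that safety net can look like, here is a pytest-style test that would have caught the boundary bug from the earlier discount sketch (same hypothetical function, with the corrected logic under test).

```python
# test_discounts.py -- run with pytest; the function under test is the
# corrected version of the hypothetical discount check from earlier.
from datetime import date

def is_discount_active(today: date, start: date, end: date) -> bool:
    return start <= today <= end          # corrected inclusive boundary check

def test_inactive_outside_window():
    start, end = date(2025, 6, 1), date(2025, 6, 30)
    assert not is_discount_active(date(2025, 5, 31), start, end)
    assert not is_discount_active(date(2025, 7, 15), start, end)

def test_active_inside_window():
    start, end = date(2025, 6, 1), date(2025, 6, 30)
    assert is_discount_active(date(2025, 6, 1), start, end)
    assert is_discount_active(date(2025, 6, 30), start, end)
```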

Less haste, more speed

2025 saw Google and Microsoft bragging about the percentage of their code base that was AI-generated. This speed and efficiency was meant to show how productive they were. But lines of code has never been a good metric for human productivity, so why would we think it’s valid for AI?

These metrics are going to look increasingly irrelevant as companies take into account the downstream effects of their code. You’ll need to account for the holistic costs and savings of AI: not just lines of code per developer, but review time, incidents, and maintenance load.

If 2025 was the year of AI coding speed, 2026 is going to be the year of AI coding quality.

Save your dev team’s sanity this year with better code review tools. Sign up for a 14-day CodeRabbit trial.

