The data

The AI code that didn't survive.

The whole industry sells AI coding on one promise: speed. Every tool ships a usage dashboard with a line climbing up and to the right. But when independent researchers actually measured the output, they kept finding the same thing the dashboards never show: a large share of AI-generated code is slower to ship, more duplicated, heavier with debt, and more likely to break or be rewritten.

Call it the survival gap: the distance between the code an AI tool generates, which is the only thing the meter counts, and the code that is still doing useful work in your repository weeks later. Below are four findings, from four independent sources, that map the gap. None of them are ours. None of them come from a company that sells an AI coding tool.

In a controlled trial, developers were 19% slower with AI while predicting they would be 24% faster, and still believed it had sped them up afterward.

Source: METR randomized controlled trial, 2025. The feeling of speed is not speed. That gap is the whole reason a usage dashboard cannot tell you whether the spend is working.

Four findings, one missing number

Four independent studies, four symptoms of the same blind spot.

Each measures a different stage of the lifecycle, from writing the code to shipping it. Read together they describe one thing no vendor dashboard reports: what happens to AI-generated code after the tokens are billed.

19% slower

with AI, while predicting 24% faster

A randomized controlled trial of 16 experienced open-source developers across 246 real issues on large, mature repositories (averaging 22,000+ stars and over a million lines of code). Each issue was randomly assigned to either allow AI tools, mostly Cursor Pro with Claude, or forbid them. Screens were recorded.

The developers forecast a 24% speedup. They came in 19% slower, and even afterward still believed AI had sped them up by about 20%. Felt velocity and measured velocity pointed in opposite directions. A productivity argument built on how fast the work feels is built on the wrong instrument.

Source: METR, 2025

12.3%

of changed lines copy-pasted, up from 8.3%

GitClear analyzed 211 million changed lines of code from January 2020 to December 2024, classifying each change as cloned (copy-pasted) versus moved (refactored and reused).

Copy-pasted lines climbed from 8.3% to 12.3% of changes between 2021 and 2024, while refactored code fell from 25% of changed lines to under 10%. The codebase is accreting duplicates instead of composing reuse, and duplicated code is precisely the kind that gets rewritten or removed later.

Source: GitClear AI Copilot Code Quality, 2025

88%

say AI added to their technical debt

Sonar's State of Code developer survey asked developers about AI's effect on their technical debt. (To be fair to the data, 93% also reported at least one positive impact.)

88% reported at least one negative impact on technical debt, and 53% pinned it on AI generating code that looked correct but was not reliable. Generated is not the same as reliable, and the difference is a bill that arrives later, after the tokens are long since paid.

Source: Sonar State of Code, 2025

7.2% less

delivery stability per 25% more AI adoption

Google's DORA 2024 Accelerate State of DevOps report modeled how AI adoption relates to software delivery performance across thousands of respondents.

DORA estimated that every 25% increase in AI adoption was associated with a 7.2% drop in delivery stability (and a 1.5% drop in throughput), which it hypothesized stems from larger batch sizes. More code, shipped in bigger chunks, more of it breaking. The input went up and the outcome went down.

Source: DORA (Google), 2024

To be fair

This is not an argument against AI coding. It is an argument against measuring it by volume.

The same sources that flag the problems also report real wins. In Sonar's survey 93% of developers said AI had at least one positive impact on their technical debt. DORA found AI lifts individual productivity, flow, and job satisfaction. Plenty of AI-generated code ships, survives, and earns its keep, and the teams that pair it with strong review and testing do measurably better.

So the contrarian claim here is narrow and defensible: AI raises the volume of code, and a measurable share of that extra volume does not survive. The mistake is not using AI. The mistake is reading a usage curve as if it were a return curve. Spend goes up on the dashboard whether the code lives or dies, so a rising line tells you nothing about whether the money is working.

The leak you can't see

You are paying per token and per seat for output, and a measurable share of it never survives.

Your usage dashboard shows none of it. Codelitics shows the share of your AI code that shipped, survived, and mattered, per tool.

The missing number

Every study is circling the same gap. None of them close it, because closing it means measuring survival.

Slower to ship. More cloned. More debt. Less stable delivery. Four different symptoms, one missing measurement underneath all of them: nobody is tracking what share of AI-generated code is still alive and load-bearing weeks after it was written. Codelitics exists to measure exactly that, and the vocabulary for it already has a name.

Return on Code rolls the survival question into three gates. Code Yield is Ship times Last times Matter: did the line get merged, was it still there after 30 to 90 days, and was it clean of reverts and tied to a goal. Code Half-Life measures how long a cohort of AI-authored code lasts before half of it is rewritten or deleted. Put those next to your bill and you get a cost per realized change instead of a cost per token.
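As a rough illustration of that arithmetic, here is a minimal Python sketch. It assumes each gate is a simple ratio of counts and that Code Half-Life can be read off a survival curve by linear interpolation. The text above does not spell out how Codelitics computes either, so every name here (Cohort, code_yield, cost_per_realized_change, half_life_days) is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Cohort:
    """Counts for one cohort of AI-authored changes (hypothetical fields)."""
    generated: int  # changes the tool produced
    shipped: int    # of those, changes that merged
    lasted: int     # of the shipped ones, still present after the window
    mattered: int   # of the lasting ones, not reverted and tied to a goal


def _ratio(numerator: int, denominator: int) -> float:
    # Treat an empty denominator as a failed gate rather than an error.
    return numerator / denominator if denominator else 0.0


def code_yield(c: Cohort) -> float:
    """Ship x Last x Matter, each gate assumed to be a plain ratio."""
    ship = _ratio(c.shipped, c.generated)
    last = _ratio(c.lasted, c.shipped)
    matter = _ratio(c.mattered, c.lasted)
    return ship * last * matter


def cost_per_realized_change(spend: float, c: Cohort) -> float:
    """Spend divided by the changes that passed all three gates."""
    realized = c.generated * code_yield(c)
    return spend / realized if realized else float("inf")


def half_life_days(curve: List[Tuple[int, float]]) -> Optional[float]:
    """Estimate when half the cohort is gone.

    `curve` holds (days since authoring, fraction still present), sorted by
    day. Returns None if the cohort never drops to 50% in the observed data.
    """
    prev_day, prev_frac = 0, 1.0
    for day, frac in curve:
        if frac <= 0.5:
            # Linear interpolation between the two surrounding observations.
            return prev_day + (prev_frac - 0.5) * (day - prev_day) / (prev_frac - frac)
        prev_day, prev_frac = day, frac
    return None
```

With made-up numbers: 1,000 generated changes, 600 shipped, 450 lasting, and 360 that mattered give a yield of 0.6 × 0.75 × 0.8 = 0.36. A spend of $1,800 then works out to $5 per realized change, against $1.80 per generated one.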

This is the difference between an input metric and an outcome metric, the same point made at length in "A token dashboard tells you the bill, not the yield." Acceptance rate and usage are inputs. Survival is the outcome. The studies above are four independent lines of evidence that the two have quietly come apart.

Velocity is a feeling, survival is a number

Stop reading a usage curve as if it were a return curve.

Codelitics puts a survival number next to your AI coding spend, repo-locally, across every tool. We install on one repo.

How you measure what survived

Survival is a repo-local fact, so it has to be measured in the repo, not in the meter.

A usage dashboard sees the moment of generation and never looks again. Survival lives in the git history afterward, so that is where Codelitics reads it. A per-seat agent records which AI tool authored which lines as the work happens, and the dashboard clones your in-scope repositories through a GitHub App or GitLab OAuth to follow those lines forward: what merged, what was still there weeks later, and what got reverted or rewritten.
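To make "follow those lines forward" concrete, here is a simplified Python sketch, not Codelitics' actual method. It assumes the agent's attribution is available as (tool, path, text) records, which is a hypothetical format, and it counts a line as surviving if the same text still exists in the same file at a later revision. The only real command it relies on is `git show <rev>:<path>`. A production tool would also need to handle moved, renamed, and reformatted lines.

```python
import subprocess
from collections import Counter
from typing import Callable, Dict, Iterable, NamedTuple


class TaggedLine(NamedTuple):
    """One AI-authored line as recorded at write time (hypothetical format)."""
    tool: str
    path: str
    text: str


def git_show(rev: str, path: str) -> str:
    """Return the contents of `path` at `rev`, or "" if it is absent there."""
    result = subprocess.run(
        ["git", "show", f"{rev}:{path}"], capture_output=True, text=True
    )
    return result.stdout if result.returncode == 0 else ""


def survival_rate(
    tagged: Iterable[TaggedLine],
    rev: str,
    read_file: Callable[[str, str], str] = git_show,
) -> float:
    """Fraction of tagged lines whose text is still in the same file at `rev`."""
    wanted_by_path: Dict[str, Counter] = {}
    for line in tagged:
        wanted_by_path.setdefault(line.path, Counter())[line.text.strip()] += 1

    total = sum(sum(c.values()) for c in wanted_by_path.values())
    if total == 0:
        return 0.0

    survived = 0
    for path, wanted in wanted_by_path.items():
        present = Counter(l.strip() for l in read_file(rev, path).splitlines())
        # Counter intersection keeps the minimum count per line, so a tagged
        # line cannot be matched more times than it appears in the file.
        survived += sum((wanted & present).values())
    return survived / total
```

Run from inside a repository, `survival_rate(tagged, "HEAD")` compares the recorded lines with the current tip. Passing a different `read_file` makes the function testable without a repository. Exact text matching overcounts common lines such as a lone closing brace, which is one reason this is only a sketch.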

That attribution is what the four studies could not do at the level a budget owner needs. Because the agent tags code by tool at the time it is written, the same survival math can be reported per tool, so you can see whether Cursor, Claude Code, or Copilot is the one earning its seat. It is the one number that can fall while your usage rises, which is exactly why no tool you are evaluating will show it to you.
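Once each line carries a tool tag, the per-tool breakdown is a simple grouping. The Python sketch below assumes a hypothetical input of (tool, survived) pairs, where the survival flag has already been decided by whatever check the measurement uses. It illustrates the grouping only, not Codelitics' implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def survival_by_tool(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each tool name to the share of its tagged lines that survived."""
    totals: Dict[str, int] = defaultdict(int)
    alive: Dict[str, int] = defaultdict(int)
    for tool, survived in records:
        totals[tool] += 1
        alive[tool] += int(survived)
    return {tool: alive[tool] / totals[tool] for tool in totals}
```

Comparing the result across two periods is what would show a tool whose usage rose while its survival rate fell.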

Per tool, after merge

See which AI tool's code actually survives, and which one is quietly leaking your budget.

Codelitics attributes code per tool and tracks it after merge, so renewal is a number, not a guess. Start with one repo.

By tool

Already spending on one tool? Start the survival question there.

The survival gap is tool-neutral, and so is the measurement. If most of your AI spend runs through one assistant, start here:

Claude Code ROI·GitHub Copilot ROI·Cursor ROI

AI code survival FAQ

What people ask about AI code that doesn't survive.

Is AI-generated code worse than human-written code?

Not categorically, and the honest answer is more useful than a yes or no. The same studies that flag problems also report real gains: in Sonar's survey 93% of developers said AI had at least one positive impact on their technical debt, and DORA found AI lifts individual productivity and job satisfaction. The narrower, truer reading is that AI raises the volume of code, and a measurable share of that extra volume is slower to ship, more duplicated, or more likely to be reworked. The risk is not that AI writes bad code; it is that velocity-only metrics hide the share that does not survive.

What is the survival gap in AI-generated code?

The survival gap is the distance between the code an AI tool generates, which usage dashboards and acceptance rates count, and the code that is still doing useful work in your repository weeks later. A token is billed at the moment of generation. Whether that line shipped, survived the next refactor, and stayed clean of reverts is a separate fact the meter never records. Codelitics measures that fact directly as survival rate and Code Half-Life.

Do studies really show AI makes developers slower?

One rigorous study does, in a specific setting. METR ran a randomized controlled trial with experienced open-source developers on mature codebases and found they took 19% longer with AI tools, despite predicting a 24% speedup. It does not prove AI slows everyone everywhere. It is strong evidence that the feeling of speed is not the same as speed, which is why felt productivity is not enough to justify the spend on its own.

What should I measure instead of acceptance rate or velocity?

Measure what survived. Acceptance rate counts a suggestion the instant it is taken, and velocity counts activity. Neither tells you whether the code shipped, lasted, and mattered. Return on Code rolls those three questions into Code Yield, a product of Ship times Last times Matter, and turns spend into a cost per realized change, so a calm bill with terrible yield finally looks as alarming as it should.

Why don't AI coding vendors publish code survival data?

Every vendor in the category sells the tool that generates the code, so their dashboards are built around generation, acceptance, and adoption, the numbers that rise when you use more. Survival is the one number that can fall while usage rises, which is why it is rarely the headline. That is also why a survival measurement worth trusting is an independent, repo-local one rather than a figure from the tool you are evaluating.