The pillar guide

    How to measure the ROI of AI coding tools.

    You can see the AI tooling bill. You cannot see what it bought. Vendor dashboards report inputs (tokens, credits, acceptance rate, active seats), not outcomes. This guide lays out a repeatable method for measuring the realized Return on Code of every AI coding tool your team runs: what shipped, what lasted, what mattered, and what each realized change actually cost.

    Spend is visible. Value is not.
    Finance sees the invoice. Engineering sees the editor. Neither side can answer the only question that matters: of the code these tools generated, how much shipped, stayed in the tree, and moved a goal. The billing model shift makes the gap urgent. GitHub Copilot moved to usage-based, token-metered billing on June 1, 2026, and power users projected agentic bills jumping 10x to 50x. When the bill scales with usage, measuring what usage returns stops being optional.

    The method

    Five steps from a token bill to a defensible ROI number.

    The method moves from attribution to outcome to cost. Steps two through four are the three gates of Code Yield, which multiplies the Ship, Last, and Matter rates together instead of averaging them, so a strong gate cannot paper over a weak one. The fifth step turns that yield into money.
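
    To see why a product behaves differently from an average, here is a minimal sketch. The function names and the three rates are illustrative assumptions, not a Codelitics API; the only thing taken from the guide is that Code Yield multiplies the Ship, Last, and Matter rates.

```python
def code_yield(ship: float, last: float, matter: float) -> float:
    """Multiply the three gate rates; each must be a fraction in [0, 1]."""
    for name, rate in (("ship", ship), ("last", last), ("matter", matter)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} rate must be between 0 and 1, got {rate}")
    return ship * last * matter


def mean_of_gates(ship: float, last: float, matter: float) -> float:
    """The average the guide warns against, shown only for contrast."""
    return (ship + last + matter) / 3


# Hypothetical rates: strong Ship and Matter gates, weak Last gate.
strong_weak = (0.9, 0.2, 0.9)
print(round(code_yield(*strong_weak), 3))     # 0.162
print(round(mean_of_gates(*strong_weak), 3))  # 0.667
```

    With these made-up rates the average still looks healthy at about 0.67, while the product drops to about 0.16 because most of the shipped code did not last.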

    Attribute AI-authored code per tool.

    Start by knowing which lines came from which tool. Codelitics captures this on the developer machine through git hooks and plugins for the AI tools in use, so a change is tagged with the tool that produced it before it ever reaches a pull request.
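
    One way to picture per-tool attribution is a commit-message trailer that a git hook could append. The sketch below is an assumption for illustration only: the `AI-Tool` trailer name and both helper functions are hypothetical, and this is not a description of how Codelitics tags changes.

```python
from collections import Counter
from typing import Optional

TRAILER_KEY = "ai-tool"  # hypothetical trailer name, not a Codelitics convention


def tool_from_message(message: str) -> Optional[str]:
    """Return the tool named in the trailer block (the last paragraph), if any."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return None  # a subject line alone carries no trailer block
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == TRAILER_KEY and value.strip():
            return value.strip().lower()
    return None


def changes_per_tool(messages: list) -> Counter:
    """Count commits per tool; untagged commits are grouped as 'unattributed'."""
    return Counter(tool_from_message(m) or "unattributed" for m in messages)


# Made-up commit messages.
messages = [
    "Add retry to fetcher\n\nWraps the call in backoff.\n\nAI-Tool: cursor",
    "Fix typo",
    "Refactor parser\n\nAI-Tool: claude-code",
    "Tune cache\n\nAI-Tool: cursor",
]
counts = changes_per_tool(messages)
print(counts["cursor"], counts["claude-code"], counts["unattributed"])  # 2 1 1
```

    Keeping an explicit "unattributed" bucket matters: untagged work should be visible as a gap in coverage, not silently counted as human-written.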

    Ship: did it reach the default branch?

    Generated code that never merges is cost without output. The first gate is whether an AI-authored change actually landed on the default branch, not whether it was accepted into a buffer or left in a draft.
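
    A minimal sketch of the Ship gate, assuming you already hold the commit IDs of attributed changes and the set of commits reachable from the default branch (for example, the output of `git rev-list` on that branch). The function name and the short IDs are hypothetical.

```python
def ship_rate(change_shas: list, default_branch_shas: set) -> float:
    """Fraction of attributed changes whose commit is reachable from the default branch."""
    if not change_shas:
        return 0.0
    shipped = sum(1 for sha in change_shas if sha in default_branch_shas)
    return shipped / len(change_shas)


# Made-up short IDs for illustration.
default_branch = {"a1f3", "b27c", "c9d0", "d4e8"}
attributed = ["a1f3", "c9d0", "e5aa", "f6bb"]
print(ship_rate(attributed, default_branch))  # 0.5
```

    A plain ID match undercounts when a team squash-merges or rebases, because those workflows rewrite commit IDs. In that case the match would have to go through the pull request or the changed lines instead.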

    Last: is it still load-bearing at 30 and 90 days?

    Shipped is not the same as durable. Track the survival rate of each cohort of merged AI lines and watch its Code Half-Life: the time until half of it has been rewritten, reverted, or deleted.
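
    The survival rate and Code Half-Life of one cohort can be sketched as follows. The input shape is an assumption: one entry per merged line, holding the age in days at which the line was rewritten, reverted, or deleted, or `None` if it is still in the tree. The cohort values are made up.

```python
from typing import Optional


def survival_rate(removed_after_days: list, day: int) -> float:
    """Share of the cohort still in the tree `day` days after merge."""
    if not removed_after_days:
        raise ValueError("cohort is empty")
    alive = sum(1 for d in removed_after_days if d is None or d > day)
    return alive / len(removed_after_days)


def code_half_life(removed_after_days: list, observed_days: int) -> Optional[int]:
    """First day on which at most half the cohort survives, or None if not reached."""
    for day in range(observed_days + 1):
        if survival_rate(removed_after_days, day) <= 0.5:
            return day
    return None


# Ten merged lines: five were removed at the given ages, five are still present.
cohort = [5, 12, 20, 45, 60, None, None, None, None, None]
print(survival_rate(cohort, 30))   # 0.7
print(survival_rate(cohort, 90))   # 0.5
print(code_half_life(cohort, 90))  # 60
```

    The sketch assumes the whole cohort has been observed for at least `observed_days`. A cohort merged 40 days ago cannot yet report a 90-day survival rate, and a `None` half-life means "not reached yet", not "never".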

    Matter: was it tied to a goal and incident-free?

    Code can survive and still be noise. The Matter gate asks whether a change mapped to a tracked goal and stayed clear of incidents and rollbacks, so churn that merely accumulates does not read as value.
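
    The Matter gate reduces to a predicate over each surviving change. The record fields below are hypothetical stand-ins for whatever your tracker and incident tooling expose; the rule itself (tied to a goal, no incident, no rollback) is the one stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    change_id: str
    goal_id: Optional[str]   # tracker reference, None if the change maps to no goal
    caused_incident: bool
    rolled_back: bool


def matters(change: Change) -> bool:
    """True only if the change maps to a goal and stayed clear of incidents and rollbacks."""
    return (
        change.goal_id is not None
        and not change.caused_incident
        and not change.rolled_back
    )


def matter_rate(survivors: list) -> float:
    """Share of surviving changes that pass the Matter gate."""
    if not survivors:
        return 0.0
    return sum(1 for c in survivors if matters(c)) / len(survivors)


# Made-up survivors.
survivors = [
    Change("c1", "GOAL-12", False, False),
    Change("c2", None, False, False),       # survived, but tied to no goal
    Change("c3", "GOAL-12", True, False),   # tied to a goal, caused an incident
    Change("c4", "GOAL-40", False, False),
]
print(matter_rate(survivors))  # 0.5
```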

    Cost: divide spend plus verification by realized changes.

    Add the tool spend and the verification tax, the human review and rework time spent making AI output safe to ship, then divide by the count of changes that shipped, lasted, and mattered. That ratio is your cost per realized change.
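
    The cost step can be sketched in two small functions. Pricing the verification tax as hours times a loaded hourly rate is an assumption made here for illustration, and every figure in the example is made up.

```python
def verification_tax(review_hours: float, rework_hours: float,
                     loaded_hourly_rate: float) -> float:
    """Human time spent making AI output safe to ship, priced at a loaded rate."""
    return (review_hours + rework_hours) * loaded_hourly_rate


def cost_per_realized_change(tool_spend: float, verification: float,
                             realized_changes: int) -> float:
    """(spend + verification tax) / changes that shipped, lasted, and mattered."""
    if realized_changes <= 0:
        raise ValueError("no realized changes: the ratio is undefined")
    return (tool_spend + verification) / realized_changes


# Hypothetical quarter: 120 review hours, 80 rework hours, $100 per loaded hour.
tax = verification_tax(review_hours=120, rework_hours=80, loaded_hourly_rate=100)
print(tax)                                         # 20000
print(cost_per_realized_change(40_000, tax, 500))  # 120.0
```

    Raising an error when nothing was realized is deliberate: a tool with spend and zero realized changes has no finite cost per realized change, and reporting zero would make it look free.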

    Read the same five steps applied to one tool on the Claude Code, GitHub Copilot, and Cursor ROI pages.

    Worked example

    Carrying spend through to a cost per realized change.

    The numbers below are illustrative, not a measured benchmark or a customer result. They show the shape of the calculation. Say a 30-engineer team runs one AI coding tool for a quarter.

    • Example: tool spend. 30 seats with metered usage come to $90,000 for the quarter.
    • Example: AI-authored changes attempted. Plugins attribute 4,000 AI-authored changes started in the editor.
    • Example: Ship. 2,600 of those merge to the default branch (65%). The other 1,400 were spend with no output.
    • Example: Last. At 90 days, 1,820 of the merged changes are still load-bearing (a 70% survival rate on what shipped).
    • Example: Matter. 1,500 of the survivors (about 82%) were tied to a tracked goal and stayed incident-free. These are the realized changes.
    • Example: verification tax. Review and rework to make this output safe to ship cost roughly $30,000 of engineering time.

    Example: naive cost

    $22.50

    $90,000 in spend divided by the 4,000 changes attempted. This is the number a usage dashboard implies, and it counts work that never shipped or did not last.

    Example: cost per realized change

    $80.00

    $90,000 spend plus $30,000 verification tax, divided by the 1,500 changes that shipped, lasted, and mattered. The true cost is more than three times the naive figure, and now it is comparable across tools.

    The gap between $22.50 and $80.00 is the part vendor dashboards cannot show. The canonical definition of cost per realized change lives in the glossary.
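
    The arithmetic of the worked example can be reproduced in a few lines. The inputs are the illustrative figures above; the function name and the dictionary keys are assumptions made for this sketch.

```python
def funnel(spend, verification, attempted, shipped, lasted, mattered):
    """Carry one tool's quarter from attempted changes to cost per realized change."""
    ship = shipped / attempted
    last = lasted / shipped
    matter = mattered / lasted
    naive = spend / attempted
    realized = (spend + verification) / mattered
    return {
        "ship": ship,
        "last": last,
        "matter": matter,
        "code_yield": ship * last * matter,  # equals mattered / attempted
        "naive_cost": naive,
        "cost_per_realized_change": realized,
        "multiple_of_naive": realized / naive,
    }


result = funnel(spend=90_000, verification=30_000,
                attempted=4_000, shipped=2_600, lasted=1_820, mattered=1_500)

print(f"{result['ship']:.1%} {result['last']:.1%} {result['matter']:.1%}")  # 65.0% 70.0% 82.4%
print(f"{result['code_yield']:.1%}")                  # 37.5%
print(f"${result['naive_cost']:.2f}")                 # $22.50
print(f"${result['cost_per_realized_change']:.2f}")   # $80.00
print(f"{result['multiple_of_naive']:.2f}x")          # 3.56x
```

    The three gate rates multiply out to a Code Yield of 37.5%, which is the 1,500 realized changes divided by the 4,000 attempted.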

    Why the usual numbers mislead

    Three popular metrics that do not measure return.

    Each of these is easy to read and easy to game. None of them tracks whether code survived and mattered, which is the only thing ROI can be built on.

    Acceptance rate

    It measures a keypress, not survival.

    Acceptance counts the moment a suggestion was kept in the editor. It says nothing about whether that code merged, survived review, or was still in the tree a month later. A high acceptance rate and a low survival rate can describe the same week.

    Token and credit dashboards

    They measure consumption, not outcome.

    Native dashboards report tokens spent, credits burned, and seats active. Those are inputs. Optimizing them rewards tokenmaxxing: running more generation to move a usage chart, with no link to whether anything load-bearing came out the other end.

    Perceived speedup

    The feeling is biased upward.

    Self-reported velocity is unreliable. In a controlled study, experienced developers were measured about 19% slower on real tasks while expecting to be about 24% faster, a gap of more than forty points between perception and measurement.

    The figure comes from a controlled study by METR (METR, 2026). It is the clearest argument for measuring outcomes instead of trusting the feeling of speed.

    Measure AI coding ROI FAQ

    What leaders ask before they trust the number.

    How do I measure AI coding ROI?
    Measure outcomes, not inputs. Attribute AI-authored code per tool, then run each change through three gates: did it ship to the default branch, does it last at 30 and 90 days, and did it matter (tied to a goal and incident-free). Roll those into Code Yield, then divide tool spend plus the verification tax by the changes that cleared all three gates to get your cost per realized change. That ratio, tracked per tool and per team, is the return on your AI coding spend.
    What is a good survival rate?
    There is no published industry benchmark for AI-code survival yet, so we do not quote one. The useful comparison is internal: measure the 30- and 90-day survival rates of each tool and team against your own history and against each other on comparable work. A rate that drops sharply between 30 and 90 days signals churn that an acceptance-rate chart would never surface, regardless of any absolute number.
    Why are tokens and credits not a measure of ROI?
    Tokens, credits, and active seats are consumption, not outcome. They tell you how much you spent, not what survived. Optimizing them encourages running more generation to move a usage chart while nothing load-bearing ships. ROI requires connecting that spend to realized changes: code that shipped, lasted, and mattered. Codelitics reads the local AI activity (sessions, tokens, edit checkpoints) and the repository so the consumption number sits next to the outcome it bought.
    Do I need to change how my team works to measure this?
    No. Codelitics installs a per-seat agent on each developer machine (a CLI runtime, plugins for the AI tools you already run, git hooks, and a local database) and connects to your repositories through a GitHub App or GitLab OAuth. Developers keep using the same tools and the same branch workflow. You control which repositories and tools are in scope, and every figure is exportable and traceable to how it was computed.
    How do I compare two AI coding tools fairly?
    Stratify before you compare. The same tool can look strong on greenfield work and weak on a legacy service, so compare each tool on comparable cohorts: similar repositories, change sizes, and task types over the same window. Then read tool yield and cost per realized change side by side. Comparing a tool used for prototyping against one used on a critical path without stratifying produces a misleading verdict.
    Does the public Return on Code score rank individual developers?
    No. The conformant Return on Code score is computed at the team and tool level and never ranks individuals. Individual-level views do exist, but they are opt-in, sit outside the conformant score, and are governed by your own policy (for example GDPR or works-council agreements). The headline number for measuring AI coding ROI is always team and tool, not person.
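
    The FAQ answer on comparing two tools fairly can be made concrete with a small sketch. The tool names, strata, and counts below are made up, and "yield" here is simply realized changes divided by attempted changes. The point is that a pooled comparison and a stratified one can disagree.

```python
from collections import defaultdict

# (tool, stratum, attempted, realized): all values are made up for illustration.
rows = [
    ("tool_a", "greenfield", 800, 400),
    ("tool_a", "legacy",     200,  40),
    ("tool_b", "greenfield", 200, 110),
    ("tool_b", "legacy",     800, 200),
]


def pooled_yield(rows):
    """Yield per tool with all work lumped together."""
    totals = defaultdict(lambda: [0, 0])
    for tool, _stratum, attempted, realized in rows:
        totals[tool][0] += attempted
        totals[tool][1] += realized
    return {tool: realized / attempted
            for tool, (attempted, realized) in totals.items()}


def stratified_yield(rows):
    """Yield per (tool, stratum), so like is compared with like."""
    return {(tool, stratum): realized / attempted
            for tool, stratum, attempted, realized in rows}


pooled = pooled_yield(rows)
by_stratum = stratified_yield(rows)
print(pooled["tool_a"], pooled["tool_b"])  # 0.44 0.31
print(by_stratum[("tool_a", "greenfield")], by_stratum[("tool_b", "greenfield")])  # 0.5 0.55
print(by_stratum[("tool_a", "legacy")], by_stratum[("tool_b", "legacy")])          # 0.2 0.25
```

    In this made-up data, tool_a wins the pooled comparison only because most of its work was greenfield. Within each stratum tool_b has the higher yield, which is the misleading verdict that stratifying is meant to prevent.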

    Private beta

    See what your AI coding spend actually ships.

    We install on one repo and show you how much of your AI code shipped, lasted, and mattered, with a cost per realized change you can trace.