The standard

    The Return on Code standard.

    An open, vendor-neutral way to measure the realized return on AI-generated code: not what was produced, but how much of it shipped, survived, and was worth it. This is the framework behind Codelitics, written down so you can hold any AI-code measurement, ours included, to the same bar.

    Why it exists

    The bill is visible. What it bought is not.

    Teams now spend hundreds to thousands of dollars per engineer per month on AI coding tools. Every tool reports what it generates: suggestions offered, lines accepted, tokens burned. None reports what survived and was worth it. The vacuum gets filled with vibes.

    In a controlled study by METR, experienced developers were about 19% slower on real tasks while expecting to be about 24% faster, and even afterward believed they had been sped up. Perception is not measurement. The Return on Code standard answers one question instead: of every dollar spent on AI coding, how much became durable, verified, goal-linked code, and how does that break down across the tools you pay for?

    The principles

    Six rules a measurement must respect to be conformant.

    A measurement standard lives or dies on whether the people being measured trust it. These rules are binding: a measurement that violates one is not Return-on-Code conformant.

    01

    Outcome over output

    Anchor on code that shipped, survived, and served a goal. Never on lines written, suggestions accepted, or pull-request counts.

    02

    Team-level, never individuals

    The conformant score aggregates to repository, team, and tool. It never ranks or scores individual developers. Individual-level views, where a tool offers them, sit outside the conformant score, are opt-in, and are governed by your own policy.

    03

    Cost-aware by construction

    Return on Code is not productivity. The denominator, including tokens, subscriptions, and the human cost of verifying AI work, is mandatory. A metric without cost is not conformant.

    04

    Triangulated and perception-corrected

    Self-report is provably biased, so an objective signal is always required. Perceived speedup may be reported only as a team-level gap against a measured anchor, never as a verdict.

    05

    Authorship-aware but privacy-safe

    Attribution of AI versus human work is required; surveillance is not. Measurement is aggregate, opt-in, and free of leaderboards.

    06

    Open and inspectable

    The formula is public and vendor-neutral. No black-box scores. Every figure carries its attribution method and a confidence level, so a number can be audited rather than taken on faith.

    The second rule is the one that makes the standard safe to adopt under GDPR and works-council agreements: the public score is about teams and tools, not people. Individual-level views, where offered, are opt-in, sit outside the conformant score, and stay governed by your own policy.

    The headline metric

    Code Yield: ship, last, and matter, multiplied.

    Code Yield is the share of committed AI-authored code that ships, lasts, and matters, expressed as a rolled throughput yield: the product of three gate yields, not their average.

    Code Yield = Ship × Last × Matter

    Ship

    Did it make it in?

    The fraction of AI-authored committed lines that reach your default branch, rather than being abandoned in dead branches or reverted before merge.

    Signal: git history.

    Last

    Did it stay, and is it load-bearing?

    The fraction of shipped AI lines still present and active at 30 and 90 days. Persistence alone is not enough: dead code survives, orphaned code survives, code nobody dares touch survives longest. A line counts only if it is executed by tests, sits in a live file, is reachable from an entrypoint, or shipped in a release.
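    The load-bearing rule above can be sketched as a predicate: a line must still be present, and at least one of the four activity signals must hold. This is a minimal illustration, not a reference implementation. The `LineStatus` record and its field names are hypothetical, and how each signal is collected (coverage reports, reachability analysis, release manifests) is left open.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LineStatus:
    """Hypothetical per-line record at one observation window (e.g. T+30)."""
    present: bool                          # line still exists on the default branch
    executed_by_tests: bool = False        # hit by at least one test (coverage)
    in_live_file: bool = False             # file is still actively used
    reachable_from_entrypoint: bool = False
    shipped_in_release: bool = False


def is_load_bearing(s: LineStatus) -> bool:
    """Present AND at least one activity signal; persistence alone does not count."""
    return s.present and (
        s.executed_by_tests
        or s.in_live_file
        or s.reachable_from_entrypoint
        or s.shipped_in_release
    )


def last_yield(statuses: Iterable[LineStatus]) -> float:
    """Fraction of shipped AI lines that are load-bearing at the window."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    return sum(is_load_bearing(s) for s in statuses) / len(statuses)


shipped = [
    LineStatus(present=True, executed_by_tests=True),   # counts
    LineStatus(present=True),                            # dead code: survives, does not count
    LineStatus(present=False, shipped_in_release=True),  # deleted: does not count
    LineStatus(present=True, shipped_in_release=True),   # counts
]
print(last_yield(shipped))  # 0.5
```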

    Signal: git history, coverage, static reachability.

    Matter

    Was it for something, and did it not blow up?

    The fraction of durable AI lines tied to a real goal (a closed issue or shipped feature) and not implicated in a revert, hotfix, or incident within the observation window.

    Signal: issue links, release data, incident and revert history.

    Example (illustrative). A team commits 1,000 AI-authored lines in a week. 820 reach the default branch, so Ship is 0.82. 680 of those are still present and load-bearing at 30 days, so Last is 0.83. 540 of those in turn are tied to a closed ticket and clean of reverts, so Matter is 0.79. Code Yield is 0.82 × 0.83 × 0.79 = 0.54, reported as 54 out of 100. Those same three gates averaged would read 81. The rolled product is what exposes the leak, and what makes the score hard to game by inflating any single stage.
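    The worked example can be reproduced in a few lines. The sketch below assumes each gate is measured against the survivors of the gate before it, as in the example; the `GateCounts` type and function names are illustrative and not part of the standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateCounts:
    """Line counts at each gate for one cohort of AI-authored lines."""
    committed: int  # AI-authored lines committed
    shipped: int    # of those, lines that reached the default branch
    durable: int    # of shipped, lines still present and load-bearing
    mattered: int   # of durable, lines goal-linked and clean of reverts


def gate_yields(c: GateCounts) -> tuple[float, float, float]:
    """Return (Ship, Last, Matter), each relative to the previous gate."""
    if not (c.committed >= c.shipped >= c.durable >= c.mattered >= 0):
        raise ValueError("counts must be non-negative and non-increasing")

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return (
        ratio(c.shipped, c.committed),
        ratio(c.durable, c.shipped),
        ratio(c.mattered, c.durable),
    )


def code_yield(c: GateCounts) -> float:
    """Rolled throughput yield: the product of the gates, not their average."""
    ship, last, matter = gate_yields(c)
    return ship * last * matter


counts = GateCounts(committed=1000, shipped=820, durable=680, mattered=540)
ship, last, matter = gate_yields(counts)
print(round(ship, 2), round(last, 2), round(matter, 2))  # 0.82 0.83 0.79
print(round(code_yield(counts) * 100))                   # 54
print(round((ship + last + matter) / 3 * 100))           # 81 (the misleading average)
```

    With gates defined this way the product telescopes to mattered / committed, which is why a gain at one stage cannot hide a loss at another.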

    The cost layer

    Code Yield over cost: cost per realized change.

    Code Yield measures the quality of conversion. Return on Code requires putting it over cost. That denominator is what makes cross-tool comparison coherent at all: it collapses a seat-plus-usage tool, a token-priced tool, and a usage-based hybrid into one comparable figure, the cost per realized change.

    Cost has two tiers, and the public score uses only the first. The open tier is AI spend (tokens times model rate, plus allocated subscription) plus git-derived verification friction: fixup density and review round-trips, reported as friction indicators, not as invented hours. A second, modeled tier can layer on an actual verification-time estimate, but only where a tool captures human active editing time. It is labeled a model, computed with the customer, and never part of the public score.
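    A minimal sketch of the open tier follows. It makes several assumptions: the model rate is quoted per million tokens, a "realized change" is a change that passed all three gates, and fixup density is fixup commits divided by AI-attributed commits. None of these definitions is fixed by the text above, and all names are hypothetical. The friction indicators are returned as ratios and deliberately not converted into dollars or hours.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolUsage:
    """Hypothetical per-tool usage for one reporting period."""
    tokens: int                        # tokens billed in the period
    usd_per_million_tokens: float      # model rate (assumed unit)
    allocated_subscription_usd: float  # seat cost allocated to this scope


def ai_spend_usd(u: ToolUsage) -> float:
    """Open-tier AI spend: tokens times model rate, plus allocated subscription."""
    return u.tokens / 1_000_000 * u.usd_per_million_tokens + u.allocated_subscription_usd


def cost_per_realized_change(u: ToolUsage, realized_changes: int) -> Optional[float]:
    """AI spend divided by realized changes; None when nothing was realized."""
    if realized_changes <= 0:
        return None
    return ai_spend_usd(u) / realized_changes


def friction_indicators(fixup_commits: int, ai_commits: int,
                        review_round_trips: int, ai_pull_requests: int) -> dict:
    """Git-derived friction, reported as indicators rather than invented hours."""
    return {
        "fixup_density": fixup_commits / ai_commits if ai_commits else None,
        "round_trips_per_pr": review_round_trips / ai_pull_requests if ai_pull_requests else None,
    }


usage = ToolUsage(tokens=40_000_000, usd_per_million_tokens=5.0,
                  allocated_subscription_usd=300.0)
print(ai_spend_usd(usage))                   # 500.0
print(cost_per_realized_change(usage, 25))   # 20.0
print(friction_indicators(12, 60, 45, 30))   # {'fixup_density': 0.2, 'round_trips_per_pr': 1.5}
```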

    How it is reported

    Two levels, one durability metric, and a fair cross-tool view.

    Level 1, auditable

    The public, free score

    Code Yield plus cost per realized change. It counts only what happened in the repository and makes no assumption about the dollar value of output, so it is falsifiable. This is what gets published, scored, and quoted.

    Level 2, modeled

    The optional audit

    Assign a dollar value to realized output to get a return multiple. This requires assumptions, so it is computed with the customer, labeled a model, and never used as a public headline. Keeping it out of the free score is the defense against unfalsifiable vanity numbers.

    Companion metric

    Code Half-Life

    Run survival analysis on a cohort of AI-authored lines; the half-life is the time at which half of the cohort has been rewritten or deleted. Reported per tool and per model. It measures durability, not return per dollar, so it is a diagnostic alongside Code Yield, not a substitute.
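    One common way to run this kind of survival analysis is a Kaplan-Meier estimate, which handles lines that are still alive when observation ends (censored). The sketch below is an assumption about method, since the text does not name an estimator. Each observation is a hypothetical `(days_observed, removed)` pair for one line.

```python
from collections import Counter
from typing import Iterable, Optional


def code_half_life(observations: Iterable[tuple[int, bool]]) -> Optional[int]:
    """Kaplan-Meier median survival time in days.

    Each observation is (days_observed, removed): removed=True means the line
    was rewritten or deleted on that day; removed=False means it was still
    alive when observation ended (censored). Returns None if the estimated
    survival never drops to 0.5 within the observed period.
    """
    obs = list(observations)
    removals = Counter(t for t, removed in obs if removed)
    exits = Counter(t for t, _ in obs)  # removals plus censored lines, per day
    at_risk = len(obs)
    survival = 1.0
    for t in sorted(exits):
        d = removals.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            if survival <= 0.5:
                return t
        at_risk -= exits[t]  # lines leaving observation at t are no longer at risk
    return None


cohort = [(5, True), (10, True), (10, False), (20, True), (30, False), (30, False)]
print(code_half_life(cohort))             # 20
print(code_half_life([(30, False)] * 3))  # None: nothing removed yet
```

    Treating still-alive lines as censored rather than as survivors forever keeps young cohorts from looking artificially durable.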

    Procurement view

    Cross-tool yield

    The reason a neutral standard exists: no vendor can compare across tools, because each sees only its own usage. A naive "Tool A beats Tool B" ranking is invalid, confounded by which work each tool was given. Conformance requires stratifying by task type, repo area, and seniority, and reporting confidence with every figure. The output is per-org: in your repo, for your work, which tool earns its seat.
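    The stratification requirement can be sketched as grouping cohorts by task type, repo area, and seniority band before any tool comparison is made. Everything here is illustrative: the `Cohort` fields, the `min_lines` threshold, and the coarse "low"/"ok" confidence label are placeholders for whatever a conformant implementation actually reports. Seniority is a team-level band, not an individual identifier.

```python
from collections import defaultdict
from typing import NamedTuple


class Cohort(NamedTuple):
    tool: str
    task_type: str
    repo_area: str
    seniority: str  # aggregate band, e.g. "senior"; never a person
    committed: int  # AI-authored lines committed
    realized: int   # lines that passed all three gates


def stratified_yield(cohorts, min_lines=200):
    """Per-stratum, per-tool yield with a coarse confidence flag."""
    totals = defaultdict(lambda: [0, 0])  # (stratum, tool) -> [committed, realized]
    for c in cohorts:
        key = ((c.task_type, c.repo_area, c.seniority), c.tool)
        totals[key][0] += c.committed
        totals[key][1] += c.realized

    table = defaultdict(dict)
    for (stratum, tool), (committed, realized) in totals.items():
        table[stratum][tool] = {
            "yield": realized / committed if committed else None,
            "lines": committed,
            "confidence": "ok" if committed >= min_lines else "low",
        }
    return dict(table)


cohorts = [
    Cohort("tool_a", "bugfix", "api", "senior", 400, 200),
    Cohort("tool_b", "bugfix", "api", "senior", 100, 60),
    Cohort("tool_a", "feature", "ui", "junior", 600, 150),
]
for stratum, tools in stratified_yield(cohorts).items():
    print(stratum, tools)
```

    In this made-up data, tool_b shows a higher yield than tool_a on senior API bugfixes, but on only 100 lines, so the figure is flagged low-confidence. The two tools are never compared across different strata.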

    Guardrails

    What the standard refuses to do.

    • The public, conformant score never ranks or scores individuals. It aggregates to team, repository, and tool.
    • Individual-level views, where a tool offers them, sit outside the conformant score, are opt-in, and are governed by your own policy, under frameworks like GDPR and works-council agreements.
    • The perception gap is measured only at the team level, and only against an objective anchor. It is never a verdict on a person.
    • Measurement is opt-in. You control which repositories and tools are in scope, and every figure is exportable and traceable to how it was computed.

    Conformance and versioning

    What it takes to be Return-on-Code conformant.

    A measurement is conformant if it computes Code Yield as the rolled product of the three gates as defined, reports the observation windows it used (T+7, T+30, and T+90 days), states its attribution method and confidence, respects every principle above, and cites the standard version it follows.
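    The checklist above lends itself to a mechanical check on a report. The sketch below covers only the machine-checkable parts; the `YieldReport` fields and the rounding tolerance are assumptions, not a schema defined by the standard.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class YieldReport:
    """Hypothetical shape of a published measurement."""
    ship: float
    last: float
    matter: float
    code_yield: float
    windows_days: tuple[int, ...]  # observation windows actually used
    attribution_method: str
    confidence: str
    standard_version: str
    cost_per_realized_change: Optional[float]
    ranks_individuals: bool


def conformance_violations(r: YieldReport) -> list[str]:
    """Return the violated requirements; an empty list means none were found."""
    problems = []
    # Tolerance allows for a yield reported to two decimals (an assumption).
    if not math.isclose(r.code_yield, r.ship * r.last * r.matter, abs_tol=0.005):
        problems.append("Code Yield is not the product of Ship, Last, and Matter")
    if not r.windows_days:
        problems.append("observation windows not reported")
    if not r.attribution_method or not r.confidence:
        problems.append("attribution method or confidence missing")
    if not r.standard_version:
        problems.append("standard version not cited")
    if r.cost_per_realized_change is None:
        problems.append("cost is missing")
    if r.ranks_individuals:
        problems.append("public score ranks individuals")
    return problems


good = YieldReport(0.82, 0.83, 0.79, 0.54, (7, 30, 90),
                   "commit-trailer attribution", "medium", "v0.1", 20.0, False)
averaged = YieldReport(0.82, 0.83, 0.79, 0.81, (7, 30, 90),
                       "commit-trailer attribution", "medium", "v0.1", None, False)
print(conformance_violations(good))      # []
print(conformance_violations(averaged))  # product and cost violations
```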

    The standard is versioned. The metric set may evolve, as DORA went from four metrics to five; the version number, not the metric count, is the contract. This is draft v0.1: definitions are expected to sharpen before v1.0, and nothing here is settled until it survives scrutiny on real repositories.

    Return on Code FAQ

    The questions teams ask before they adopt it.

    What is Return on Code?
    Return on Code is the realized return on AI-generated code: how much of what you spent on AI coding tools became durable, verified, goal-linked code that shipped. It is a vendor-neutral, repo-local standard for measuring AI-code ROI, not a single tool's usage metric. The headline figure is Code Yield, set against a cost per realized change.
    Does the Return on Code standard rank individual developers?
    No. The conformant, public Return on Code score is reported at the repository, team, and tool level and never ranks individuals. A tool may offer individual-level views, but they sit outside the conformant score, are opt-in, and are governed by your own policy, including GDPR and works-council agreements. The standard is built to measure tools and workflows, not to score people.
    How is Code Yield calculated?
    As a rolled throughput yield: the product of three gate yields, Ship times Last times Matter, not their average. Because value leaks at every gate, three gates at 80% multiply to roughly 51%, not 80%. The rolled product is what exposes the leak and makes the number hard to game by inflating any single stage.
    Is the Return on Code standard open?
    Yes. The formula is public and vendor-neutral, with no black-box score. Every reported figure carries its attribution method and a confidence level, so it can be audited rather than taken on faith. The standard is versioned, and the metric set may evolve, as DORA's did, with the version number as the contract.
    What makes a measurement Return-on-Code conformant?
    It computes Code Yield as the rolled product of the three gates as defined, reports the observation windows it used (T+7, T+30, T+90 days), states its attribution method and confidence, respects every principle including the team-level rule, and cites the standard version. A measurement that ranks individuals in its public score, or that hides cost, is not conformant.
    How is Return on Code different from DORA?
    DORA measures software delivery performance in general (lead time, deploy frequency, change-fail rate, recovery). Return on Code measures the realized return on AI-generated code specifically, per tool, set against what it cost. They are complementary: DORA tells you how your delivery is trending; Return on Code tells you how much of your AI coding spend was actually worth it.

    Private beta

    Put a number on your Return on Code.

    We install on one repo and show you exactly how much of your AI code actually shipped, survived, and mattered.