The Real Economics of AI Coding Agents: ROI, Cost per Pull Request, and Where Local-First Wins

Frontier labs now ship more AI-written code than human-written code, but the viral ROI numbers are wrong. Here is the money math that survives CFO scrutiny.

Pillar | June 11, 2026 | 20 min read
Tags: AI coding ROI, cost per pull request, agentic AI market
Engineers at Anthropic report merging 10 to 30 AI-generated pull requests per day, and the company says 80-90% of its production code is now AI-authored. Meanwhile, the most rigorous randomized trial in the field found that experienced developers using AI tools were 19% slower on mature, million-line codebases, even though they believed they were faster.

Both of those things are true at the same time. That contradiction is the entire story of AI coding ROI in mid-2026, and the money math underneath it is far messier than either the vendor decks or the viral LinkedIn posts admit.

This piece works through the actual numbers: what a pull request costs in tokens, why the most-quoted ROI figures don't trace to real sources, what the agentic AI market is genuinely worth, and where a local-first stack quietly beats the cloud by 8-24× on per-token cost.

Here's the one-paragraph answer for anyone skimming. The real cost per pull request for a frontier AI coding agent in 2026 is $1-$30 in raw tokens at list price, and $20-$80 all-in once review time and rework are counted, per Faros AI's engineering benchmarks. Realistic productivity gains sit near 10% at the median, not the 2-3× vendors promise.

TL;DR: Frontier labs now author most of their code with agents, and the agentic AI market sits near $10.2B in 2026. But the famous "4:1 ROI" and "$37.50 per PR" figures are misattributions, median productivity gains are closer to 10% than 200%, and code churn has nearly doubled. The leaders separating from the pack are winning on governance and cost engineering, not tool selection.

Key takeaways:

  • Token cost per agent PR: $1-$30 at list price. All-in cost with review and rework: $20-$80. The $37.50 figure is a token rate, not a PR cost.
  • Plan against 10% median productivity gains (DX), not the 60% upper-quartile headline.
  • Only ~11% of enterprises run agents in production despite 79% adoption, per Databricks' 2026 State of AI Agents report.
  • AI-assisted PRs merge at roughly half the rate of human PRs, per LinearB's 2026 benchmarks.
  • Self-hosted open-weights models break even against cloud APIs at 5-10M tokens per month, with an 8-24× per-token cost advantage.
  • Code churn nearly doubled and refactoring fell ~60% between 2023 and 2025. The quality bill is real and compounding.

What does an AI-generated pull request actually cost?

Start with the number everyone gets wrong. A figure of "$37.50 per incremental PR," usually attributed to Faros AI, has circulated widely in business cases this year. It doesn't hold up.

The most plausible origin is Anthropic's published pricing: $37.50 per million output tokens for Opus 4.6+ when context exceeds 200K tokens, a tier introduced in mid-2026 for long-context agent workloads. Drop the "per million tokens" denominator and you get a scary per-PR number that overstates the output-token cost of a typical run by one to three orders of magnitude.

Run the actual math. An agent run producing 2,000 output tokens at the $37.50/M rate costs about $0.075 in output tokens. Even a heavy long-context run producing 50,000 output tokens costs $1.88 in output alone.
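
The arithmetic above is easy to reproduce. This is a minimal sketch: the function name is illustrative, and it prices output tokens only, ignoring input tokens, caching, and retries.

```python
def output_token_cost(output_tokens: int, usd_per_million: float) -> float:
    """Return the USD cost of `output_tokens` at a per-million-token rate."""
    return output_tokens * usd_per_million / 1_000_000


# Long-context output rate quoted in the text, in USD per million tokens.
LONG_CONTEXT_OUTPUT_RATE = 37.50

small_run = output_token_cost(2_000, LONG_CONTEXT_OUTPUT_RATE)
heavy_run = output_token_cost(50_000, LONG_CONTEXT_OUTPUT_RATE)

print(f"2,000 output tokens:  ${small_run:.3f}")
print(f"50,000 output tokens: ${heavy_run:.3f}")
```

Swap in your own token counts from agent telemetry; the point is that the rate only becomes a dollar figure once it is multiplied by a token count.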

The honest per-PR picture has four cost layers, and you need all of them:

Cost layer | Typical range (mid-2026) | Source
Raw tokens per agent PR (read, write, test, iterate) | $1-$30 at Opus list price | Anthropic pricing, Faros benchmarks
Median SWE-bench-style task | $0.46 in tokens, ~8 min wall clock | METR cost analysis
90th / 99th percentile task | $3.20 / $22+ in tokens | METR cost analysis
All-in cost per PR (tokens + plan + review time) | $20-$80 | Faros AI
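
One way to see how the all-in row builds up from the layers is to model it directly. This is a rough sketch, not Faros AI's method: the field names are illustrative, and every input in the example (seat price, PRs per month, review minutes, reviewer rate) is a hypothetical placeholder you should replace with your own telemetry.

```python
from dataclasses import dataclass


@dataclass
class PrCostInputs:
    token_usd: float              # raw token spend for this PR
    seat_usd_per_month: float     # plan or seat price
    prs_per_month: int            # agent PRs the seat produces per month
    review_minutes: float         # human review plus rework time
    reviewer_usd_per_hour: float  # loaded cost of the reviewer


def all_in_cost_per_pr(c: PrCostInputs) -> float:
    """Tokens + amortized plan share + human review time, in USD."""
    plan_share = c.seat_usd_per_month / c.prs_per_month
    review_cost = c.review_minutes / 60 * c.reviewer_usd_per_hour
    return c.token_usd + plan_share + review_cost


# Hypothetical inputs, chosen only to illustrate the calculation.
example = PrCostInputs(
    token_usd=10.0,
    seat_usd_per_month=200.0,
    prs_per_month=40,
    review_minutes=20.0,
    reviewer_usd_per_hour=90.0,
)
print(f"All-in cost per PR: ${all_in_cost_per_pr(example):.2f}")
```

With these placeholder inputs, review time is the largest term, which is why a token-only figure understates the real cost.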

The long tail matters more than the median. METR's distribution shows the 99th percentile task costing $22+ in tokens and four-plus hours of wall time. For a team running 50-100 agent PRs per week, that implies a compute budget of $500-$2,000 weekly for typical work, roughly $10-$20 per PR, scaling to $5,000-$10,000 for high-complexity work.

That compute line is not a rounding error on the seat license. Budget it separately.

The 4:1 ROI claim doesn't trace to a source either

The companion claim, "4:1 ROI on Claude Code Max," also fails the citation test. It does not appear in a primary Faros AI publication. The closest real document, Faros's Measuring Claude Code ROI post, describes positive ROI without asserting that ratio.

And Faros's own AI Productivity Paradox report actively complicates it. The report found PRs merged up 98%, but review time up 91%, PR size up 154%, code churn up 91%, and bug volume up 9%.

Read those numbers together and the clean 4:1 story collapses. You shipped twice as many PRs, but each one is bigger, takes nearly twice as long to review, and the bug count went up.
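
To put a number on that, compound the two increases. This sketch assumes the review-time figure is per PR, as the paragraph above reads it, and that the increases multiply; if Faros measured total review time instead, the compounding does not apply.

```python
def compound_growth(*pct_increases: float) -> float:
    """Multiply growth factors, e.g. +98% and +91% -> 1.98 * 1.91."""
    factor = 1.0
    for pct in pct_increases:
        factor *= 1 + pct / 100
    return factor


# PRs merged up 98%, review time per PR up 91% (figures from the text).
review_load = compound_growth(98, 91)
print(f"Total review load vs. baseline: {review_load:.2f}x")
```

Under that assumption, the reviewing workload grows much faster than the PR count, which is the cost a 4:1 headline leaves out.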

If a vendor quotes you an ROI multiple, ask for the cohort definition, the time window, and whether review-time costs and rework are netted out. Most can't answer.

How big is the agentic AI market, really?

The headline figure for 2026 is roughly $10.21B, from a June 2026 Vantage Market Research report, projected to hit $388.30B by 2036 at a 43.8% CAGR. Treat the endpoint with deep suspicion.

The major analyst houses don't even agree on the near term. Precedence Research puts 2026 at $10.86B reaching $199.05B by 2034. Fortune Business Insights puts 2026 at $9.14B reaching $139.19B by 2034. MarketsandMarkets pegs 2032 in the $93-110B band.

That $59.9B spread between the 2034 endpoints isn't noise. Fortune counts only fully autonomous systems; Precedence counts AI assistants and tool-using LLMs too. Definitional drift, not measurement, drives the gap.

The 43.8% CAGR implies a 38× expansion in a decade. Only cloud computing (2008-2018) and the smartphone app economy (2009-2014) have managed anything comparable at scale. Plan against the 2030-2032 horizon where analysts actually cluster, which is roughly 5-10× the 2026 base. The 2036 number is a market-creation upper bound, not a base case.
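
You can sanity-check any analyst projection with two small functions. This is a sketch using the Vantage figures quoted above; the function names are illustrative.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by a start and end value."""
    return (end / start) ** (1 / years) - 1


def growth_multiple(rate: float, years: int) -> float:
    """Total expansion after compounding `rate` for `years` years."""
    return (1 + rate) ** years


implied_rate = cagr(10.21, 388.30, 10)      # 2026 -> 2036, in $B
implied_multiple = growth_multiple(0.438, 10)

print(f"Implied CAGR: {implied_rate:.1%}")
print(f"Implied 10-year multiple: {implied_multiple:.1f}x")
```

Running the same check on the 2034 endpoints from the other analyst houses shows how sensitive the endpoint is to a few points of annual growth.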

The coding agent sub-segment is consolidating fast

Within that market, coding agents are the most commercially proven slice. Cursor (Anysphere) sits at roughly $2B ARR as of February 2026 with 7 million monthly active users.

Anthropic's run-rate revenue, often cited in the $10-14B range, implies coding agent revenue in the low single billions. GitHub Copilot crossed 1.8 million paid seats per Microsoft's Q2 FY2026 disclosure.

The more interesting signal is consolidation. GitHub's February 2026 changelog announced that Claude and Codex are now available to Copilot Business and Pro users, meaning Microsoft now resells its rivals' coding models inside its own product. The number of standalone pricing decisions you actually face is shrinking.

The 79% adoption, 11% production gap

The Databricks 2026 State of AI Agents report, drawn from telemetry across 20,000+ organizations including over 60% of the Fortune 500, found 79% of enterprises actively using AI agents but only ~11% with systems in production. Because it's telemetry rather than survey self-reporting, that 11% is an operational measurement, and it's the most honest planning baseline available.

Other numbers float around. IDC puts "full production" near 7%. McKinsey's State of AI 2025 reports 45% "regularly using agentic AI in at least one business function," a much softer bar. The 7%-45% spread across these surveys is definitional drift again. For coding agents handling real production work, anchor on the low end.

Gartner captured the contradiction perfectly in two of its own publications. An August 2025 release forecast 40% of enterprise apps would embed task-specific agents by 2026, up from under 5% in 2025.

A September 2025 Gartner survey found just 15% of IT application leaders would even consider piloting fully autonomous agents. High deployment intent, low production tolerance.

What separates the 11% from the rest? Per Databricks, 57% of organizations cite governance as the primary friction source, and organizations with mature AI governance are 12× more likely to get projects into production.

Security is close behind: 45% of enterprises reported an AI-related data leak in the past 12 months, and 67% of those leaks came from unapproved shadow tools rather than sanctioned ones.

The gap is not a tooling problem. It's governance, security, and integration work, and that work is the binding constraint on capturing any of the ROI discussed above.

What the productivity data actually shows

The bull case rests on real, independent numbers. DX's research documented 60% more PR throughput, a jump from 1.4 to 2.3 PRs per developer per week, and roughly 3.6 hours saved weekly, per its Q4 2025 impact report. Vinted reported a 58% PR throughput increase after deploying agents with DX's measurement framework.

But the same firm published the correction that matters more. DX's late-2025 analysis, "AI productivity gains: more modest than expected," concluded real-world gains sit closer to 10% than the 2-3× vendors promised. The 60% figure describes upper-quartile agent users. The 10% figure describes the median engineer.

That spread is the productivity paradox in a single statistic. A small set of high-leverage users pulls the average up while most of the org sees incremental improvement.

Then there's the quality drag. LinearB's 2026 benchmarks, drawn from thousands of organizations, found AI-assisted PRs merge at roughly half the rate of human-written ones. More PRs opened is not more PRs shipped.

The studies that survive scrutiny

Three pieces of research deserve a permanent slot in your planning deck.

The GitHub/Microsoft Copilot study (n≈4,800, one of the few large randomized samples) found a 55% increase in task completion on a controlled benchmark, with a smaller and more contested ~13.5% time reduction.

The 2025 DORA report found 90% of organizations now use AI in development, yet the core DORA metrics show only modest improvement for AI-heavy organizations, and AI adoption correlates with a small but measurable increase in change failure rate in some segments. Individual developers feel faster; system-level gains depend on redesigning review, testing, and deployment around the tool.

And the METR randomized controlled trial is the sharpest counterpoint in the literature. Sixteen experienced developers worked 246 tasks on mature open-source repositories (averaging 22,000+ stars and 1M+ lines), with each task randomly assigned to allow or disallow AI tooling.

Tasks with AI allowed took 19% longer, while the developers believed they were faster. The authors attribute the slowdown to time spent prompting, reviewing, and reworking output that didn't match the codebase's conventions.

Sixteen developers and a small set of repositories bound the generalizability. But it remains the strongest evidence that 2-3× claims don't survive rigorous measurement. The planning rule that falls out of all of this: expect 10-30% gains at the individual level, 5-15% at the team level, and treat the gap as a measurement problem to solve.

Frontier labs: when AI-authored code becomes the majority

Against that sober backdrop, the frontier labs look like they live in a different universe. Anthropic has been the most public: executive statements and engineer testimonials place AI-assisted production code at 80-90%+ in mid-2026, and Claude Code's founders, Boris Cherny and Cat Wu, have said they personally merge 10-30 agent-generated PRs per day.

That's roughly 5-10× median senior-engineer throughput.

Google's Sundar Pichai said on the Q1 2026 Alphabet earnings call that "well over 30%" of new code at Google is AI-generated. Satya Nadella has put Microsoft's figure at 20-30%. Meta and OpenAI have stayed directional rather than numeric.

Two caveats before you put these numbers in a board deck. They're executive testimony, not independent measurement. And selection bias is doing heavy lifting: Anthropic engineers are, by construction, the heaviest Claude Code users on Earth, working in codebases built agent-first from the start.

Still, the directional claim is corroborated. Faros and LinearB telemetry both confirm a population of high-leverage users running tens of agent PRs per day. The labs are a preview of what the upper quartile looks like when governance, tooling, and culture all align. They are not a benchmark for your median team next quarter.

The quality bill: churn, refactor collapse, and security debt

The skeptical case is now as data-driven as the bullish one, and it's the binding constraint on production deployment.

GitClear's analysis of over 211 million changed lines found that between 2023 and 2025, code churn nearly doubled (lines changed within two weeks of commit rose from 3.1% to 5.7%) while the share of code classified as refactored fell from 24.1% to 9.5%, a roughly 60% drop. Copy/paste clone code rose 48%.
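
If you want to track the same signal in your own repositories, the churn definition above reduces to a simple ratio. This is a simplified sketch, not GitClear's methodology: the `ChangedLine` record is a hypothetical structure you would populate from your own Git history.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Sequence


@dataclass
class ChangedLine:
    committed: date           # when the line landed
    revised: Optional[date]   # when it was next rewritten or deleted, if ever


def churn_rate(lines: Sequence[ChangedLine], window_days: int = 14) -> float:
    """Share of changed lines revised within `window_days` of their commit."""
    if not lines:
        return 0.0
    window = timedelta(days=window_days)
    churned = sum(
        1 for line in lines
        if line.revised is not None and line.revised - line.committed <= window
    )
    return churned / len(lines)


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


# Figures quoted in the text: refactored share 24.1% -> 9.5%, churn 3.1% -> 5.7%.
print(f"Refactor share change: {relative_change(24.1, 9.5):.1%}")
print(f"Churn change: {relative_change(3.1, 5.7):.1%}")
```

Tracking the ratio per team and per month matters more than the absolute value, since the baseline varies by codebase.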

A CMU difference-in-differences study presented at ICSE 2025 found AI adoption correlated with 30% more static-analysis warnings and 40% higher cyclomatic complexity, controlling for repo, language, and experience.

Security tells the same story. Perry et al.'s study at IEEE S&P 2023, still the most-cited academic reference in the space, found developers using AI assistants wrote measurably more vulnerable code while being more confident it was secure. Veracode's generative AI coding reports found roughly 40% of generated snippets contained an OWASP Top 10 vulnerability.

Snyk reported a 67% year-over-year increase in AI-introduced vulnerabilities, with median time-to-fix 31 days longer than human-introduced bugs.

Connect the two threads and the structural risk is obvious. Agents produce code faster than humans can review it, and the refactoring that historically paid down debt has collapsed.

Whether this is a temporary adjustment phase or the new equilibrium is the open question of 2026. Don't bet your codebase on the optimistic answer; instrument churn and change failure rate as first-class metrics now.

Multi-agent orchestration: the 4× token multiplier

If you're moving from single-agent to multi-agent workflows, your token bill changes shape. Anthropic's own engineering write-up of its multi-agent research system documented a 4× token cost amplifier versus single-shot calls: a task costing ~$0.04 in one Sonnet call costs ~$0.16 routed through an orchestrator with sub-agents.

The amplification has two sources. The orchestrator issues its own planning, synthesis, and verification calls. And each sub-agent starts in a fresh context window with no compression benefit from prior turns.

Benchmark-scale data confirms the magnitude. Anthropic's SWE-bench performance write-up noted the full 500-problem suite cost roughly $1,900-$2,400 in API fees, or $3.80-$4.80 per task. Production agent runs that read code, write tests, execute, and iterate typically consume 200K-2M tokens per PR.

Three levers control this spend, per Anthropic's pricing page: model selection (Haiku 4.5 at $1/$5 per million input/output tokens, Sonnet 4.5 at $3/$15, Opus 4.5 at $5/$25), context tier (200K+ context carries 2-4× rates), and caching (cached input at 10% of fresh cost, huge for review workflows that re-read the same files). Gemini 3 Pro and GPT-5.5 in Microsoft Foundry price in the same band, with GPT-5.5 at a 30% premium to the GPT-5 family.
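
Two of those levers, model selection and caching, are easy to model. This sketch uses the Haiku and Sonnet rates and the 10% cached-input figure from the paragraph above. The model keys are plain labels rather than API identifiers, the token counts are hypothetical, and the model ignores the long-context tier and any cache-write charges.

```python
# (input, output) rates in USD per million tokens, from the text.
RATES = {
    "haiku-4.5": (1.0, 5.0),
    "sonnet-4.5": (3.0, 15.0),
}
CACHED_INPUT_FRACTION = 0.10  # cached input billed at 10% of fresh input


def run_cost(model: str, input_tokens: int, output_tokens: int,
             cached_share: float = 0.0) -> float:
    """USD cost of one agent run; `cached_share` is the cached part of input."""
    if not 0.0 <= cached_share <= 1.0:
        raise ValueError("cached_share must be between 0 and 1")
    in_rate, out_rate = RATES[model]
    fresh = input_tokens * (1 - cached_share)
    cached = input_tokens * cached_share
    input_cost = (fresh * in_rate + cached * in_rate * CACHED_INPUT_FRACTION) / 1e6
    output_cost = output_tokens * out_rate / 1e6
    return input_cost + output_cost


# Hypothetical run: 1M input tokens, 50K output tokens.
for label, model, share in [
    ("Sonnet, no cache", "sonnet-4.5", 0.0),
    ("Sonnet, 80% cached", "sonnet-4.5", 0.8),
    ("Haiku, no cache", "haiku-4.5", 0.0),
]:
    print(f"{label}: ${run_cost(model, 1_000_000, 50_000, share):.2f}")
```

For input-heavy agent runs like this one, the cached share moves the bill more than the output volume does.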

The real budget killer is burstiness, not the median. A Max plan covers the typical day, but a single long-horizon run on a complex codebase can consume 5-20% of a month's plan budget in one session.

Faros telemetry shows Max 20× power users consistently hitting caps with $200-$500 monthly overages. Get usage caps with defined overage rates in writing before you sign.

What seats cost in mid-2026

Tool | Tier | List price | Practical monthly cost per engineer
Claude Code | Pro / Max 5× / Max 20× | $20 / $100 / $200 | $200-$400 (Max 5× + overage)
GitHub Copilot | Business / Enterprise | $19 / $39 per seat | $40-$80
Cursor | Pro / Business | $20 / $40 | $40-$80
Cognition Devin | Team / Enterprise | $20 / $500 | $500-$1,500
Long-tail enterprise (Augment, Tabnine, Windsurf, etc.) | Enterprise | $30-$60 per seat | varies, plus overages

The pattern: list price is the floor, and the gap between list and practical cost is your token consumption profile.

Where local-first agents beat the cloud

This is the most under-reported cost story of 2026. Open-weights models have caught up with commodity hardware: Qwen3-Coder-30B runs on a 24GB consumer GPU and scores 46-55% on SWE-bench Verified, Devstral 24B runs on a 16GB GPU and scores 46%, and gpt-oss-20B runs competitively on a 16GB laptop GPU.

The cost math is stark. A self-hosted H100 running an open-weights coding model achieves roughly $0.62 per million tokens of effective output, versus $15 per million for Claude Sonnet 4.5-class cloud output.

That's an 8-24× per-token advantage that breaks even against cloud APIs at around 5-10M tokens per month, a threshold any serious agent deployment clears easily.
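
Here is the break-even logic as a sketch you can rerun with your own numbers. It treats $0.62 as the marginal local rate and adds a fixed monthly overhead for running the stack; the $100 overhead is a hypothetical placeholder, and the $5 cloud rate used for the low end of the range is one reading (the Haiku output price) rather than something the sources state.

```python
def cost_advantage(cloud_rate: float, local_rate: float) -> float:
    """How many times cheaper local output is per token."""
    return cloud_rate / local_rate


def break_even_millions(fixed_monthly_usd: float, cloud_rate: float,
                        local_rate: float) -> float:
    """Monthly volume (millions of tokens) where local matches cloud cost."""
    saving_per_million = cloud_rate - local_rate
    if saving_per_million <= 0:
        raise ValueError("local rate must be below the cloud rate")
    return fixed_monthly_usd / saving_per_million


LOCAL_RATE = 0.62  # USD per million output tokens, from the text

print(f"vs $15/M cloud: {cost_advantage(15.0, LOCAL_RATE):.1f}x")
print(f"vs $5/M cloud:  {cost_advantage(5.0, LOCAL_RATE):.1f}x")

# Hypothetical fixed overhead of $100 per month.
volume = break_even_millions(100.0, 15.0, LOCAL_RATE)
print(f"Break-even at about {volume:.1f}M tokens per month")
```

The break-even volume scales linearly with the fixed overhead, so your ops and hardware amortization assumptions dominate the answer.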

The tooling is mature enough to be boring. Tabby gives you a self-hosted agent with IDE plugins. Aider's repo-map feature cuts token cost dramatically on large codebases. Continue.dev, Cline, Roo Code, and OpenHands cover the full agentic loop, all deployable air-gapped against on-prem Git and CI/CD.

The compliance argument may matter more than the cost one. For financial services, healthcare, defense, and government, "the model never sees your proprietary code" is the difference between a permissible deployment and a regulatory violation under the EU AI Act's general-purpose AI enforcement regime and the extended U.S. Executive Order on AI safety.

The test-in-a-loop pattern is the cheapest optimization available

One workflow change cuts per-PR token cost by 30-60%: run tests locally inside the agent loop. The agent writes code, executes the test suite on local infrastructure at zero token cost, reads the failure output, and iterates. Aider, Claude Code, and OpenHands all support this natively.

A SQLite-embedded test database that resets across hundreds of agent iterations, plus Docker test environments that spin up in seconds, make every iteration nearly free. Compare that to a cloud-only setup where every test cycle burns a paid model turn.
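
The loop itself is small. This is a generic sketch, not how Aider, Claude Code, or OpenHands implement it: `propose_patch` is a hypothetical callback standing in for one paid model turn that edits the working tree, and the test command is whatever your project already uses.

```python
import subprocess
from typing import Callable, List, Tuple

TestRunner = Callable[[List[str]], Tuple[bool, str]]


def run_tests_locally(cmd: List[str]) -> Tuple[bool, str]:
    """Run the test command on local infrastructure; return (passed, output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def agent_loop(propose_patch: Callable[[str], None], test_cmd: List[str],
               max_turns: int = 5,
               runner: TestRunner = run_tests_locally) -> int:
    """Iterate until tests pass; return the number of model turns used."""
    feedback = ""
    for turn in range(1, max_turns + 1):
        propose_patch(feedback)             # the only step that costs tokens
        passed, output = runner(test_cmd)   # local execution, no token cost
        if passed:
            return turn
        feedback = output                   # failure output drives the next turn
    raise RuntimeError(f"tests still failing after {max_turns} turns")
```

A call might look like `agent_loop(my_model_turn, ["pytest", "-q"])`, where `my_model_turn` is your own wrapper around the agent. The `max_turns` cap is the design choice that matters, since it bounds the worst-case spend on a single PR.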

The sane architecture for most organizations is hybrid. Route the privacy-sensitive and high-volume-low-complexity 30-40% of the workload to local-first infrastructure, and keep frontier cloud models for the complex long-horizon work where capability gaps still justify the rate card.
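
A quick way to size the hybrid split is a blended per-token rate. This sketch assumes the 30-40% share is measured in output tokens rather than tasks, and it reuses the $0.62 and $15 per-million figures quoted earlier; both assumptions are worth checking against your own workload mix.

```python
LOCAL_RATE = 0.62   # USD per million output tokens, self-hosted figure from the text
CLOUD_RATE = 15.00  # USD per million output tokens, Sonnet-class figure from the text


def blended_rate(local_share: float, local_rate: float = LOCAL_RATE,
                 cloud_rate: float = CLOUD_RATE) -> float:
    """Average USD per million tokens when `local_share` of tokens run locally."""
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    return local_share * local_rate + (1 - local_share) * cloud_rate


def saving_vs_all_cloud(local_share: float) -> float:
    """Fractional reduction in token spend relative to an all-cloud setup."""
    return 1 - blended_rate(local_share) / CLOUD_RATE


for share in (0.30, 0.40):
    print(f"{share:.0%} local: ${blended_rate(share):.2f}/M, "
          f"saves {saving_vs_all_cloud(share):.1%}")
```

Because the local rate is so small next to the cloud rate, the saving tracks the local share almost one for one.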

What this means for you

If you own an engineering budget, here is the playbook the data supports.

Build the business case on per-PR token math, not viral ratios. The 4:1 ROI and $37.50-per-PR figures will not survive CFO scrutiny because they don't trace to primary sources. Use $1-$30 tokens per PR, $20-$80 all-in, and your own telemetry.

Discount vendor productivity claims 50-70%. DX, LinearB, DORA, and METR converge on 10-30% individual gains and 5-15% team-level gains. Plan against 10% and treat upside as found money.

Budget compute separately from seats. The long tail (the 10% of tasks driving 50% of cost) is what blows up forecasts. Negotiate hard caps with defined units and overage rates in the contract.

Plan against the 11% production baseline. The gap between adoption and production is governance work. Organizations with mature AI governance reach production 12× more often. That investment outperforms any tool swap.

Stand up local-first for the sensitive 30-40%. The 8-24× cost advantage and the compliance posture justify the engineering effort, and the break-even arrives at 5-10M tokens per month.

Instrument quality before the debt compounds. Put churn, refactor rate, and change failure rate on the same dashboard as PR throughput. The GitClear and CMU data are early-warning signals, and the labs that merge 90% AI-authored code can do so because their review and test discipline absorbed it first.

The agents are real, the spend is real, and the gains are real but smaller than advertised. The winners in 2026 aren't the teams with the best model. They're the teams that did the unglamorous math.

Frequently asked questions

How much does an AI-generated pull request actually cost?

Raw token cost runs $1-$30 per PR at Claude Opus list pricing depending on codebase size and agent turns, with METR measuring a $0.46 median for SWE-bench-style tasks. All-in cost, including review time and rework, lands at $20-$80 per PR for a typical Claude Code Max deployment per Faros AI benchmarks. The widely shared "$37.50 per PR" figure is a misreading of Anthropic's $37.50 per million output token long-context rate.

Is the 4:1 ROI claim for Claude Code real?

It does not trace to a primary Faros AI publication. Faros's own AI Productivity Paradox report found PRs merged rose 98% but review time rose 91%, PR size rose 154%, and bugs rose 9%, which undermines any clean 4:1 figure. Build business cases on documented per-PR token math instead.

What productivity gain should engineering leaders actually plan for?

Plan for roughly 10% at the median, per DX's late-2025 analysis, with 10-30% for individual contributors and less at the team level. The 60% throughput figures describe upper-quartile users, not the median. The METR randomized trial even found experienced developers were 19% slower with AI tools on large, mature codebases.

When do local-first coding agents beat cloud APIs on cost?

A self-hosted H100 running an open-weights coding model costs roughly $0.62 per million effective output tokens versus $15 for Claude Sonnet-class cloud APIs, breaking even at around 5-10M tokens per month. Models like Qwen3-Coder-30B and Devstral 24B now hit 46-55% on SWE-bench Verified on consumer GPUs, making local-first viable for the privacy-sensitive 30-40% of workload.