The CMU TheAgentCompany 2026 benchmark figure (30.3% task completion for best-in-class frontier model, up from 24% in 2024) is the current capability constraint for enterprise agentic AI. Capability trajectory projects to ~40% by late 2027, which does not cross the 95% production-readiness threshold within the 3-year TCO horizon enterprise business cases operate against. The Stanford DEL 12% durable cohort operates within the 30.3% (narrow scope + human-in-the-loop + GAUGE-dimensional governance discipline), not around it. Capability is not the variable that separates the 12% from the 88%.

Re-review 10 Jun 2026: capability legs verified. 30.3% full-completion top score (Gemini 2.5 Pro, 175-task set, 39.3% with partial credit) per TheAgentCompany paper v2 (arxiv 2412.14161v2); the 24% 2024 baseline (Claude 3.5 Sonnet) per v1. No frontier model found above 50% on the benchmark, so watch (1) has not fired. The Stanford DEL 12%-durable-cohort leg failed primary-source verification: the cited Enterprise AI Playbook contains no 12/88 cohort (see AM-029 correction, 10 Jun 2026). Extended same day after the full exposure-map investigation (docs/editorial/stanford-1288-exposure-map-2026-06-10.md): the 12/88 has no primary source anywhere as an ROI or cohort distribution — it is the IDC/Lenovo pilot-graduation finding (roughly 88% of AI proof-of-concepts never reach production, roughly 12% graduate; via CIO.com, Mar 2025) fused with the Stanford DEL's name and an invented methodology. Any future re-anchor of the cohort-behaviour leg must use the IDC graduation metric or drop the cohort framing. The capability-constraint reading stands; the cohort-behaviour leg is unanchored. Status Up -> Partial. Watches unchanged: (1) frontier model crossing 50%, (2) capability-wait vs governance-discipline equivalence studies, (3) benchmark refresh shifting task distribution.
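
The trajectory arithmetic in the claim can be sanity-checked with a rough linear extrapolation. The sketch below is illustrative only and is not the projection method used in the source piece, which the claim does not state. It assumes the 24% baseline dates to December 2024 (the v1 paper month), treats the 30.3% figure as of the claim's April 2026 publish month, reads "late 2027" as December 2027, and assumes growth stays linear. All four are assumptions, and the constant names are hypothetical.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (day of month ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# Figures taken from the claim text; the dates are assumptions (see lead-in).
BASELINE_DATE = date(2024, 12, 1)  # assumption: month of the v1 paper
BASELINE_PCT = 24.0                # 2024 baseline completion rate
CURRENT_DATE = date(2026, 4, 1)    # assumption: claim publish month
CURRENT_PCT = 30.3                 # current top full-completion score
TARGET_DATE = date(2027, 12, 1)    # assumption: reading of "late 2027"
THRESHOLD_PCT = 95.0               # production-readiness threshold in the claim
HORIZON_MONTHS = 36                # 3-year TCO horizon in the claim

# Linear rate of improvement in percentage points per month.
slope = (CURRENT_PCT - BASELINE_PCT) / months_between(BASELINE_DATE, CURRENT_DATE)

# Extrapolated completion rate at the target date.
projected = CURRENT_PCT + slope * months_between(CURRENT_DATE, TARGET_DATE)

# Months needed to reach the threshold if the same linear rate held.
months_to_threshold = (THRESHOLD_PCT - CURRENT_PCT) / slope
crosses_within_horizon = months_to_threshold <= HORIZON_MONTHS

print(f"Projected completion at target date: {projected:.1f}%")
print(f"Months to reach {THRESHOLD_PCT:.0f}% at this rate: {months_to_threshold:.0f}")
print(f"Crosses threshold within {HORIZON_MONTHS} months: {crosses_within_horizon}")
```

Under these assumptions the extrapolation lands in the high 30s by late 2027, broadly in line with the claim's ~40%, and the 95% threshold falls well outside the 36-month horizon. Different date assumptions or a non-linear growth model would shift the numbers.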

Published

24 Apr 2026

Last reviewed

10 Jun 2026

Next review

+37d · 25 Jul 2026

Source piece

The CMU 30.3%: the enterprise agent capability gap

Primary sources

Correction log

10 Jun 2026: One leg unanchored on re-review. The CMU capability figures verify cleanly (30.3% full completion for Gemini 2.5 Pro and 39.3% partial-credit on the 175-task TheAgentCompany set per paper v2; 24% for Claude 3.5 Sonnet in the Dec 2024 v1). The Stanford DEL '12% durable cohort' referenced in the claim text does not exist in the cited source: the Enterprise AI Playbook (Pereira, Graylin, Brynjolfsson, Apr 2026) studies 51 successful deployments and contains no 12/88 ROI cohort (full finding at AM-029, correction of 10 Jun 2026). The claim's capability-constraint argument holds on its own evidence; the sentence tying the constraint to the 12%/88% cohort behaviour has no verifiable referent. Status Up -> Partial.

Permalink: /holding/AM-031/



About this register

The Reporting register tracks claims published from articles addressed to senior enterprise IT leaders — CIOs, IT directors, heads of platform. Claims are reviewed on a 30–90 day cadence; each review either reaffirms the claim, marks one substantive part as Partial, or marks it Not holding once the underlying evidence has been overtaken.

Recent corrections in Reporting

AM-008 · Partial · 17 Jun 2026
Source-text figure re-review: Google's 2024 Environmental Report reports a 28% year-over-year increase to 8.1 billion gallons, not the 33% (from a 6.1 billion 2023 base) asserted at publish. The 8.1B 2024 figure and the Microsoft WUE 0.30 L/kWh / 39%-improvement figure are unchanged and verified. Article corrected to 28% and the unsupported 6.1B base removed; the claim text retains the original figure with this correction per the Holding-up protocol.
AM-132 · Partial · 10 Jun 2026
One of four legs unanchored on re-review. The claim text attributes '12% of deployments clearing 300%+ ROI with 88% at or below break-even at 12-18 months' to the Stanford DEL 2026 Enterprise AI Playbook. Full-text verification on 10 Jun 2026 found no such figure in that source: the playbook (Pereira, Graylin, Brynjolfsson, Apr 2026) studies 51 successful deployments by design and contains no ROI distribution, no 300%-plus cohort, and no break-even measurement point (full finding at AM-029, correction of 10 Jun 2026). The only verified figure carrying the same 12/88 numerals is IDC research with Lenovo (via CIO.com, Mar 2025): roughly 88% of AI proof-of-concepts never reach production and roughly 12% graduate — a pilot-to-production graduation metric, not an ROI distribution. The Gartner 28%, McKinsey 23%/17%, and MIT NANDA 95% legs verify; they support a small high-performing tail and a large struggling body, but none documents the two-peak bimodal shape the claim asserts. Status Up -> Partial.
AM-129 · Partial · 10 Jun 2026
One of three read-against anchors unanchored on re-review. The claim text cites 'Stanford Digital Economy Lab Enterprise AI Playbook (12/88 bimodal ROI distribution at 12-18 months)' and frames the realistic ROI band around 'the highest-discipline 12% cohort'. Full-text verification on 10 Jun 2026 found the playbook contains no 12/88 distribution, no bimodal ROI shape, and no 12-18-month ROI measurement point (full finding at AM-029, correction of 10 Jun 2026). The claim's core negative finding — no mid-market enterprise has produced a documented +240% ROI in 90 days under audited conditions — is unaffected; the McKinsey State of AI 2025 and MIT NANDA legs verify and continue to support it. The '12% cohort' framing has no verifiable referent. The only verified figure carrying the 12/88 numerals is IDC's pilot-graduation finding (roughly 88% of AI proof-of-concepts never reach production; via CIO.com, Mar 2025), a different metric. Status Up -> Partial.

Reviews coming up in Reporting

AM-063 · Holding · next +9d (27 Jun 2026)
AI agents executing financial transactions need a four-control bundle (action-approval gates by blast radius, kill-swit…
AM-061 · Holding · next +9d (27 Jun 2026)
Production agentic-AI costs at scale routinely run multiples of POC projections, and a layered optimisation programme c…
AM-003 · Partial · next +9d (27 Jun 2026)
GPT-5 Pro's tiered-subscription model forces enterprises to classify problems by computational difficulty — $200/month…

Referenced within Agent Mode AI by · 1 piece

The CMU 30.3%: the enterprise agent capability gap