A model's score on a coding benchmark such as SWE-bench is a weak predictor of its reliability on tasks that have no automatic verifier; enterprise model-maturity assessment therefore has to be measured on a second axis that headline leaderboards do not capture, namely common-sense robustness, run-to-run consistency, and the model's willingness to flag and correct its own errors.

The claim is anchored on three primary sources dated within the 19 Feb 2026 to 28 May 2026 window.

(1) The car wash test (Felix Wunderlich, Opper.ai, 19 Feb 2026): 53 leading models were given the prompt 'I want to wash my car. The car wash is 50 meters away. Should I walk or drive?'. The correct answer is drive. On a single run, 11 of 53 models answered correctly and 42 said walk; only five models (Claude Opus 4.6, three Gemini variants, Grok-4) were correct across all ten runs; GPT-5 was correct on seven of ten. The human baseline was 71.5% of 10,000 people.

(2) Andrej Karpathy, Sequoia Ascent 2026 (30 Apr 2026): the verifiability thesis ('Traditional software automates what you can specify. LLMs and reinforcement learning automate what you can verify') and the 'jagged intelligence' framing, using the same car wash scenario as illustration.

(3) Anthropic, Introducing Claude Opus 4.8 (28 May 2026): the model is around four times less likely than Opus 4.7 to let a flaw in its own code pass unremarked, flags its own uncertainty, and is the first model to break 10% on the Harvey legal-agent all-pass standard. GPQA Diamond slipped from 94.2% to 93.6% over the same release, evidence that individual benchmarks do not move monotonically from one release to the next.

Scope: the claim is an observation about measurement. It is not a prediction that the un-verifiable-axis gap never closes, and it is not a ranking of any vendor. The car wash test predates Claude Opus 4.8 and was not run on it (Opus 4.6 scored ten of ten in that study), so the claim must not be read as asserting that Opus 4.8 fails the test.

Attribution guard: the systematic test is Opper's; Karpathy supplied the explanatory thesis. The two carry different authors and dates and should not be merged into 'Karpathy's car wash test'.

Review cadence: 90 days (next review 27 Aug 2026).
Trigger conditions to revisit before the next cadence:

(a) a benchmark that directly measures the un-verifiable axis (common-sense robustness, run-to-run consistency, self-correction) becomes a standard headline metric across frontier releases, which would shift the claim toward 'leaderboards now capture it' and warrant a move to Partial;

(b) a frontier release closes the car-wash-class common-sense gap to the human baseline with high run-to-run consistency, which would weaken the 'weak predictor' strength;

(c) a vendor begins publishing run-to-run consistency and self-correction rates as standard release metrics, which would partially confirm the repricing observation in the article body.

Siblings: AM-146 (/agentic-ai-accuracy-claims-task-baseline-methodology/, the procurement-disclosure companion on named task, baseline, and methodology) and AM-162 (/karpathy-anthropic-bench-not-org-chart/, the same wrong-axis error applied to measuring engineers rather than models).
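
The difference between single-run accuracy and run-to-run consistency in the car wash figures can be made concrete with a small sketch. This is an illustration only, not Opper's methodology: the names `RunRecord`, `mean_accuracy`, and `all_pass` are hypothetical, and the order of GPT-5's correct and incorrect runs is assumed, since only the seven-of-ten count is reported.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical container: one model's outcomes over repeated runs."""
    model: str
    outcomes: list[bool]  # True where the run gave the correct answer


def mean_accuracy(record: RunRecord) -> float:
    """Share of runs that were correct (what a single-run score estimates)."""
    return sum(record.outcomes) / len(record.outcomes)


def all_pass(record: RunRecord) -> bool:
    """Run-to-run consistency in the strict sense: every run correct."""
    return all(record.outcomes)


# Counts reported for the car wash test.
MODELS_TESTED = 53
CORRECT_SINGLE_RUN = 11
CORRECT_ALL_TEN_RUNS = 5
HUMAN_BASELINE = 0.715

single_run_rate = CORRECT_SINGLE_RUN / MODELS_TESTED
all_pass_share = CORRECT_ALL_TEN_RUNS / MODELS_TESTED

# Only the per-model counts come from the source; the ordering of
# GPT-5's True/False outcomes below is an assumption for illustration.
records = [
    RunRecord("Claude Opus 4.6", [True] * 10),
    RunRecord("GPT-5", [True] * 7 + [False] * 3),
]

print(f"single-run correct share: {single_run_rate:.1%}")
print(f"all-ten-runs correct share: {all_pass_share:.1%}")
print(f"human baseline: {HUMAN_BASELINE:.1%}")
for record in records:
    print(record.model, f"{mean_accuracy(record):.0%}", "all-pass" if all_pass(record) else "not all-pass")
```

Under the all-pass criterion a model that is right seven times out of ten counts as a failure, which is why a consistency metric can sit well below an average-accuracy metric for the same model.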

Published

29 May 2026

Last reviewed

29 May 2026

Next review

+70d · 27 Aug 2026

Source piece

The Car Wash Test and the Measure of Model Maturity

Primary sources

Permalink: /holding/AM-187/




About this register

The Reporting register tracks claims published from articles addressed to senior enterprise IT leaders — CIOs, IT directors, heads of platform. Claims are reviewed on a 30–90 day cadence; each review either reaffirms the claim, marks one substantive part as Partial, or marks it Not holding once the underlying evidence has been overtaken.

Recent corrections in Reporting

AM-008 · Partial · 17 Jun 2026
Source-text figure re-review: Google's 2024 Environmental Report reports a 28% year-over-year increase to 8.1 billion gallons, not the 33% (from a 6.1 billion 2023 base) asserted at publish. The 8.1B 2024 figure and the Microsoft WUE 0.30 L/kWh / 39%-improvement figure are unchanged and verified. Article corrected to 28% and the unsupported 6.1B base removed; the claim text retains the original figure with this correction per the Holding-up protocol.
AM-132 · Partial · 10 Jun 2026
One of four legs unanchored on re-review. The claim text attributes '12% of deployments clearing 300%+ ROI with 88% at or below break-even at 12-18 months' to the Stanford DEL 2026 Enterprise AI Playbook. Full-text verification on 10 Jun 2026 found no such figure in that source: the playbook (Pereira, Graylin, Brynjolfsson, Apr 2026) studies 51 successful deployments by design and contains no ROI distribution, no 300%-plus cohort, and no break-even measurement point (full finding at AM-029, correction of 10 Jun 2026). The only verified figure carrying the same 12/88 numerals is IDC research with Lenovo (via CIO.com, Mar 2025): roughly 88% of AI proof-of-concepts never reach production and roughly 12% graduate — a pilot-to-production graduation metric, not an ROI distribution. The Gartner 28%, McKinsey 23%/17%, and MIT NANDA 95% legs verify; they support a small high-performing tail and a large struggling body, but none documents the two-peak bimodal shape the claim asserts. Status Up -> Partial.
AM-129 · Partial · 10 Jun 2026
One of three read-against anchors unanchored on re-review. The claim text cites 'Stanford Digital Economy Lab Enterprise AI Playbook (12/88 bimodal ROI distribution at 12-18 months)' and frames the realistic ROI band around 'the highest-discipline 12% cohort'. Full-text verification on 10 Jun 2026 found the playbook contains no 12/88 distribution, no bimodal ROI shape, and no 12-18-month ROI measurement point (full finding at AM-029, correction of 10 Jun 2026). The claim's core negative finding — no mid-market enterprise has produced a documented +240% ROI in 90 days under audited conditions — is unaffected; the McKinsey State of AI 2025 and MIT NANDA legs verify and continue to support it. The '12% cohort' framing has no verifiable referent. The only verified figure carrying the 12/88 numerals is IDC's pilot-graduation finding (roughly 88% of AI proof-of-concepts never reach production; via CIO.com, Mar 2025), a different metric. Status Up -> Partial.

Reviews coming up in Reporting

AM-063 · Holding · next +9d (27 Jun 2026)
AI agents executing financial transactions need a four-control bundle (action-approval gates by blast radius, kill-swit…
AM-061 · Holding · next +9d (27 Jun 2026)
Production agentic-AI costs at scale routinely run multiples of POC projections, and a layered optimisation programme c…
AM-003 · Partial · next +9d (27 Jun 2026)
GPT-5 Pro's tiered-subscription model forces enterprises to classify problems by computational difficulty — $200/month…