A model's score on a coding benchmark such as SWE-bench is a weak predictor of its reliability on tasks that have no automatic verifier; enterprise model-maturity assessment therefore has to be measured on a second axis that headline leaderboards do not capture, namely common-sense robustness, run-to-run consistency, and the model's willingness to flag and correct its own errors.
The claim is anchored on three primary sources dated between 19 Feb 2026 and 28 May 2026.
(1) The car wash test (Felix Wunderlich, Opper.ai, 19 Feb 2026): 53 leading models were given the prompt 'I want to wash my car. The car wash is 50 meters away. Should I walk or drive?'. The correct answer is drive. On a single run, 11 of 53 models answered correctly and 42 said walk; only five models (Claude Opus 4.6, three Gemini variants, Grok-4) were correct across all ten runs; GPT-5 was correct on seven of ten. The human baseline was 71.5% of 10,000 people.
(2) Andrej Karpathy, Sequoia Ascent 2026 (30 Apr 2026): the verifiability thesis ('Traditional software automates what you can specify. LLMs and reinforcement learning automate what you can verify') and the 'jagged intelligence' framing, using the same car wash scenario as illustration.
(3) Anthropic, Introducing Claude Opus 4.8 (28 May 2026): the model is around four times less likely than Opus 4.7 to let a flaw in its own code pass unremarked, flags its own uncertainty, and is the first model to break 10% on the Harvey legal-agent all-pass standard. GPQA Diamond slipped from 94.2% to 93.6% over the same release, evidence that individual benchmarks do not move monotonically across releases.
Scope: the claim is an observation about measurement, not a prediction that the un-verifiable-axis gap never closes and not a ranking of any vendor. The car wash test predates Claude Opus 4.8 and was not run on it (Opus 4.6 scored ten of ten in that study), so the claim must not be read as asserting that Opus 4.8 fails the test.
Attribution guard: the systematic test is Opper's; Karpathy supplied the explanatory thesis. The two carry different authors and dates and should not be merged into 'Karpathy's car wash test'.
Review cadence: 90 days (27 Aug 2026).
Trigger conditions to revisit before the next cadence:
(a) a benchmark that directly measures the un-verifiable axis (common-sense robustness, run-to-run consistency, self-correction) becomes a standard headline metric across frontier releases, which would shift the claim toward 'leaderboards now capture it' and warrant a move to Partial;
(b) a frontier release closes the car-wash-class common-sense gap to the human baseline with high run-to-run consistency, which would weaken the 'weak predictor' part of the claim;
(c) a vendor begins publishing run-to-run consistency and self-correction rates as standard release metrics, which would partially confirm the repricing observation in the article body.
Siblings: AM-146 (/agentic-ai-accuracy-claims-task-baseline-methodology/, the procurement-disclosure companion on named task, baseline, and methodology) and AM-162 (/karpathy-anthropic-bench-not-org-chart/, the same wrong-axis error applied to measuring engineers rather than models).
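The gap between single-run accuracy and run-to-run consistency that the car wash figures rest on can be made concrete with a short calculation. The sketch below is illustrative only: the class name `RepeatedRunResult` and its fields are hypothetical, not part of Opper's methodology, and the only data used are the figures quoted in the notes above (Claude Opus 4.6 at ten of ten, GPT-5 at seven of ten, 11 of 53 models correct on a single run, five of 53 correct on all ten).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatedRunResult:
    """Outcome of asking one model the same question several times.

    Hypothetical container for illustration; not taken from the study.
    """
    model: str
    correct: int  # number of runs that gave the right answer
    runs: int     # total number of runs

    @property
    def pass_rate(self) -> float:
        """Share of runs answered correctly (an average-case view)."""
        return self.correct / self.runs

    @property
    def all_pass(self) -> bool:
        """True only if every run was correct (a consistency view)."""
        return self.correct == self.runs


# The two per-model figures quoted in the evidence notes.
results = [
    RepeatedRunResult("Claude Opus 4.6", correct=10, runs=10),
    RepeatedRunResult("GPT-5", correct=7, runs=10),
]

# Study-level shares, out of the 53 models tested.
single_run_share = 11 / 53  # correct on a single run
all_pass_share = 5 / 53     # correct on all ten runs

for r in results:
    print(f"{r.model}: pass rate {r.pass_rate:.0%}, all-pass {r.all_pass}")
print(f"Correct on a single run: {single_run_share:.1%} of models")
print(f"Correct on all ten runs: {all_pass_share:.1%} of models")
```

A model that is right seven times in ten looks reasonable on an averaged metric and still fails an all-pass standard, which is the distinction the second axis is meant to surface.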
/holding/AM-187/
About this register
The Reporting register tracks claims drawn from articles addressed to senior enterprise IT leaders (CIOs, IT directors, heads of platform). Claims are reviewed on a 30–90 day cadence; each review either reaffirms the claim, marks one substantive part as Partial, or marks the claim Not holding once the underlying evidence has been overtaken.
Recent corrections in Reporting
- AM-003 · Partial · 28 May 2026
Pricing/model drift: a $100/mo Pro tier now sits beside the $200 tier (added 9 Apr 2026) and the premium model is GPT-5.5 Pro. Core thesis holds; the single-$200-tier framing no longer matches. Re-verify current tiers at chatgpt.com/pricing.
- AM-002 · Not holding · 6 May 2026
URL state changed. The /the-agentic-ai-revolution-real-world-success-stories-and-strategic-insights-from-2024-2025/ slug now serves a deliberately rewritten retrospective (claimId AM-130, "Agentic AI 2024-2025 retrospective", published 4 May 2026) built on audited primary sources. The 28 Apr 2026 redirect to /retractions/ has been lifted to allow that. The claim AM-002 itself remains Not holding: the original $3.50/dollar + 70% failure-rate framing was withdrawn and is not restored. AM-130 is a separate claim with its own evidence chain. Readers arriving at /holding/AM-002 see the withdrawal here; the article link surfaces the new piece at the URL where the original lived, with this entry as the audit trail.
- AM-121 · Holding · 2 May 2026
Klarna walk-back primary-source upgrade — added Siemiatkowski verbatim quotes via Bloomberg-cited-by-Fortune (9 May 2025) and the Uber-style freelance hiring detail via Entrepreneur. Closes the highest-priority evidence gap from the source dossier.
Reviews coming up in Reporting
- AM-136 · Holding · next +6d (4 Jun 2026)
Across the 24-month window May 2024 to April 2026, every major foundation-model provider (Anthropic, OpenAI, Google, AW…
- AM-020 · Holding · next +20d (18 Jun 2026)
The 40-60% TCO underestimate on enterprise agentic-AI deployments is not a cost-visibility failure — it is a cross-depa…
- AM-023 · Holding · next +20d (18 Jun 2026)
The 10 Apr 2026 Google AI Mode rollout to eight markets is the first vertical (restaurant booking) where agentic search…