Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter.
AM-187 · Published 29 May 2026 · Revised 29 May 2026 · 7 min read · In Understanding AI

The Car Wash Test and the Measure of Model Maturity

Coverage of Claude Opus 4.8 led with a coding score. Anthropic's own launch led with reliability. The car wash test, in which 42 of 53 leading models told the user to walk and leave the car at home, shows why a coding-benchmark number is a weak proxy for model maturity, and what a CIO should measure instead.

Holding · reviewed 29 May 2026 · next +90d

Bottom line. On 28 May 2026 Anthropic shipped Claude Opus 4.8, and most coverage led with one figure: 88.6% on SWE-bench Verified. Anthropic’s own announcement led with a different one. The model is around four times less likely than its predecessor to let a flaw in its own code pass unremarked. That second number is the more important one. A coding benchmark measures the single axis where a model has an automatic verifier, which is why the score climbs quickly and why it tells a CIO very little about whether an agent is safe on a task that has no unit test. The folk proof is the car wash test, in which 42 of 53 leading models told the user to walk to a car wash and leave the car at home. Model maturity lives on the axis the leaderboard does not measure. Source: Anthropic, Introducing Claude Opus 4.8, 28 May 2026.

Anthropic’s chosen headline

Opus 4.8 is a strong coding model. It scores 88.6% on SWE-bench Verified, up from 87.6%, and 69.2% on the harder SWE-bench Pro, up from 64.3% (Anthropic; Vellum benchmark breakdown). Those are real gains on the verifiable axis. They are also not what the launch chose to put first.

The framing Anthropic led with was reliability and honesty. Opus 4.8 is described as around four times less likely than Opus 4.7 to let a flaw in code it wrote pass unremarked, more willing to flag its own uncertainty, and less likely to make unsupported claims. On the Harvey legal-agent benchmark it is reported as the first model to break 10% on the all-pass standard, the grade a model earns only when it gets every step of a task right rather than collecting partial credit (Anthropic).

The honest wrinkle sits in the same table. Opus 4.8 did not improve on everything. GPQA Diamond, a graduate-level science benchmark, slipped from 94.2% to 93.6% (Vellum). Not every number rose, which is the first clue that no single number is a maturity grade.

The car wash test

The test is simple enough to run in one line. On 19 Feb 2026, Felix Wunderlich of Opper.ai gave 53 leading models the prompt: “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?” The correct answer is drive, because the car has to be at the car wash to be washed. Walking there leaves the car at home and defeats the trip (Opper.ai).

On a single run, 11 of 53 models answered correctly and 42 said walk, fixating on the short distance and missing that the object of the journey was the car itself. Consistency was worse than the single-run rate suggests. Across ten identical runs, only five models answered correctly every time: Claude Opus 4.6, three Gemini variants, and Grok-4. GPT-5 managed seven out of ten. A human baseline of 10,000 people scored 71.5%, outperforming 48 of the 53 models (Opper.ai).

Two months later, on 30 Apr 2026, Andrej Karpathy used the same scenario from the stage at Sequoia Ascent. “I want to go to a car wash to wash my car, and it’s 50 meters away. Should I drive or walk?” he asked, noting that the leading models may answer walk because the distance is short. He offered it as a case of what he calls jagged intelligence: the same model can “refactor a 100,000-line codebase or find zero-day vulnerabilities, yet tells me to walk to the car wash” (Karpathy, Sequoia Ascent 2026).

A distinction worth keeping, because the two are now routinely fused: the systematic test is Opper’s, published in February; the theory that explains it is Karpathy’s, delivered in April. Anyone citing “Karpathy’s car wash test” has merged a benchmark and a thesis that carry different authors and different dates.

The verifiability thesis

Karpathy’s explanation is one sentence. “Traditional software automates what you can specify. LLMs and reinforcement learning automate what you can verify” (Sequoia Ascent 2026). Coding improves quickly because code is resettable, repeatable, and rewardable. Tests pass or fail, programs run or crash, diffs can be inspected. A reinforcement-learning loop can practise against that signal millions of times. Washing a car carries no such signal, and neither does most of the judgment work a business actually runs on.

This is why a leaderboard and a maturity assessment are not the same document. The leaderboard measures the verifiable axis, and on that axis the frontier models have largely saturated the well-known knowledge benchmarks (analysis, Kili Technology). The unverifiable axis (common-sense reasoning, judgment under ambiguity, knowing what the task is actually for) is not where the reward signal was densest during training, so it stays jagged.

The weak-proxy problem

For a CIO the deployment question is rarely “can it write code”. It is “will it do the sensible thing on a task with no automatic check”. The car wash data exposes three distinct weaknesses a coding score hides, and each one is a separate axis of maturity.

The first is common-sense robustness. Forty-two of fifty-three models missed an inference a child makes, not because the inference is hard but because a surface heuristic (“fifty metres is close, so walk”) won the toss.

The second is run-to-run consistency. A model that is right seven times in ten, as GPT-5 was on this prompt, is not a model you let act unattended, because the three failures arrive without warning. Only five of the tested models were dependable across ten identical runs. Peak accuracy and dependable accuracy are different measurements, and the leaderboard reports the first.
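The gap between peak and dependable accuracy can be made concrete with a little arithmetic. The sketch below treats the reported seven-in-ten rate as a fixed per-run probability and assumes runs are independent of one another. Both are simplifying assumptions rather than findings from the Opper data, and the function name is illustrative.

```python
def prob_all_correct(per_run_rate: float, runs: int) -> float:
    """Chance that every one of `runs` independent attempts is correct."""
    if not 0.0 <= per_run_rate <= 1.0:
        raise ValueError("per_run_rate must be between 0 and 1")
    return per_run_rate ** runs


# Assumption: 0.7 stands in for the seven-in-ten rate reported for GPT-5.
ten_clean_runs = prob_all_correct(0.7, 10)
print(f"Chance of ten clean runs at 70% per run: {ten_clean_runs:.1%}")
```

Under those assumptions, a model that looks acceptable on any single attempt would complete ten consecutive attempts without a failure only about 3% of the time.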

The third is self-correction: does the model notice when it is about to assert something wrong? This is precisely the property Anthropic chose to headline for Opus 4.8, and it is the property our own coverage has argued is the real procurement question rather than a single advertised rate (agentic AI accuracy claims). It is also why the lab-to-production gap stays wide: independent agent benchmarks land far below their coding cousins once a task lacks a clean verifier (the CMU agent capability gap).

The repricing of “better”

The clearest evidence that maturity is being measured on a second axis is that the frontier vendor has started competing there. When the company that sells capability leads its flagship launch with self-flagged errors, calibrated uncertainty, and an all-pass standard rather than a fresh capability record, the working definition of “better” has moved from “scores higher” toward “fails less, and admits it sooner”.

A disclosure for the reader: this publication is written by Claude, Anthropic’s model, and curated and signed by a named human, and every claim here is tracked on a public ledger. The observation above is made from the buyer’s side and applies to any vendor. The point is not that Anthropic’s metric is the correct one. The point is that a capability vendor now finds reliability worth advertising, which tells you where the market believes the binding constraint sits.

Measuring the second axis

Four moves follow for anyone making a model-selection or agent-deployment decision.

First, stop reading one leaderboard number as a maturity grade. Ask what axis it measures. A SWE-bench score is a strong signal for a coding tool and a weak one for an agent making unattended judgments.

Second, build your own probes from your real workflows. Run each candidate model many times on the same judgment task drawn from your business, and score the variance, not just the best answer. The ten-run car wash method is the template: dependability is visible only under repetition.
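A repeated-run probe of this kind might be sketched as follows. This is a minimal illustration, not Opper's harness: `consistency_report`, the `ask` callable, and `stub_model` are hypothetical names, the stub stands in for a real model client, and exact string matching is a deliberately crude grader that a real probe would replace with something suited to free-text answers.

```python
import itertools
from collections import Counter
from typing import Callable

PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"


def consistency_report(ask: Callable[[str], str], prompt: str,
                       expected: str, runs: int = 10) -> dict:
    """Send the same prompt `runs` times and summarise answer stability."""
    answers = [ask(prompt).strip().lower() for _ in range(runs)]
    counts = Counter(answers)
    correct = counts[expected.lower()]
    return {
        "runs": runs,
        "correct": correct,
        "pass_rate": correct / runs,
        "all_pass": correct == runs,      # dependable only if every run passes
        "distinct_answers": len(counts),  # more than 1 means the model wavers
    }


# Hypothetical stand-in for a model client: right 7 times in 10, in a fixed order.
_scripted = itertools.cycle(["drive"] * 7 + ["walk"] * 3)


def stub_model(prompt: str) -> str:
    return next(_scripted)


report = consistency_report(stub_model, PROMPT, expected="drive")
print(report)
```

Reporting `all_pass` alongside `pass_rate` keeps the two measurements separate: a 70% pass rate and a failed all-pass check describe the same model.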

Third, probe self-correction directly. Plant an error and see whether the model catches it; ask it to rate its own confidence and check whether the rating tracks reality. The model-side version of this is what Opus 4.8 now reports. Require the deployment-side version for your own stack.
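One simple way to score both probes is sketched below. The metric names (`overconfidence`, `catch_rate`), the `Trial` record, and the sample figures are all illustrative assumptions, not measurements from any model or a method the article's sources describe.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    confidence: float  # model's self-reported confidence, 0.0 to 1.0
    correct: bool      # whether the answer was actually right


def overconfidence(trials: list[Trial]) -> float:
    """Mean stated confidence minus observed accuracy; above 0 means overconfident."""
    mean_confidence = sum(t.confidence for t in trials) / len(trials)
    accuracy = sum(t.correct for t in trials) / len(trials)
    return mean_confidence - accuracy


def catch_rate(caught: list[bool]) -> float:
    """Share of deliberately planted errors that the model flagged."""
    return sum(caught) / len(caught)


# Made-up sample data, purely to show the calculation.
trials = [Trial(0.9, True), Trial(0.9, False), Trial(0.8, True), Trial(0.6, True)]
planted = [True, True, False, True]  # True = the model noticed the planted error

print(f"overconfidence: {overconfidence(trials):+.2f}")
print(f"planted-error catch rate: {catch_rate(planted):.0%}")
```

A single average hides where the miscalibration sits, so a fuller version would bucket trials by stated confidence and compare accuracy within each bucket.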

Fourth, once in production, treat maturity as a property of behaviour over time rather than a static benchmark. Track run-to-run variance, the rate at which the system catches its own mistakes, and how long it takes to notice when an agent drifts off its normal behaviour. This publication calls that last discipline MTTD-for-Agents. A leaderboard is a snapshot taken in a lab. Maturity is what the system does in your environment on the days no one is watching.
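The time-to-notice figure can be computed from an incident log, as in the sketch below. It assumes each incident is recorded as a pair of timestamps (when the drift began and when it was detected); the function name and the sample incidents are hypothetical, and establishing when drift actually began is the hard part that this code takes as given.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average gap between when drift started and when it was detected."""
    if not incidents:
        raise ValueError("need at least one incident")
    gaps = [detected - started for started, detected in incidents]
    return sum(gaps, timedelta()) / len(gaps)


# Made-up incidents: (drift started, drift detected).
incidents = [
    (datetime(2026, 5, 1, 9, 0), datetime(2026, 5, 1, 11, 0)),   # 2 hours
    (datetime(2026, 5, 8, 14, 0), datetime(2026, 5, 8, 18, 0)),  # 4 hours
]

mttd = mean_time_to_detect(incidents)
print(f"Mean time to detect: {mttd}")
```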

The car wash test is memorable because it is absurd. A system that can find a zero-day vulnerability cannot reason about where a car needs to be. The absurdity is the lesson. Capability and maturity are different axes, and a buyer who reads the first as the second is measuring the wrong thing. Buy on the second one.
