Podcast · Episode 9 · 12:14

The three questions every CIO should ask about a vendor accuracy claim

Vendor "ready-to-run" agentic AI positioning is not procurement evidence unless three questions are answered: accuracy on which task, against which baseline, measured how. AM-146 walks the 2026 procurement-grade reference shapes — CRMArena-Pro, CMU TheAgentCompany, WebArena, SWE-bench Verified, and the Anthropic Claude for Chrome disclosure pattern.

Claims walked in this episode
  • AM-146 · Agentic AI accuracy claims: the three questions every CIO should ask before 'ready-to-run' becomes a procurement decision (Holding)
  • AM-009 · Claude for Chrome: what Anthropic's 23.6% to 11.2% prompt-injection numbers tell procurement (Holding)
  • AM-140 · The agentic AI pilot-to-production gap: what vendor 'successful pilot' references do not tell procurement (Holding)

ABBY

This is Agent Mode AI. I'm Abby. A vendor posted a launch this quarter positioning the product as ready-to-run. The phrase is procurement-deck noise unless three questions are answered. Today we're walking AM-146, the claim that vendor accuracy positioning without named task, named baseline, and named methodology is not procurement evidence regardless of how the rate is described in marketing.

AVERY

I'm Avery. Frame the three questions.

ABBY

Question one is task definition. A vendor claim of ninety percent accuracy against an undefined task class is not comparable across vendors and is not auditable against the procuring enterprise's actual deployment scope. Question two is baseline. A rate without a baseline has no operational meaning at the procurement-deck level. Question three is methodology. A measurement produced by the vendor's own marketing team on a non-public corpus is not the same procurement input as a measurement produced against a published academic benchmark by an independent research group. The three together describe the difference between marketing-grade and procurement-grade evidence.
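
Show notes: a minimal sketch of that three-question gate, in Python. The AccuracyClaim fields and the is_procurement_grade helper are illustrative assumptions, not a standard disclosure schema; the point is only that a claim counts as procurement evidence when all three fields are named.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AccuracyClaim:
    """A vendor accuracy claim as it lands on the procurement desk.
    Field names are illustrative, not a standard disclosure format."""
    rate: float                        # headline rate, e.g. 0.90
    task: Optional[str] = None         # named task or benchmark, e.g. "SWE-bench Verified"
    baseline: Optional[str] = None     # named baseline, e.g. "human worker", "prior model"
    methodology: Optional[str] = None  # named, published measurement methodology

def is_procurement_grade(claim: AccuracyClaim) -> bool:
    """Procurement evidence only if all three questions are answered:
    accuracy on which task, against which baseline, measured how."""
    return all([claim.task, claim.baseline, claim.methodology])

# A "ready-to-run, ninety percent accurate" launch post with nothing named:
print(is_procurement_grade(AccuracyClaim(rate=0.90)))  # False
```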

AVERY

Start with question one. What does a named task look like.

ABBY

The procurement-grade reference is the academic benchmark layer. Three benchmarks define the current standard. CRMArena-Pro, published by Salesforce AI Research in August 2025, measures frontier-class agents at approximately thirty-five percent multi-step reliability on a structured CRM task corpus. The corpus is published, the task definitions are documented, the rate is reproducible by external research groups. Carnegie Mellon's TheAgentCompany benchmark independently reproduces the thirty to thirty-five percent range on adjacent enterprise workloads, providing structural rather than corpus-specific confirmation. WebArena is the canonical browser-agent benchmark; frontier models complete approximately thirty-six percent of end-to-end web tasks in published evaluations. SWE-bench Verified is the canonical code-generation benchmark, and it is the benchmark vendors actually publish against when they want a procurement-credible accuracy claim.

AVERY

Why the convergence around the thirty to thirty-six percent range.

ABBY

The agents complete individual steps competently; the multi-step sequence drifts. The reliability ceiling is mechanism-level rather than corpus-level. The structural signal is that three independently developed benchmarks reproduce the same range. The procurement-deck implication is that any vendor accuracy claim that does not reference at least one of these benchmarks, or a comparable named published methodology, is a marketing-grade claim, not a procurement-grade one. A vendor whose disclosure says 'internal evaluation across our customer base', without naming the corpus, the customer-base composition, or the methodology, is publishing a number the procuring enterprise cannot reproduce, cannot compare to alternatives, and cannot defend at audit.

AVERY

Move to question two. The baseline.

ABBY

Four baselines are procurement-relevant. The human-worker baseline, the prior-model baseline, the no-AI baseline, and the competitor baseline. Same numeric rate, opposite procurement decisions depending on which one is named. An agent that completes a task at sixty percent reliability against a human baseline of ninety-five percent is not ready-to-run for that task; it is a draft-tier capability that needs human review before output goes to the customer. An agent that completes the same task at eighty-five percent against a human baseline of seventy percent, on a consistency-heavy data-entry workload where humans are bored and inconsistent, may be procurement-grade ready. The rate alone tells the procurement committee nothing.
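
Show notes: a hedged sketch of that point, using the two hypothetical rates from the episode. The threshold logic is illustrative, not a procurement rule.

```python
def read_against_baseline(agent_rate: float, baseline_rate: float) -> str:
    """Illustrative only: the same headline rate reads differently
    depending on which named baseline it is measured against."""
    if agent_rate >= baseline_rate:
        return "candidate for ready-to-run on this task"
    return "draft-tier: human review before output reaches the customer"

# The episode's two hypothetical cases:
print(read_against_baseline(0.60, 0.95))  # draft-tier
print(read_against_baseline(0.85, 0.70))  # candidate for ready-to-run
```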

AVERY

The prior-model baseline.

ABBY

This is where the Anthropic Claude for Chrome disclosure becomes the reference shape. Anthropic's published security disclosure on the launch reports a prompt-injection success rate of twenty-three point six percent pre-mitigation, eleven point two percent post-mitigation, and zero percent on URL-injection variants after subsequent patches, against a defined attack corpus, with the patch cadence documented. That is what a prior-version comparison looks like at procurement-grade depth. The trajectory is published, not just the snapshot. Most vendor positioning in 2026 publishes a snapshot rather than a trajectory; the procurement decision typically depends on the trajectory.

AVERY

The no-AI baseline.

ABBY

This is the baseline most procurement decks under-specify. The relevant comparison is not agent versus perfect, or agent versus human; it is agent-augmented workflow versus current-no-AI workflow against the procuring enterprise's own measurement methodology. An agent that completes a task at seventy percent reliability against a no-AI workflow that completes the same task at sixty-five percent reliability is procurement-grade. An agent that completes at ninety percent against a no-AI baseline of eighty-eight percent may not be, depending on the per-task cost. The rate is not the procurement decision; the rate against the named baseline is.
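
Show notes: one way to write that down, with the uplift over the no-AI workflow netted against per-task cost. The value and cost figures here are hypothetical, not from the episode; the inputs are the procuring enterprise's own numbers, not the vendor's.

```python
def net_value_per_task(agent_rate: float, no_ai_rate: float,
                       value_per_success: float, cost_per_task: float) -> float:
    """Hedged sketch: the procurement input is the uplift over the
    enterprise's own no-AI workflow, net of per-task cost, measured
    with the enterprise's own methodology."""
    return (agent_rate - no_ai_rate) * value_per_success - cost_per_task

# 70% against a 65% no-AI baseline can clear the bar ...
print(net_value_per_task(0.70, 0.65, value_per_success=40.0, cost_per_task=0.50))  # positive
# ... while 90% against 88% may not, once per-task cost is counted.
print(net_value_per_task(0.90, 0.88, value_per_success=40.0, cost_per_task=1.00))  # negative
```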

AVERY

The competitor baseline.

ABBY

This is the most procurement-deck-useful baseline because it directly informs the vendor selection decision. Vendor accuracy claims rarely include it because publishing comparative numbers across vendors is competitively unpopular. The academic benchmark layer is the substitute. CRMArena-Pro, WebArena, and SWE-bench Verified all publish multi-vendor evaluations against the same task corpus. That is the comparison vendor marketing will not publish on its own.

AVERY

Question three. Methodology.

ABBY

Five dimensions matter. Sample size, evaluation type, independence, time horizon, reproducibility. A ninety percent accuracy claim based on fifty evaluation cases is a different signal from a ninety percent claim based on five thousand cases. Red-team measurements, production-incident statistics, synthetic benchmarks, and human-rated comparisons are not interchangeable. A measurement produced by the vendor's research team is procurement-grade if the methodology is published and reproducible. A measurement produced by the vendor's marketing team without methodology disclosure is not. A rate measured against a model release several quarters back is not the same procurement input as a rate measured against the current release. Procurement-grade disclosure includes the measurement date.
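
Show notes: the same five dimensions as a checklist, in Python. The field names are illustrative assumptions, not a standard disclosure format; the useful output is the list of dimensions a vendor disclosure leaves unanswered.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class MethodologyDisclosure:
    """Five methodology dimensions as a checklist. Field names are illustrative."""
    sample_size: Optional[int] = None       # 50 evaluation cases vs 5,000
    evaluation_type: Optional[str] = None   # red-team, production incidents, synthetic benchmark, human-rated
    independent: Optional[bool] = None      # produced or reproduced outside the vendor's marketing team
    measurement_date: Optional[str] = None  # which model release the rate was measured against
    reproducible: Optional[bool] = None     # published corpus an external party could re-run

def undisclosed_dimensions(d: MethodologyDisclosure) -> list:
    """Return the dimensions the vendor disclosure leaves unanswered."""
    return [f.name for f in fields(d) if getattr(d, f.name) is None]

print(undisclosed_dimensions(MethodologyDisclosure(sample_size=5000)))
# ['evaluation_type', 'independent', 'measurement_date', 'reproducible']
```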

AVERY

Reproducibility.

ABBY

Can an external party run the same evaluation and produce the same rate. The academic benchmark layer is reproducible by design. The vendor disclosure layer is reproducible only when the corpus is published. Vendor ready-to-run claims with non-public corpora are not reproducible and are not procurement-grade evidence. The Anthropic disclosure on Claude for Chrome sits at the high end of vendor-side procurement-grade methodology. The corpus is named. The methodology is described as red-team measurement against a defined attack corpus. The patch cadence creates a time series. The deploying enterprise can read the twenty-three point six, eleven point two, zero rates as procurement evidence because the methodology is disclosed at sufficient depth that an external party could in principle construct an equivalent corpus and reproduce the test.

AVERY

Three classes of evidence emerge.

ABBY

Reading the three questions against the 2026 evidence landscape, three classes carry different procurement weights. The academic-benchmark class: CRMArena-Pro, WebArena, SWE-bench Verified, CMU TheAgentCompany. External, reproducible, multi-vendor, named methodology. The vendor-disclosure class at procurement grade: the Anthropic Claude for Chrome pattern, published security and accuracy rates with named corpora, named methodology, and patch cadence. And the named-customer audited deployment class. McKinsey's internal Lilli platform, with approximately seventy-two percent adoption, five hundred thousand prompts monthly, and roughly thirty percent time savings over a six-month deployment. JPMorgan Chase's 2023 disclosure of one point five billion dollars in AI-attributable value across the two-hundred-thousand-employee LLM Suite. BT's Now Assist deployment, with a thirty-five percent case-resolution improvement verified through random checks, per Hena Jalil. UK Government Digital Service, with twenty-six minutes per day saved across twenty thousand staff in the fourth quarter of 2024.

AVERY

Why those four customer references count and most do not.

ABBY

Because the customer's own measurement methodology is named, the population is defined, and the time horizon is disclosed. Adoption metric, prompt volume, time savings, dollar value, case-resolution rate, daily-time-saved rate. Each named, each measured against a defined population. A vendor reference customer at this disclosure depth is procurement-grade evidence. A vendor reference customer that is named without comparable disclosure depth is closer to a marketing logo than to procurement evidence. The procurement-deck distinction is the disclosure depth, not the customer's name recognition.

AVERY

How does this connect to AM-140.

ABBY

AM-140 is the procurement committee's six pre-pilot questions. AM-146 adds three accuracy-disclosure questions on top. The pre-pilot question set asks whether the deployment will scale at the procuring enterprise; the accuracy-disclosure question set asks whether the vendor's rate claim is procurement evidence in the first place. Together they bound the procurement decision. A pilot authorised on a vendor accuracy claim that does not name task, baseline, or methodology is a pilot authorised on marketing evidence, and it will land on the McKinsey twenty-three percent scaling distribution at no better than the population rate.

AVERY

Final word.

ABBY

The three questions are auditable. Accuracy on which task, against which baseline, measured how. The CRMArena-Pro paper, the WebArena and SWE-bench Verified benchmarks, the CMU TheAgentCompany methodology, the Anthropic Claude for Chrome disclosure, and the named-customer audited references are all linked at agentmodeai dot com slash holding slash question mark claim equals A-M one four six. AM-146 is Holding. The next review is on the eighth of July 2026. The Sunday brief ships every week with what moved on the ledger.

AVERY

Holding-up. See you next Sunday.
