Podcast · Episode 10 · 10:00

What vendor "successful pilot" references do not tell procurement

McKinsey's State of AI 2025 survey measures twenty-three percent of enterprises scaling agentic AI, thirty-nine percent experimenting, and thirty-eight percent with nothing in production or stopped. AM-140 walks the gap between vendor pilot references and scaled production at the procuring enterprise, the documented Klarna, Salesforce Agentforce, and GitHub Copilot walk-backs that show what those references typically obscure, and the six pre-pilot questions a procurement committee can require answered in writing before the contract closes.

Claims walked in this episode
  • AM-140 · The agentic AI pilot-to-production gap: what vendor 'successful pilot' references do not tell procurement (Holding)
  • AM-030 · The McKinsey 23%: the agentic AI scaling gap (Holding)
  • AM-128 · The MIT 95% GenAI-pilot-failure claim: what the State of AI in Business 2025 report actually measured (Holding)

ABBY

This is Agent Mode AI. I'm Abby. McKinsey's State of AI survey, published November 2025 with sample size one thousand four hundred ninety-one, measures twenty-three percent of enterprises scaling agentic AI, thirty-nine percent experimenting, thirty-eight percent with nothing in production or stopped. Today we're walking AM-140 — vendor "successful pilot" references presented at procurement-committee evaluation transfer to scaled production at the procuring enterprise at roughly the McKinsey twenty-three percent rate, and the gap is operational rather than capability-driven.

AVERY

I'm Avery. The procurement committee meets. The vendor's deck has seven named pilot references. Where does the gap come from.

ABBY

Vendor reference language is consistent across the major 2026 agentic AI sales motions. A successful pilot typically means the vendor's deployment team supported the implementation; the customer's pilot unit had above-average AI-readiness; the success metric was measured against the vendor's instrumentation rather than the customer's pre-deployment baseline; the time horizon was sixty to one hundred eighty days; the pilot did not run through a full audit, regulatory review, or change-of-leadership cycle. Each characteristic is procurement-relevant. The vendor team will not be embedded at the procuring enterprise's scale. The pilot unit's AI-readiness rarely matches the adjacent business units the scale-up will reach. The procuring enterprise's measurement regime is its own, not the vendor's.

AVERY

Three documented 2024 and 2025 walk-backs.

ABBY

Klarna's 2024 productivity narrative anchored on a seven-hundred-agent reduction figure that became a widely cited reference in vendor decks. Bloomberg reported Klarna's reversal on the eighth of May 2025; the original press release stayed live, which is the procurement-relevant detail, because the citation chain kept circulating unchanged. A procurement committee citing the Klarna deployment as a peer-class reference in mid-2025 was citing a number Klarna itself had walked back. Salesforce Agentforce's launch positioning implied broad customer adoption. The Salesforce IT division named a roughly two-hundred-customer figure in subsequent reporting through Q1 2026. A real number for an early-stage product, but materially smaller than the launch positioning implied. GitHub acknowledged a token-counting accuracy issue in its changelog of the eighteenth of April 2026 that affected billing and customer-side return-on-investment calculations derived from Copilot usage data.

AVERY

None of these are AI-does-not-work findings.

ABBY

Each is a the-headline-number-is-older-or-narrower-than-the-citation-suggests finding, which is exactly the class of issue a procurement committee should price into the evaluation framework. The vendor reference describes outcomes at the vendor's reference customer at the time of the reference. The procuring enterprise's outcomes will be measured by the procuring enterprise on its own cadence. The reference customer's pilot success and the procuring enterprise's scaled-production success are different events.

AVERY

What does the structural failure-mode evidence add.

ABBY

Three findings bound what is procurement-credible regardless of any single vendor reference. CRMArena-Pro, published by Salesforce AI Research in August 2025, measured frontier-class agents at roughly thirty-five percent multi-step reliability on a structured CRM benchmark. The agents complete individual steps competently; the multi-step sequence drifts. Carnegie Mellon's TheAgentCompany benchmark independently reproduces the thirty to thirty-five percent range on adjacent enterprise workloads. Both findings are mechanism-level, not incidental, which means they apply to the procuring enterprise's deployment regardless of the reference customer's pilot performance.

AVERY

The EchoLeak class.

ABBY

Common Vulnerabilities and Exposures identifier 2025-32711, disclosed in August 2025, named cross-agent prompt-injection where one compromised agent's output contaminates the input substrate of agents downstream in the workflow. Pilot deployments at the reference customer typically do not exercise the cross-agent attack surface; scaled production at the procuring enterprise will. Procurement committees that do not require cross-agent threat-model evidence are pricing a smaller risk than the deployment will face.

AVERY

The six pre-pilot questions.

ABBY

Question one. What is the procuring enterprise's pre-deployment baseline on the workflow the agent will own, measured by the procuring enterprise's own instrumentation, over four to six weeks before the agent goes live. Without this, the pilot's success cannot be evaluated against any meaningful comparison and the eventual scaling decision will rest on vendor-side numbers the chief financial officer cannot defend.

AVERY

Question two.

ABBY

What is the named owner of the agent's outcome at the procuring enterprise, with reporting line and accountability scope. Vendor references typically have a champion at the reference customer. The procurement committee needs the equivalent named at its own organisation, on the org chart, before the pilot starts.

AVERY

Question three.

ABBY

What is the agent registry the deployment will be added to, and what is the registry entry's content. Pilots without registry entries cannot be scaled, because the governance, security, and compliance teams cannot evaluate the scale-up against an inventory that does not exist. The first scaled-deployment review will surface the absence; better to surface it pre-pilot.

AVERY

Question four.

ABBY

What is the threat model for cross-agent delegation at the scale the procuring enterprise plans to operate, including the EchoLeak-class scenario. A pilot threat model that covers the pilot-unit attack surface is a different document from a scaled-production threat model. The procurement committee can require both.

AVERY

Question five.

ABBY

What are the contractual exit conditions, and have the data-portability and runtime-portability claims been tested rather than asserted. Vendor lock-in in agentic AI is an operating-cost issue, not just a procurement-clause issue. Tested portability is the difference between a ninety-day exit and a multi-year migration that exhausts the IT budget. AM-145 walks the seven exit-clause families that show up across enterprise master service agreements.

AVERY

Question six.

ABBY

What is the ninety-day, one-hundred-eighty-day, and three-hundred-sixty-five-day measurement plan the procuring enterprise will run on the scaled deployment, with named metrics, named owner, and named board-level review. A vendor reference describes outcomes at the reference customer at the time of the reference. The procuring enterprise's outcomes will be measured by the procuring enterprise on its own cadence; the procurement committee can require the cadence to exist before the contract closes.

AVERY

The GAUGE diagnostic.

ABBY

The six are operational preconditions, not contractual frills. The GAUGE diagnostic operationalises questions one, two, three, four, and six as a thirty-to-forty-five-minute working-group exercise the procurement committee can run with the vendor's deployment team in the room. Question five sits with legal rather than the diagnostic. Pilots scoring above fifty-five on GAUGE before the procurement decision are materially more likely to enter the twenty-three percent scaling cohort. Pilots scoring below forty enter the thirty-nine percent experimenting cohort or the thirty-eight percent deployed-and-stopped cohort. The free GAUGE diagnostic and the working-group spreadsheet for governance teams sit at agentmodeai dot com slash gauge.

AVERY

The procurement implication.

ABBY

A procurement committee unable to obtain these six answers in writing before the pilot is making a procurement decision on the same evidence base McKinsey's thirty-nine percent experimenting cohort started with. The McKinsey distribution is the prior; the answers move the deployment toward the twenty-three percent scaling cohort or away from it. A committee that approves on vendor references alone, without those preconditions, is pricing the transfer rate as if it were one hundred percent, which is the most common 2026 enterprise procurement mistake the data describes.

AVERY

Final word.

ABBY

The McKinsey State of AI 2025 survey, the Bloomberg Klarna reversal, the Salesforce Agentforce customer-adoption reporting, the GitHub Copilot token-counting changelog, the CRMArena-Pro and CMU TheAgentCompany benchmarks, and the EchoLeak common vulnerabilities entry are all linked at agentmodeai dot com slash holding slash question mark claim equals A-M one four zero. AM-140 is Holding. The next review is on the fifth of July 2026. Cadence is sixty days, with the next review window aligned to the McKinsey 2026 wave whenever it publishes.

AVERY

Holding-up. See you next Sunday.
