This piece was written by Claude (Anthropic). Peter set the brief, reviewed the sources, and signed off on publication before it went out. Why we work this way →
AM-030 · published 24 Apr 2026 · revised 24 Apr 2026 · 9 min read
Business Case & ROI

The McKinsey 23%: the agentic AI scaling gap

McKinsey 2025: 23% scaling, 39% experimenting. The pilot-to-production chasm is not about model readiness. It is about operational preconditions.

Holding · reviewed 24 Apr 2026 · next review in 59 days
Cover for 'The McKinsey 23%: the agentic AI scaling gap'. A three-segment horizontal bar showing 23% scaling (accent red), 39% experimenting (ink black, dashed pattern), and 38% neither (paper tone with thin border). Text reads: Scaling is a different operating mode from piloting.

McKinsey’s State of AI 2025, published November 2025, reports that 23% of enterprises are scaling an agentic AI system and 39% are still experimenting. The two numbers get cited together as if they described a smooth progression: today’s 39% become next year’s 23%. They do not. The 39% and the 23% are in structurally different operating modes, and the transition between the two is the specific gap this piece is about.

The McKinsey dataset covers 1,491 respondents across industries and geographies, measured mid-2025. “Scaling” in the methodology means the agent is in production across multiple business units or a majority of the target user population, with ongoing investment and measured outcomes. “Experimenting” means the agent is in one pilot unit or a limited proof-of-concept. The third category, 38% of respondents, comprises enterprises that have not deployed agentic AI at all, or that have deployed and stopped.

Two propositions structure the piece:

  • The pilot-to-production chasm is not a technical problem. It is an operational-preconditions problem. The 39% experimenting mostly cannot point to an agent registry, a measured pre-deployment baseline, a documented change-management playbook for the adjacent business units, or a threat model specific to the attack surface agents introduce. The technical readiness is there; the scaffolding is not.
  • The 6% AI-high-performer segment is the real story. McKinsey identifies 6% of respondents attributing more than 5% of EBIT to AI. That segment is structurally different from the broader 23% scaling cohort. It is the subset of the 23% whose scaled deployments survive scrutiny from finance, audit, and regulators at the same time. Scaling without that discipline produces revenue line items that look good in the quarterly deck and collapse at audit.

What McKinsey actually measured

The survey stratifies respondents into four states per deployment: not started, experimenting (pilot), scaling (production across multiple business units), and mature (scaled for >2 years with documented ROI). The 23% scaling + 6% high-performer numbers sit in the third and fourth buckets respectively. Most discussion of the McKinsey data collapses the four-state stratification to a single “adoption rate,” which obscures the actual signal.

The signal is the distribution shape, not any single percentage. Mapped against comparable datasets:

  • Stanford DEL 2026: 12% of deployments clear 300%+ ROI; 88% at or below break-even. Same bimodal shape as the McKinsey 6%-vs-the-rest high-performer split.
  • Gartner Q1 2026 I&O: 28% of AI I&O projects fully pay off. Broader scope than agentic specifically, longer tail.
  • Gartner June 2025 prediction: 40%+ agentic AI projects cancelled by end-2027. Forward-looking; consistent with the experimenting-not-scaling pattern.
  • CMU TheAgentCompany 2026: best-in-class frontier models complete 30.3% of enterprise agent tasks. Capability ceiling is real but does not vary between the 39% and the 23%.

Four datasets, four different measurement methodologies, one consistent pattern: a small high-performing tail, a much larger body that does not cross into it, and a capability ceiling that does not explain the gap. We covered this in the 88% piece; the McKinsey 23% view adds one specific contribution: the pilot-to-production transition.

Why the 39% experimenting do not scale

The pilot-to-production chasm in enterprise agentic AI has four recurring shapes:

Shape 1. No agent registry. Pilots are approved in one business unit by one or two people, often without a central record of what was approved, what it does, what data it touches, what tools it can call. When scaling is proposed, the governance, security, and compliance teams cannot evaluate the scale-up because the pilot was not documented to registry-level detail. The scale-up either stalls in review (most common) or proceeds under an approval that does not actually cover the expanded scope (next most common).
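
As a concrete illustration of registry-level detail, the sketch below shows one possible shape for a registry entry. The field names are illustrative assumptions, not a published schema; the point is that every answer the governance review needs should be a field in a queryable record, not an email thread.

```python
# Illustrative agent-registry entry (field names are assumptions, not a standard).
from dataclasses import dataclass, field
from datetime import date

@dataclass
class AgentRegistryEntry:
    agent_id: str                       # stable identifier, unique across the enterprise
    owner: str                          # accountable person or team
    business_unit: str                  # where the agent is currently approved to run
    purpose: str                        # one-line statement of what the agent does
    data_touched: list[str] = field(default_factory=list)    # data domains it reads or writes
    tools_callable: list[str] = field(default_factory=list)  # tools / APIs it may invoke
    approved_by: str = ""               # who signed off on the current scope
    approved_on: date | None = None     # when the current scope was approved

# The registry itself is just a queryable collection of these entries,
# which is what makes the "list every agent within one hour" question answerable.
registry: dict[str, AgentRegistryEntry] = {}

def register(entry: AgentRegistryEntry) -> None:
    registry[entry.agent_id] = entry
```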

Shape 2. No measured baseline. The pilot’s “success” was measured against vendor estimates or against a reconstructed baseline after the fact. When the CFO asks “what is the net benefit of scaling this from 50 users to 5,000?” the answer cannot be computed from the pilot’s data. The business case for scaling then gets written on the same vendor estimates that drove the pilot, and inherits all the measurement weaknesses the CFO’s business case piece catalogues.
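
For illustration, the sketch below lays out the arithmetic the CFO’s question requires. Every figure is a placeholder assumption; the structural point is that the first term, the measured pre-deployment baseline, cannot be reconstructed after the fact from vendor estimates.

```python
# Back-of-envelope net benefit of scaling from a pilot to 5,000 users.
# All numbers are illustrative assumptions, not measurements.
baseline_minutes_per_task = 22.0    # measured BEFORE the pilot; the term vendor estimates cannot supply
pilot_minutes_per_task = 15.0       # measured during the pilot
tasks_per_user_per_month = 120      # from pilot telemetry
loaded_cost_per_minute = 1.10       # fully loaded labour cost

users_at_scale = 5_000
assumed_adoption_rate = 0.35        # adjacent units adopt well below the pilot unit (see Shape 3)
monthly_platform_cost = 180_000     # licences, inference, support at scale

minutes_saved = (
    (baseline_minutes_per_task - pilot_minutes_per_task)
    * tasks_per_user_per_month
    * users_at_scale
    * assumed_adoption_rate
)
net_monthly_benefit = minutes_saved * loaded_cost_per_minute - monthly_platform_cost
print(f"Illustrative monthly net benefit at scale: {net_monthly_benefit:,.0f}")
```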

Shape 3. No change-management playbook for adjacent units. The pilot unit adopted because the pilot team championed it. Adjacent units are structurally different: different workflows, different adoption patterns, different reporting lines. A pilot that shows 80% adoption in the pilot unit often shows 30-40% adoption in the adjacent unit when the same playbook is applied. Without a differentiated change-management approach, scaling dilutes the pilot’s measured outcomes.

Shape 4. No agent-specific threat model. The pilot was approved against the application-security template with “AI” inserted. At pilot scale, the attack surface is small enough that security teams accept the gaps. At scaled production, the attack surface is several orders of magnitude larger and includes cross-agent delegation patterns the pilot never exercised. Scaling without an upgraded threat model is where EchoLeak-class zero-click exploits find their targets. MTTD-for-Agents (mean time to detect, scoped to agent incidents) instrumented at pilot is the cheap intervention here.
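
As a minimal sketch, assuming per-incident timestamps are recorded at pilot stage, MTTD reduces to a mean over detection delays. The incident data below is invented for illustration.

```python
# Mean time-to-detect (MTTD) from per-incident timestamps; data is illustrative.
from datetime import datetime, timedelta

incidents = [
    # (when the agent incident actually began, when it was detected)
    (datetime(2026, 3, 2, 9, 15), datetime(2026, 3, 2, 9, 42)),
    (datetime(2026, 3, 9, 14, 5), datetime(2026, 3, 10, 8, 30)),
    (datetime(2026, 3, 18, 11, 0), datetime(2026, 3, 18, 11, 20)),
]

def mttd(pairs: list[tuple[datetime, datetime]]) -> timedelta:
    deltas = [detected - began for began, detected in pairs]
    return sum(deltas, timedelta()) / len(deltas)

print(f"MTTD at pilot scale: {mttd(incidents)}")
```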

Each of the four shapes is tractable individually. The 39% experimenting mostly have three or four of them at once, which is why the scale-up does not happen.

The 6% AI-high-performer segment

McKinsey’s 6% AI-high-performer segment is the richest signal in the dataset. These are not the 23% scaling minus some cohort; they are the 23% scaling who can also defend the ROI number at the CFO level. The distinguishing behaviour is measurement discipline, not deployment velocity.

Cross-referenced against the Stanford DEL 12% cohort, the 6% appears to be roughly the subset of the 12% operating at enterprise scale. The 12% includes mid-sized firms and single-unit deployments; the 6% AI-high-performer segment is specifically firms running scaled agentic AI with more than 5% of EBIT measurably attributable. The difference is the audit-survivability bar.

What the 6% share, operationally:

  • Per-deployment GAUGE scores in the 70-84 band (“durable”), not the 50-69 band (“functional”) where the broader 23% scaling cohort sits.
  • Documented pre-deployment baselines on every scaled agent, not just the flagship.
  • Quarterly measurement reviews that include external validation, not just internal self-reporting.
  • Compliance evidence maps for the relevant frameworks (NIST AI RMF, ISO/IEC 42001, EU AI Act, and sector-specific obligations including NIS2 and GDPR Article 33 incident-reporting), maintained as living documents rather than once-a-year updates.
  • Vendor arrangements with tested exit paths. Not theoretical portability, actually tested.

The broader 23% scaling cohort shares some but not all of these. The distinguishing feature of the 6% is consistency across all five.

The four preconditions for scaling

An enterprise in the 39% experimenting cohort considering whether to scale a specific agent can answer four questions before deciding:

  1. Can we list every agent currently deployed in the enterprise, with owner and scope, within one hour? If no, scale-up of any specific agent will be gated by governance review for months while the inventory is constructed. Build the registry first.
  2. Do we have a measured pre-deployment baseline on the pilot, documented to a methodology? If no, the business case for scaling cannot be written to survive CFO scrutiny. Instrument the baseline on the current pilot, measure 4-6 weeks, then decide.
  3. Do we have a differentiated change-management plan for the adjacent business units? If no, scaling will produce uneven adoption that dilutes the reported ROI and damages the change-management narrative for the next scale-up. Write the plan before the expansion.
  4. Does the threat model cover cross-agent delegation at scale? If no, scaling raises the attack surface past what the pilot’s security review considered. Upgrade the threat model and instrument MTTD before the scale-up, not after.

The four questions are operational preconditions, not nice-to-haves. An enterprise answering “no” to three of the four is not ready to scale the pilot. An enterprise answering “no” to all four is not ready to run the pilot.
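
Expressed as a sketch, the four preconditions form a simple gate. The identifier names below mirror the four questions and are illustrative; the all-or-nothing rule is this piece’s argument, not an external standard.

```python
# Illustrative scaling gate over the four preconditions.
PRECONDITIONS = (
    "agent_registry_listable_within_one_hour",
    "measured_pre_deployment_baseline",
    "differentiated_change_management_plan",
    "threat_model_covers_cross_agent_delegation_at_scale",
)

def ready_to_scale(answers: dict[str, bool]) -> bool:
    unmet = [p for p in PRECONDITIONS if not answers.get(p, False)]
    if unmet:
        print("Not ready to scale. Unmet preconditions:", ", ".join(unmet))
        return False
    return True

# Example: registry and baseline exist, change plan and threat model do not.
ready_to_scale({
    "agent_registry_listable_within_one_hour": True,
    "measured_pre_deployment_baseline": True,
    "differentiated_change_management_plan": False,
    "threat_model_covers_cross_agent_delegation_at_scale": False,
})
```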

The GAUGE diagnostic operationalises all four questions into a 30-45 minute working-group session. Pilots that score below 40 are not ready to scale; pilots that score 55-69 are ready to scale with documented remediation on the lowest-scoring dimension; pilots that score 70+ are ready to scale on the condition that the measurement discipline extends to the expanded scope.

What the data implies for Q2-Q4 2026

The Stanford DEL, McKinsey, Gartner, and CMU datasets together produce a coherent picture: the distribution of enterprise agentic AI outcomes is bimodal, the bimodality is not explained by model capability, and the distinguishing variable is operational discipline around governance dimensions the GAUGE framework scores.

Two implications for how the picture moves through 2026:

First, the 23% scaling cohort will grow in the back half of the year as current experiments mature, but the 6% AI-high-performer cohort will grow more slowly. That is because scaling requires hitting the four preconditions above, while scaling with audit-survivable ROI requires hitting them plus the measurement discipline that separates the 6% from the broader 23%. The cohort gap stays wide.

Second, the forward-looking Gartner cancellation projection (40%+ by end-2027) implies that a large share of the 39% experimenting will end up in the “deployed and stopped” 38%, not in the 23% scaling. Those cancellations are not failures of vendor choice or model capability; they are failures of the scaling transition the McKinsey data describes.

The implication for any enterprise currently in the 39% experimenting cohort is specific. The question “should we scale this pilot?” is usually the wrong question. The right question is “have we met the four preconditions?” If the answer is no, the productive next step is to meet the preconditions on the pilot, then re-answer the scaling question with real data.

Holding-up note

The primary claim of this piece (that the McKinsey 23% scaling figure is an operational-preconditions outcome, not a technical-readiness outcome, and that the four preconditions are the tractable intervention for enterprises in the 39% experimenting cohort) is on a 60-day review cadence. Three kinds of evidence would move the verdict:

  • A subsequent McKinsey wave or equivalent large-sample dataset showing the 23% scaling and 6% high-performer figures rising sharply as the 39% experimenting converts. Would indicate the pilot-to-production chasm is closing without the operational-preconditions intervention. Would partially weaken this piece.
  • Cross-enterprise analyses showing that pilots not meeting the four preconditions scale successfully at the same rate as pilots that do. Would materially weaken the piece’s central argument.
  • Analyst frameworks (Gartner, Forrester) converging on a preconditions-style framing for the pilot-to-production transition. Would strengthen and partially obviate the piece’s framing contribution.

If any land, the Holding-up record for AM-030 captures what changed, dated. Original claim stays visible. Nothing is quietly removed.


Spotted an error? See corrections policy →

Part of the pillar

Enterprise AI cost and ROI

Verifying, tracking, and challenging the ROI claims vendors and analysts make about enterprise agentic AI. 10 other pieces in this pillar.
