Podcast · Episode 7 · 14:45

Why 88% of agentic AI deployments fail

Four datasets converge on the same bimodal shape across enterprise agentic AI. Stanford Digital Economy Lab's twelve-eighty-eight split. McKinsey State of AI 2025's twenty-three percent scaling cohort. MIT NANDA's ninety-five percent pilot-failure finding. The 67-versus-22 build-versus-buy spread. The variable separating the cohorts is operational discipline, not model selection. Six dimensions instrument the gap.

Claims walked in this episode
  • AM-029 · Why 88% of agentic AI deployments fail (Holding)
  • AM-132 · The bimodal ROI distribution in enterprise agentic AI: why the high-performing cohort is structurally distinct (Holding)
  • AM-128 · The MIT 95% GenAI-pilot-failure claim: what the State of AI in Business 2025 report actually measured (Holding)
  • AM-053 · HIPAA-compliant agentic AI: the 2026 healthcare playbook (Holding)

ABBY

This is Agent Mode AI. I'm Abby. Today we're walking the four claims that together describe the bimodal ROI distribution in enterprise agentic AI: AM-029, AM-132, AM-128, and AM-053. Four independent datasets converge on the same shape. The procurement question the convergence answers is what the high-performing cohort does that the struggling body does not.

AVERY

I'm Avery. Frame the four numbers.

ABBY

Stanford Digital Economy Lab's 2026 Enterprise AI Playbook tracks fifty-one enterprise agentic AI deployments at twelve to eighteen months post-production-deployment. Twelve percent clear three-hundred-percent-plus ROI. Eighty-eight percent operate at or below break-even. The distribution is not a Gaussian with a long tail. It is two distinguishable peaks separated by a discontinuity. Gartner Q1 2026 Infrastructure and Operations data reports twenty-eight percent of AI projects are fully paying off. McKinsey State of AI 2025 with one thousand nine hundred and ninety-three respondents reports twenty-three percent scaling agentic AI and seventeen percent EBIT-attribution at the twelve-month horizon. MIT NANDA's GenAI Divide reports ninety-five percent of analysed pilots delivered no measurable P&L impact.
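
As a code aside: a minimal sketch of the shape claim, with mixture parameters that are purely illustrative rather than Stanford's figures. A twelve-eighty-eight mixture produces two separated peaks with an empty gap between them; a single Gaussian with a long tail does not.

```python
# Illustrative only: hypothetical mixture parameters, not Stanford's data.
import numpy as np

rng = np.random.default_rng(0)

n = 51  # matches the Stanford cohort size; everything else is assumed
is_high = rng.random(n) < 0.12
roi = np.where(is_high,
               rng.normal(loc=3.5, scale=0.5, size=n),  # high cohort, ~350% ROI
               rng.normal(loc=0.0, scale=0.3, size=n))  # break-even body

# Two separated peaks with empty bins between them, not one peak
# thinning gradually into a tail.
counts, edges = np.histogram(roi, bins=12)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:5.1f} to {hi:5.1f}: {'#' * int(c)}")
```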

AVERY

Four numbers. Same shape.

ABBY

Same shape. Different cuts of the same underlying pattern. The bimodal distribution recurs across four independent datasets with different methodologies and different sample populations. That recurrence is the procurement-relevant signal. The variable separating the two cohorts is consistent across the four cuts.

AVERY

Start with AM-128. The MIT 95%.

ABBY

AM-128 is the claim that the MIT 95% statistic means ninety-five percent of GenAI pilots delivered no measurable P&L impact, based on roughly one hundred and fifty senior executives interviewed, three hundred and fifty employees surveyed, and three hundred publicly disclosed GenAI deployments analysed in August 2025. The 95% finding is the share of pilots that produced no measurable P&L impact. The slippage in 2026 procurement decks is between "no measurable P&L impact" and "the pilot failed."

AVERY

Two different things.

ABBY

Two different things. First, absence of measurement is not presence of failure. Most enterprise GenAI pilots in 2024-2025 did not have a documented pre-deployment P&L baseline. Without a baseline, "no measurable P&L impact" is the default finding regardless of whether the pilot moved the operational needle. Second, project-versus-deployment ambiguity. An enterprise running twenty projects with one delivering measurable impact would classify as five-percent-success on a project-weighted view and one-hundred-percent-success on an any-project-produced-value view. The 95%-fail framing implies the former; procurement questions usually want the latter.
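
A worked version of the ambiguity, with the twenty-project portfolio taken from the example above; everything else is arithmetic.

```python
# One hypothetical enterprise: twenty projects, one delivering measurable impact.
projects = [False] * 19 + [True]

project_weighted = sum(projects) / len(projects)  # 0.05 -> the "5% success" read
any_project_value = any(projects)                 # True -> the "100% success" read

print(f"project-weighted success rate: {project_weighted:.0%}")
print(f"any-project-produced-value view: {'success' if any_project_value else 'failure'}")
```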

AVERY

What's actually useful in the report.

ABBY

The build-versus-buy spread. Sixty-seven percent buy success versus roughly twenty-two percent build success. Per the report and the Fortune coverage, purchasing AI tools succeeded sixty-seven percent of the time, while internal builds panned out only one-third as often. A 2026 enterprise budgeting an internal-build approach is, on the report's data, accepting a three-times-worse outcome distribution than the buy approach. That is the most actionable finding in the report and the one most procurement teams ignore in favour of the headline.

AVERY

Move to AM-029. The Stanford piece.

ABBY

AM-029 walks the Stanford Digital Economy Lab's bimodal twelve-eighty-eight finding. Twelve percent of deployments clear three-hundred-percent-plus ROI at twelve to eighteen months. Eighty-eight percent operate at or below break-even. The Stanford 88% and the MIT 95% are not the same number. Stanford measured deployments with documented baselines that had reached twelve to eighteen months in production. MIT measured pilots with no required baseline at any maturity stage. But both numbers point at the same operational reality. Most enterprise GenAI work in 2024-2025 was not yet producing measurable enterprise-level value.

AVERY

The 12% cohort and the 5% MIT cohort.

ABBY

The Stanford twelve-percent bimodal-success cohort and the MIT five-percent integrated-systems-created-significant-value cohort are similar in shape and probably overlapping in identity. The thesis the publication tracks is that the operational discipline that produces the twelve percent is what produces the five percent in the MIT cut. The discipline is observable. It is not random. It is reproducible enough to instrument against.

AVERY

AM-132. The bimodal piece.

ABBY

AM-132 is the restored URL. It was AM-014, marked status-down on the Holding-up ledger after the original WordPress-era body used composite case studies that did not survive editorial scrutiny. The new body anchors on the four datasets we're walking and reframes the seventy-three-twenty-seven split the slug carries as a rounded aggregation of the four cuts rather than a precise statistical claim. The bimodal shape is the load-bearing finding. The exact percentage points vary by methodology.

AVERY

What the high-performing cohort does that the struggling body does not.

ABBY

Six dimensions. The publication tracks them under the GAUGE framework. First, governance maturity. The cohort has a named accountable owner for the deployment, a documented decision authority for tool-use changes, and an escalation path that is exercised at least quarterly. Deployments without a named owner default into the struggling cohort regardless of other strengths.

AVERY

Two.

ABBY

Threat model. The cohort treats the agent's tool graph as a security surface and runs an explicit red-team cycle against it. The struggling cohort has typically run, and passed, a generalised pen-test that does not exercise any of the four agent-specific surfaces. The OWASP Agentic AI Top 10 names the threats; the agent red-team is the discipline that tests whether the defences hold.
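
A minimal sketch of the tool graph as a testable surface, with hypothetical tool and capability names; nothing below is drawn from the OWASP list itself. The point is the coverage check: every edge the agent can traverse should have a red-team exercise on record, and a generalised pen-test leaves most of them untested.

```python
# Hypothetical agent tool graph: each tool maps to the capabilities it reaches.
tool_graph = {
    "email_reader":  ["inbox_read"],
    "crm_writer":    ["customer_records_write"],
    "code_executor": ["shell_exec", "network_egress"],
}

# Hypothetical red-team log: the (tool, capability) edges actually exercised.
red_team_log = {
    ("email_reader", "inbox_read"),
    ("crm_writer", "customer_records_write"),
}

# Any edge without an exercise on record is untested surface.
untested = [(tool, cap)
            for tool, caps in tool_graph.items()
            for cap in caps
            if (tool, cap) not in red_team_log]

for tool, cap in untested:
    print(f"untested surface: {tool} -> {cap}")
```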

AVERY

Three.

ABBY

ROI evidence. The cohort has a documented pre-deployment baseline before pilot day one. MIT NANDA's central finding is dominated by pilots that did not establish baselines. A deployment without a baseline does not produce a number to commit to regardless of how well the agent actually performs. The realistic ninety-day deliverable for a disciplined mid-market deployment is a working pilot pattern that scales into twelve-to-eighteen-month measurable ROI, not the three-hundred-percent-ROI vendor pitch.
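
A minimal sketch of why the baseline is load-bearing, with hypothetical figures throughout. Without a pre-deployment baseline, the attributable delta is undefined by construction, which is exactly the MIT default finding.

```python
# Hypothetical figures; only the structure is the point.
def attributable_roi(baseline, post, cost):
    """Measurable ROI against a documented baseline, or None without one."""
    if baseline is None:
        return None  # "no measurable P&L impact" is the default, not a verdict
    return (post - baseline - cost) / cost

print(attributable_roi(baseline=None,    post=1_400_000, cost=250_000))  # None
print(attributable_roi(baseline=900_000, post=1_400_000, cost=250_000))  # 1.0, i.e. 100%
```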

AVERY

Four.

ABBY

Change management. The cohort assumes the deployment changes the surrounding workflow rather than slotting into it. MIT NANDA's startup advantage finding is the diagnostic. Startups deploy AI into workflows still being designed. Enterprises deploy AI into workflows whose process structure was designed for non-AI tools. Enterprise procurement teams that scope a deployment without budgeting for workflow redesign are budgeting for the struggling cohort outcome.

AVERY

Five.

ABBY

Vendor lock-in posture. The cohort treats lock-in as an explicit procurement dimension (exit data portability, kill-switch operability, sub-processor expansion rights, model-deprecation rights) rather than as something to be discovered at renewal. The 60-question RFP at AM-026 operationalises the dimension as one of the GAUGE axes. The struggling cohort typically signs the vendor's MSA with light edits and discovers the lock-in surface at month eighteen.

AVERY

Six.

ABBY

Compliance posture. The cohort runs the deployment against the regulatory regime that actually applies (EU AI Act Articles 6, 11, 12, and 16 for high-risk deployments; 21 CFR Part 11 plus GxP plus Annex 11 for pharma; HIPAA plus state law for healthcare) and treats the audit substrate as a load-bearing part of the deployment architecture rather than a documentation afterthought.

AVERY

The cohort that scores well across the six is the cohort that delivers.

ABBY

The cohort that scores well across the six is the cohort that delivers on the business case. The cohort that scores poorly is the cohort that produces the eighty-eight percent failure rate the slug names. The reproducibility of the gap across four independent datasets is what makes the framework actionable.
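
A minimal scorecard sketch of the six dimensions just walked. The 0-to-4 scale per dimension is an assumption; the episode names only the dimensions themselves and the below-twelve threshold that comes up in the actions section.

```python
from dataclasses import dataclass, astuple

@dataclass
class GaugeScore:
    governance: int         # named owner, decision authority, exercised escalation path
    threat_model: int       # agent-specific red-team cycle against the tool graph
    roi_evidence: int       # documented pre-deployment baseline
    change_management: int  # workflow redesign budgeted, not assumed away
    lockin_posture: int     # exit portability, kill switch, sub-processor rights
    compliance: int         # run against the regime that actually applies

    def total(self) -> int:
        return sum(astuple(self))

# Hypothetical deployment: strong on governance and compliance, weak elsewhere.
print(GaugeScore(3, 1, 2, 1, 2, 3).total())  # 12
```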

AVERY

AM-053. The McKinsey 17%.

ABBY

AM-053 walks the McKinsey State of AI 2025 finding that seventeen percent of organisations report measurable EBIT impact attributable to AI at the twelve-month horizon. The slippage is between self-reported attribution and audited attribution. McKinsey's number is a self-reported success that gets read as audited success. MIT's 95% is a self-reported absence of measurement that gets read as audited failure. Both readings are wrong in the same way. They conflate survey results with operational measurements.

AVERY

The procurement implication of all four together.

ABBY

Three concrete actions for a 2026 enterprise. First, score every active deployment on GAUGE before the next review cycle. Deployments scoring below twelve are in the struggling cohort regardless of how the project status is currently reported internally. Second, treat low-scoring deployments as a portfolio kill-or-fix decision, not a continuation default. The realistic move-from-eighty-eight-to-twelve horizon for a single deployment under sustained discipline is twelve months. Deployments where the team cannot commit to the discipline within one to two quarters are better killed than rescued. Third, anchor the next procurement against the cohort, not the average. Vendor case studies typically describe the high-performing cohort. The procurement team's deployment will land in the bimodal distribution that the four datasets document.
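
A triage sketch of the first two actions, reusing the assumed scorecard totals and the episode's below-twelve threshold; deployment names and totals are hypothetical.

```python
def triage(totals, threshold=12):
    """Classify each deployment's GAUGE total as a kill-or-fix review or a continue."""
    return {name: ("kill-or-fix review" if total < threshold else "continue")
            for name, total in totals.items()}

portfolio = {"invoice-agent": 9, "triage-agent": 17, "kb-agent": 11}
for name, verdict in triage(portfolio).items():
    print(f"{name}: {verdict}")
```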

AVERY

Verdict update on each.

ABBY

AM-029 is Holding. Stanford's published methodology is intact. The deployment cohort is fixed at fifty-one. Cadence is sixty days. AM-132 is Holding. The four datasets are linked. The bimodal shape is documented across the four. Cadence is sixty days. AM-128 is Holding. The MIT NANDA report is published. The 95% framing and the build-versus-buy 67-22 spread are documented. Cadence is sixty days. AM-053 is Holding. The McKinsey 17% figure is published. The self-reported-versus-audited slippage is named. Cadence is ninety days.

AVERY

What would change any of them.

ABBY

For AM-029, a new Stanford DEL update with refreshed deployment cohort. For AM-132, vendor-side or analyst-side coordinated narrative shift on the bimodal framing. For AM-128, MIT NANDA published 2026 follow-up to the 2025 report. For AM-053, McKinsey State of AI 2026 mid-year refresh.

AVERY

Final word.

ABBY

The four claims, the four primary datasets, and the GAUGE diagnostic are linked at agentmodeai dot com slash holding. The Sunday brief ships every week with what moved on the ledger.

AVERY

Holding-up. See you next Sunday.
