The MIT 95% GenAI-pilot-failure claim: what the State of AI in Business 2025 report actually measured
MIT NANDA’s GenAI Divide report (August 2025) is the source of 2026’s most-cited bear-case statistic: 95% of generative AI pilots fail. The number is a self-reported survey result with a specific methodology, and the way it gets read in procurement decks materially overstates what the underlying data supports. The structural findings underneath the headline are more useful than the headline itself.
Bottom line: the MIT 95% statistic means 95% of GenAI pilots delivered no measurable P&L impact, per MIT NANDA’s GenAI Divide: State of AI in Business 2025 (August 2025), based on 150 executive interviews, 350 employee surveys, and analysis of 300 AI projects. “No measurable P&L impact” is a different finding from “the pilot failed”. The structurally interesting findings underneath the headline are the build-vs-buy 67%-vs-roughly-22% spread and the 40%-licensed / 90%-shadow-using gap, both of which point at the operational pattern the headline number obscures.
The 2026 enterprise agentic AI procurement conversation has two dominant statistics anchoring opposite ends of the optimism spectrum. McKinsey’s 17% EBIT-attribution figure sits at the bull end. MIT NANDA’s 95% pilot-failure figure sits at the bear end. Both numbers are real survey results from competent research organisations. Both numbers are read in boardrooms and analyst notes in ways the underlying methodology does not support. This piece is about the second one.
The MIT State of AI in Business 2025 report (Project NANDA, MIT Media Lab) was published in August 2025 and immediately produced a Fortune cover story, follow-up commentary, and a measurable equity-market reaction. The headline number, framed in vendor decks and CIO emails as “MIT says 95% of AI pilots fail”, has since travelled into 2026 procurement materials at scale. It now competes with McKinsey’s 17% as the most-cited single statistic in enterprise AI commentary.
The way the 95% is read in 2026 procurement is materially different from what the report establishes. This piece walks the methodology, the slippage between “no measurable P&L impact” and “the pilot failed”, the structural findings underneath the headline that are more useful than the headline itself, and how a procurement team should actually use the report.
What the report measured
NANDA’s methodology is documented in the report and summarised consistently across the Fortune coverage and the State of AI in Business landing page. The study comprises three data layers.
The first is 150 executive interviews. These are senior-leadership conversations, qualitative, structured around the executive’s account of the organisation’s GenAI deployment status, the use cases attempted, the outcomes observed, and the obstacles encountered. The interview format produces narrative data, not measurement data; the executive’s read of “did this pilot work” is the unit of analysis, not the pilot’s actual financial performance.
The second is 350 employee surveys. These are quantitative responses from individual contributors and middle managers about their personal use of GenAI tools at work, the tools they use officially versus unofficially, the tasks they apply them to, and their perception of the value produced. The employee survey is the source of the shadow-AI finding (40% of companies have official LLM subscriptions; 90% of workers report daily personal-AI-tool use).
The third is analysis of 300 AI projects. These are project-level case studies the research team examined for outcome patterns. The 300 projects are not a random sample of all enterprise AI deployments; they are a curated set the NANDA team had access to either directly or through published case material.
The headline 95% applies to the third layer. Of the 300 projects analysed, 95% delivered “no measurable P&L impact”. Five percent of integrated systems “created significant value”. The 95% is a project-level finding, not a company-level finding, and the unit being measured is “measurable P&L impact” specifically, not “the pilot succeeded technically” or “the users found it useful” or “the workflow improved”.
The slippage between “no measurable P&L impact” and “the pilot failed”
The structural reading the 95% number gets in 2026 procurement decks is “95% of AI projects fail”. That reading is incorrect for two specific reasons.
The first is the difference between absence of measurement and presence of failure. Most enterprise GenAI pilots in 2024-2025 did not have a documented pre-deployment P&L baseline against which to measure post-deployment impact. Without a baseline, “no measurable P&L impact” is the default finding regardless of whether the pilot actually moved the operational needle. A pilot can be objectively useful, observably faster, and qualitatively appreciated by users while still producing no data point on a P&L line, because nobody set up the measurement infrastructure to detect it.
The Fortune follow-up coverage made this explicit. The 21 August 2025 piece characterised the underlying issue as a measurement-and-deployment-design problem rather than an AI-capability problem. Verbatim from the coverage: “people and organizations simply did not understand how to use the AI tools properly or how to design workflows” that could harness AI benefits. The 95% is a finding about deployment maturity, not about model maturity.
The second is the project-versus-deployment ambiguity. A “project” in the NANDA sense is the granular thing teams build, often a single use case at a single business unit. An enterprise running 20 GenAI projects in 2025, of which one delivered measurable P&L impact and 19 did not, would be classified as a 5%-success enterprise on a project-weighted view and as a 100%-success enterprise on an “any project produced value” view, as the sketch below makes concrete. The 95%-fail framing implies the former; the procurement question is usually closer to the latter. Whether the enterprise’s portfolio produced any measurable value is not what the 95% answers.
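A minimal sketch in Python of the two readings (the 20-project portfolio is a hypothetical, not NANDA data):

```python
# Hypothetical portfolio: 20 GenAI projects at one enterprise,
# exactly one with measurable P&L impact (illustrative numbers only).
projects = [True] + [False] * 19  # True = measurable P&L impact

# Project-weighted reading -- the one the 95% headline implies:
project_success_rate = sum(projects) / len(projects)

# Portfolio reading -- the one the procurement question usually asks:
portfolio_produced_value = any(projects)

print(f"project-weighted success rate: {project_success_rate:.0%}")  # 5%
print(f"any project produced value:    {portfolio_produced_value}")  # True
```

Both readings are computed from the same underlying outcomes; which one a citation implies is a framing choice, not a data difference.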
The slippage between these two readings is structurally identical to the McKinsey 17% EBIT-attribution gap (claim AM-053-adjacent), but in the opposite direction. McKinsey’s number is a self-reported success that gets read as audited success. MIT’s number is a self-reported absence of measurement that gets read as audited failure. Both readings are wrong in the same way: they conflate survey results with operational measurements.
The findings underneath the headline that are more useful
The 95% travels because it is a single dramatic number. The report’s other findings are less dramatic and more useful for procurement teams running the actual decision.
Build versus buy: 67% versus roughly 22%. Per the report and the Fortune coverage: “Purchasing AI tools succeeded 67% of the time, while internal builds panned out only one-third as often.” The “one-third as often” produces a build-success rate of approximately 22%. This is the single most actionable finding in the report and the one most procurement teams ignore in favour of the headline. A 2026 enterprise budgeting an internal-build approach is, on the report’s data, accepting a 3x worse outcome distribution than the buy approach. The report’s framing is that this gap is operational rather than technical; the buy approach inherits the vendor’s experience with deployment patterns the customer’s internal team is reinventing from scratch.
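The arithmetic behind the roughly-22% figure, made explicit (the report quotes the 67% and the one-third ratio; the build rate is derived from them, not quoted directly):

```python
buy_success = 0.67                # "Purchasing AI tools succeeded 67% of the time"
build_success = buy_success / 3   # internal builds pan out "one-third as often"

print(f"implied build success rate: {build_success:.1%}")  # ~22.3%
```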
Shadow AI: 40% licensed, 90% using. Of companies surveyed, 40% had official LLM subscriptions. Of employees surveyed, 90% reported daily personal-AI-tool use for work tasks. The gap between the two numbers is the shadow-AI surface that OPS-041’s platform-algorithm-penalties piece and the 1-page AI policy template both address from different angles. The procurement implication is that the “we have not deployed GenAI” enterprise is in fact running GenAI through unauthorised channels at near-saturation; the policy and security work has to assume the deployment is already happening and govern accordingly.
The learning gap. The report identifies that current GenAI tools “remain static” and “make the same errors repeatedly”, whereas users expect tools that learn from interaction the way human assistants do. This finding is structurally adjacent to the agent evaluation frameworks comparison (claim AM-122) and the eval-as-deployment-precondition argument it makes. A pilot without an eval substrate cannot learn from its mistakes systematically; that absence is the operational reason for the static-error pattern the report names.
Marketing-vs-back-end deployment misdirection. The report finds enterprises “deploying AI in marketing and sales, when the tools might have a much bigger impact if used to take costs out of back-end processes”. This is a procurement-prioritisation finding rather than a capability finding. Marketing-and-sales deployments are easier to scope, but their P&L impact arrives as revenue lift, which is hard to attribute. Back-end deployments produce their P&L impact as cost reduction, which is easier to audit. Procurement teams optimising for measurable impact would weight back-end use cases more heavily than the marketing-and-sales-first pattern most 2025 deployments followed.
The startup advantage. The report finds startups “much more likely to find genAI can deliver ROI” than entrenched-process enterprises. The structural reason is that startups deploy GenAI into workflows that are still being designed, where the AI’s strengths shape the workflow, while enterprises deploy GenAI into existing workflows whose process structure was designed for a non-AI tool surface. The procurement implication is not that enterprises should mimic startups; it is that an enterprise scoping a GenAI deployment should expect to redesign the surrounding workflow rather than slot the tool into the existing one.
The methodological caveats the report itself names
NANDA is an MIT initiative organised around AI startup research and the broader AI-adoption ecosystem. The Fortune coverage flagged the structural incentive: NANDA’s institutional position means the organisation “might have an incentive to suggest that current AI methods aren’t working”, because the not-working framing supports the case for the newer wave of AI tooling NANDA’s research focuses on. The Fortune author found no evidence of deliberate skewing in the methodology, and on close reading of the report the methodological choices are defensible. The incentive caveat is nonetheless worth surfacing in procurement materials that cite the 95%; the procurement standard that asks “what is the vendor’s incentive in this stat” applies as much to research organisations on the bear side as to consultancies on the bull side.
The 300-project sample is the second methodological caveat. The projects were not randomly selected from all 2024-2025 GenAI deployments; they were the projects NANDA had access to either directly or through published case material. The selection bias direction is uncertain: it could skew toward documented-failure cases (which are more likely to be written up than quiet successes) or toward documented-success cases (which are more likely to be published by vendors and customers proud of them). Either skew would produce a 95% number that does not generalise cleanly to the all-deployments base rate. The report does not claim the 300 are representative; the citation chain that produces the “95% of all enterprise AI projects fail” reading does claim that, incorrectly.
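One way to see how much the unknown skew matters is to invert it (a toy model; the skew factors are illustrative assumptions, not anything the report estimates). If failed projects are some factor more or less likely than successes to land in a researcher-accessible sample, the same observed 95% is consistent with quite different true base rates:

```python
def true_base_rate(observed: float, skew: float) -> float:
    """Invert an over/under-sampled observation.

    `skew` is how much more likely a failed project is to enter the
    sample than a successful one (>1 over-samples failures, <1
    over-samples successes). Returns the true failure base rate that
    would produce the observed failure share under that skew.
    """
    return observed / (skew * (1 - observed) + observed)

for skew in (0.5, 1.0, 2.0, 5.0):
    print(f"skew {skew:.1f}x -> true failure base rate {true_base_rate(0.95, skew):.0%}")
# skew 0.5x -> 97%  (successes over-sampled: true rate even higher)
# skew 1.0x -> 95%  (representative sample)
# skew 2.0x -> 90%
# skew 5.0x -> 79%  (failures over-sampled: true rate lower)
```

Without knowing the skew, the observed 95% brackets a range rather than pinning a point; that is the precise sense in which it does not generalise cleanly to the all-deployments base rate.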
How a procurement team should actually use the report
Three operational uses of the report survive the methodology critique.
Use 1: validate the build-or-buy decision. The 67%-vs-22% spread is the cleanest finding in the report and the one most directly relevant to procurement. A team scoping an internal build should explicitly underwrite the 22% success-rate prior; if the team cannot defend why their build will outperform the documented base rate by a factor of 3 or more, the buy approach is the procurement-defensible choice on the report’s evidence. The 60-question agentic AI RFP (claim AM-026) is the procurement instrument that operationalises this; the build-versus-buy question is dimension 5 (vendor lock-in) under the GAUGE framework.
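A minimal way to underwrite that prior explicitly (the cost and value figures below are placeholders for a team’s own scoping numbers; only the two base rates come from the report):

```python
# Illustrative scoping inputs -- replace with your own numbers.
build_cost = 1_200_000        # total cost of the internal build
buy_cost = 400_000            # total cost of the vendor deployment
value_if_success = 3_000_000  # value of a working deployment

buy_success = 0.67    # report's buy base rate
build_success = 0.22  # report's implied build base rate

ev_buy = buy_success * value_if_success - buy_cost
ev_build = build_success * value_if_success - build_cost

# Break-even: the build success probability at which building matches buying.
breakeven = (ev_buy + build_cost) / value_if_success

print(f"EV(buy)   = {ev_buy:,.0f}")    # 1,610,000
print(f"EV(build) = {ev_build:,.0f}")  # -540,000
print(f"build must clear {breakeven:.0%} success to match buy")  # 94%
```

On these illustrative inputs the build case has to defend a success probability more than four times the documented base rate before it breaks even with buying; that is the underwriting conversation the 67%-vs-22% spread forces.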
Use 2: scope the shadow-AI policy work. The 40%-licensed / 90%-using gap is the empirical anchor for the “AI policy is not optional” argument. A 2026 enterprise without an AI-use policy is governing zero percent of the 90% of employees using AI tools daily. The 1-page AI policy template and the 60-question AI agent risk register (claim AM-051) are the operational artefacts; the report’s data is the case-for-action.
Use 3: prioritise back-end cost-reduction over marketing-and-sales revenue-lift use cases. The deployment-misdirection finding produces a clear procurement priority order. A team scoping their first or second GenAI deployment should weight back-end use cases (procurement-to-pay automation, finance-close acceleration, regulatory-reporting drafting, IT-helpdesk deflection) more heavily than the marketing-and-sales-first pattern the report names as common but underperforming. The AI in IT operations reality check (claim AM-121) walks the L1-deflection cohort specifically; the same pattern applies to the other back-end candidates.
The 95% headline is what the report is cited for. The findings underneath are what the report is useful for. Procurement teams citing the headline without engaging the underneath are using the report the same way they use the McKinsey 17% — as a rhetorical anchor for a position they have already adopted, rather than as evidence to update against. The corrective is the same in both cases: the survey result is not the audit; the audit is what your procurement decision should rest on.
The procurement question the report does not answer
The MIT report does not answer whether the 95% will be 95% in 2026, 2027, or 2028. The methodology measured the 2024-2025 deployment cohort. The structural findings (build-vs-buy, learning-gap, shadow-AI, deployment-misdirection) are mechanism-level and likely persist on a multi-year horizon; the headline 95% is a snapshot that should change as deployment maturity, eval discipline, and procurement-grade tooling improve.
The Stanford Digital Economy Lab’s 2026 Enterprise AI Playbook (claim AM-029) finds a 12/88 bimodal ROI distribution at 12-18 months post-production-deployment. The Stanford 88% and the MIT 95% are not the same number: Stanford measured deployments with documented baselines that had reached 12-18 months in production, while MIT measured pilots with no required baseline at any maturity stage. But both numbers point at the same operational reality: most enterprise GenAI work in 2024-2025 was not yet producing measurable enterprise-level value. The Stanford 12% bimodal-success cohort and the MIT 5% integrated-systems-created-significant-value cohort are similar in shape and probably overlapping in identity.
The procurement question for 2026 is not “is the 95% real” but “what does the 5% do that the 95% does not”. The MIT report names the build-vs-buy spread, the shadow-AI gap, and the deployment-misdirection pattern as three of the answers. The Stanford report names the GAUGE governance dimensions as a fourth. The eval-platform decision, the observability decision, the audit-substrate decision, and the OWASP-Top-10 control decision are operational extensions of the same. A 2026 procurement team that uses these as a checklist rather than the 95% as a slogan is operating on the report’s actual evidence rather than the citation chain’s headline.
The 95% is the most-cited statistic. The build-versus-buy 67%-vs-22% is the most-actionable. The procurement teams that move the deployment outcome are the ones that read past the first to the second.