What does the MIT-class research actually say?

The widely-cited 95-percent-of-generative-AI-pilots-fail framing comes from a class of enterprise-pilot studies in 2025 and 2026 published through MIT Sloan Management Review and the Boston Consulting Group's adoption-research streams. The studies tracked large-enterprise pilots, measured failure as inability to reach scaled production deployment with attributable P&L impact inside a 12-to-18-month window, and catalogued the failure modes (data readiness gaps, integration complexity, procurement and governance friction, change-management failure, talent-availability gaps). The methodology is sound for the cohort sampled. The cohort sampled is not the 1-to-50-person operator cohort, which has a different failure mode and a different success definition. The misread is reporting the 95% number as if it generalises, when the study population specifically did not include small firms.

What does pilot failure actually look like for a 5-person agency?

Five failure modes, none of which matches the enterprise list. (a) The tool was assigned to the wrong person on the team and went unused after week two, so the standing subscription is a sunk cost with no productivity gain. (b) The tool produced output the founder did not trust enough to ship without rewriting from scratch, so the time saving was negative once the rewrite was included. (c) The tool was integrated into a client deliverable, the client noticed something off (template language, tone shift, factual drift), and the agency had to absorb the rework cost and a small reputational hit. (d) The tool's monthly cost compounded across three or four similar subscriptions and the operator could not justify the line item against the value produced, so the subscription was cancelled at the next billing cycle with no replacement planned. (e) The tool worked but the operator never built it into a routine, so use was sporadic and the productivity gain never compounded.

Why is the success definition different?

Because the firm runs on cash flow rather than on transformation. A 5-person services agency does not measure pilot success by scaled production deployment with attributable P&L impact across a multi-business-unit rollout. The agency measures pilot success by whether the tool paid for its standing subscription inside the first 90 days, whether the deliverables produced with the tool passed the client's standard without rework, and whether the operator can describe the workflow improvement in plain English the following month. Those three measures are the operator's pilot test. They are not in the enterprise study. A pilot that succeeds on the operator's measures is a success at the operator's scale, regardless of what the 95% number says about the enterprise cohort.

What does this mean practically for an operator who reads the 95% number?

Three practical implications. First, the 95% number does not give the operator useful guidance on whether to start a pilot or whether to expand one in progress. It describes enterprise dynamics in a different operating environment. The operator should treat it as background context, not as a decision input. Second, the operator's pilot decision rests on three measurable signals (payback, deliverable quality, routine fit), each of which can be checked inside a 30-to-60-day window using the operator's existing operations, not on a 12-to-18-month evaluation cycle. Third, the operator who runs a pilot and finds it failing should diagnose against the small-firm failure list, not against the enterprise list. The remediation for sunk-cost-from-wrong-assignment is reassignment; the remediation for rewrite-cost-exceeds-savings is a different prompt strategy or a different tool; the remediation for client-rework-from-AI-deliverable is a tighter review step before delivery. The enterprise-style remediations (data-readiness program, governance committee, change-management consultant) do not fit the small-firm operating environment.

How does this connect to other operator pieces?

Three companions. [OPS-066 on break-even seat math](/operators/ai-break-even-headcount-smb/) covers the headcount-scaled adoption question for 5-to-50-person services firms; this piece extends that with the pilot-evaluation rule for the 1-to-50-person cohort more broadly. [OPS-061 on what to delegate to AI in a 1-to-5-person business](/operators/what-to-delegate-to-ai/) covers the task-selection question; this piece extends that with the after-the-fact evaluation question. [OPS-068 on solopreneur AI stack consolidation](/operators/solopreneur-ai-stack-consolidation/) covers the spend-side rationalisation; this piece extends that with the pilot-failure diagnosis for the use-cases that did not consolidate. The enterprise companion is [AM-146 on three questions for CIOs about agentic AI accuracy claims](/agentic-ai-accuracy-claims-task-baseline-methodology/), which addresses the analogous problem in the enterprise cohort: how to read vendor accuracy numbers without misclassifying.

How does this article track its own claim?

Claim OPS-069 in the Holding-up ledger, with a 45-day review on 1 Jul 2026. Trigger conditions for status changes: (1) MIT Sloan, BCG, or a comparable adoption-research stream publishing small-firm-specific (1-to-50-person) pilot-failure data inside the review window (would either confirm the structural argument that the enterprise framing misclassifies, or refine the small-firm failure-mode catalogue with new evidence); (2) a small-firm operator survey (Stripe Atlas, Brex, Ramp, or equivalent SMB-spend-and-adoption data publishers) producing pilot-outcome data at the cohort level that materially diverges from the five failure modes listed (would refine the failure-mode catalogue); (3) a major foundation-model provider publishing operator-cohort case studies with attributable revenue impact at the 1-to-50-person scale (would harden the success-definition argument by establishing what small-firm pilot success looks like in the public record); (4) a viral re-citation of the 95% number with new methodology that does include small firms (would update the source-document discussion in the piece). Full trigger list on the claim entry. Sibling: AM-146.

Small-firm AI pilots: why the MIT 95% failure number is misread

At a glance

Claim

The widely-cited 95-percent generative-AI-pilot-failure framing (MIT Sloan Management Review and Boston Consulting Group adoption-research streams, 2025-2026) is methodologically defensible for the enterprise cohort the research sampled (large firms with dedicated AI functions, 12-to-18-month evaluation windows, scaled-production-deployment success definition) and materially misrepresents small-firm pilot dynamics. The 1-to-50-person operator cohort has a different failure-mode catalogue (tool-assigned-to-wrong-person, rewrite-cost-exceeds-savings, client-rework-from-AI-deliverable, line-item-stack-compounded-and-cancelled, sporadic-use-no-routine) and a different success definition (90-day payback at actual hourly rate; deliverable quality reaching the client without disproportionate rework; routine fit documented for handover). A three-question Monday-morning small-firm pilot test (payback, deliverable quality, routine fit) checked at 30 days and 60 days is the operator's actual evaluation instrument and replaces the enterprise 12-to-18-month evaluation cycle that the 95-percent number is measured against.

Supporting figure

The MIT Sloan-class research that produced the widely-cited 95-percent-pilot-failure framing for generative AI tracked enterprise pilots specifically, not 1-to-50-person operator deployments, and the failure modes the research catalogued do not map to the small-firm operating environment.

Date

17 May 2026

Verdict

Holding (OPS-069)

Next review

1 Jul 2026 (+13d)

A widely-cited research framing in 2025 and 2026 places the failure rate of generative AI pilots at around 95 percent, sourced from MIT Sloan Management Review and Boston Consulting Group adoption-research streams (MIT Sloan Management Review, AI Adoption research stream; Boston Consulting Group, AI Adoption research). The number is methodologically defensible for the cohort the research sampled. The cohort is large enterprises with dedicated AI functions, multi-quarter procurement cycles, and success criteria built around scaled production deployment with attributable P&L impact across multiple business units.

The 95 percent number has subsequently been used as a small-firm pilot signal by analysts, consultants, vendors, and the trade press. That use is a misread. The small-firm operating environment does not have the conditions the research measured against, and the small-firm pilot-failure mode is different in kind from the enterprise pilot-failure mode the research catalogued. Reading the enterprise metric as a small-firm signal misclassifies what actually happens at 1-to-50-person scale and produces operator-side decisions that do not fit the operating reality.

This piece does three things. First, it lays out what the enterprise study population actually tested and why the failure modes the research found do not map to small firms. Second, it catalogues the five small-firm failure modes that the operator-cohort experience actually produces. Third, it provides the three-question Monday-morning small-firm pilot test (full procedure in the FAQ above and the how-to section), which is the operator’s actual evaluation instrument.

What the enterprise study population tested

The MIT-and-BCG-class research sampled large enterprises, defined as firms with dedicated AI functions and multi-business-unit operating structures. The failure measure is inability to reach scaled production deployment with attributable P&L impact inside a 12-to-18-month window. The failure modes the research catalogued are five categories: data-readiness gaps (the firm’s data infrastructure cannot support the model’s grounding requirements), integration complexity (the AI capability cannot be wired into the firm’s existing operational systems at the scale required), procurement and governance friction (the firm’s risk, legal, and compliance functions cannot clear the deployment on the original timeline), change-management failure (the workforce cannot or does not adopt the new workflow at the rate required for the business case to hold), and talent-availability gaps (the firm cannot hire or develop the in-house capability to maintain the deployment).

Each of these failure modes is a real phenomenon at enterprise scale. Each is documented in the cited research. The 95-percent failure rate against this measure is a defensible empirical finding for the cohort sampled.

The cohort did not include 1-to-50-person operators. Small firms do not have dedicated AI functions, multi-business-unit structures, 12-to-18-month evaluation windows, procurement-and-governance committees, or data-readiness programs to fail. The five failure modes the research found are not the failure modes a small firm experiences, because the small firm is not running pilots under the conditions the failure modes presuppose. The 95 percent number does not generalise downward to the small-firm cohort; the methodology was not designed to, and the study population was not sampled for it.

The five small-firm pilot failure modes

The operator-cohort experience in 2025 and 2026, as observable across the OPS register’s prior pieces and across the small-business adoption commentary on AI tooling, produces a different failure-mode catalogue.

Mode 1: tool-assigned-to-wrong-person. The operator buys an AI seat or subscription. The seat is assigned to a team member whose work is procedural (operations coordinator, field technician, junior administrator) rather than text-and-judgment-heavy. The team member uses the tool once or twice in week one and then never opens it again. The subscription continues. The operator notices at the quarterly review that the seat has been unused for ten weeks and concludes “AI did not work for us”, when the actual failure was the assignment, not the technology. The remediation is reassignment, not abandonment.

Mode 2: rewrite-cost-exceeds-savings. The operator uses the AI to draft a deliverable. The output is 70 percent of the way there. The operator rewrites the remaining 30 percent. The total time spent (AI generation plus rewrite) is greater than the time the operator would have spent drafting from scratch, because the operator is a fast drafter and the AI’s first pass requires more correction than a blank page would. The net time saving is zero or negative. The operator concludes the tool is not worth the subscription, which is correct for the operator’s specific style, but the diagnosis that the AI failed is only partial. The fuller diagnosis is that the operator’s drafting speed is above average, so the AI’s value lies in a different use-case (research, summarisation, repetitive structured output) rather than in first-draft generation. The remediation is task reselection, not abandonment.
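The Mode 2 arithmetic can be sketched in a few lines. The function name and every minute figure below are illustrative assumptions, not measurements from this piece; the only rule taken from the text is that AI generation time plus rewrite time is compared against the time to draft from scratch.

```python
def net_minutes_saved(scratch_minutes, ai_generation_minutes, rewrite_minutes):
    """Time saved per deliverable; zero or negative means the AI route did not help."""
    return scratch_minutes - (ai_generation_minutes + rewrite_minutes)


# Hypothetical fast drafter: 40 minutes from a blank page,
# versus 10 minutes of prompting plus 35 minutes of correction.
first_draft = net_minutes_saved(40, 10, 35)

# Hypothetical summarisation task: 60 minutes by hand,
# versus 5 minutes of prompting plus 15 minutes of checking.
summarisation = net_minutes_saved(60, 5, 15)

print(first_draft)    # -5: Mode 2, so reselect the task
print(summarisation)  # 40: a better fit for the same tool
```

Running the same comparison per task type, rather than per tool, is one way to separate a tool that fails everywhere from a tool pointed at the wrong task.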

Mode 3: client-rework-from-AI-deliverable. The operator integrates AI-generated content into a client deliverable. The client notices something off (template-shaped language, tone shift from prior communications, factual drift in a domain-specific reference, an em-dash density signature, a hallucinated citation). The agency absorbs the rework cost and a small reputational hit. The operator concludes “we cannot use AI for client work”, which is overcorrected; the actual diagnosis is that the review-and-revision step before delivery was insufficient for AI-generated material, and a tighter pre-delivery editing pass would catch the failure mode before the client sees it. The remediation is process, not abandonment.

Mode 4: line-item-stack-compounded-and-cancelled. The operator accumulates three or four AI-related subscriptions across writing, scheduling, customer service, and design. The aggregate monthly cost reaches the $150 to $200 range typical of the mid-2026 solopreneur stack documented in the Godberry Studios Zoom Solopreneur 50 teardown. The operator runs the quarterly cost review and cannot justify the aggregate against the productivity gain, because each line item produced only a modest gain and the gains did not compound. The operator cancels the whole stack. The aggregate failure is real (the spend did not pay for itself), but the per-tool failure may not be: one or two of the tools may have been earning their cost, and the blanket cancellation removes them along with the under-performers. The remediation is the test-before-cancel script in OPS-068, not blanket abandonment.
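A per-tool check of the kind Mode 4 calls for might look like the sketch below. It is not the OPS-068 script, which is not reproduced here; the `Subscription` fields, the `split_stack` helper, the prices, the hours, and the $50 hourly rate are all hypothetical values chosen so that the stack fails in aggregate while two tools still pay for themselves.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    name: str
    monthly_cost: float           # subscription price per month
    hours_saved_per_month: float  # operator's own estimate


def split_stack(stack, hourly_rate):
    """Compare each tool's monthly value to its own cost; return (keep, cancel)."""
    keep, cancel = [], []
    for sub in stack:
        value = sub.hours_saved_per_month * hourly_rate
        (keep if value >= sub.monthly_cost else cancel).append(sub.name)
    return keep, cancel


# Hypothetical four-tool stack costing $170 a month.
stack = [
    Subscription("writing", 20.0, 1.0),
    Subscription("scheduling", 50.0, 0.5),
    Subscription("customer service", 60.0, 0.5),
    Subscription("design", 40.0, 1.0),
]
rate = 50.0  # assumed operator hourly rate

total_cost = sum(s.monthly_cost for s in stack)
total_value = sum(s.hours_saved_per_month * rate for s in stack)
keep, cancel = split_stack(stack, rate)

print(total_value < total_cost)  # True: the aggregate does not pay for itself
print(keep)                      # ['writing', 'design']
print(cancel)                    # ['scheduling', 'customer service']
```

The design choice is simply to judge each line item against its own cost before judging the total, so that a failing aggregate does not take the earning tools down with it.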

Mode 5: sporadic-use-no-routine. The operator uses the AI tool intermittently. Sometimes it produces a useful output; sometimes the operator forgets it exists for two weeks. The productivity gain never compounds because the use never settles into a routine. At the quarterly review the operator concludes “we tried AI and it did not transform anything”, which is an accurate description of the operator’s experience and a partial diagnosis: the failure was the workflow construction, not the technology. The remediation is to build a documented routine that another team member could follow, which is the routine-fit measure in the three-question test.

The five modes account for the operator-cohort experience that the enterprise-pilot-failure literature does not describe. None of them is in the MIT-and-BCG catalogue. All of them are routine in the small-firm cohort.

What success looks like at the small-firm scale

The success definition is different from the enterprise definition, and the difference is the load-bearing fact.

For a 5-person services agency, success is not scaled production deployment with attributable P&L impact across multiple business units. It is three questions. Did the tool pay for its standing subscription inside the first 90 days, measured by the time and cost it saved at the operator’s actual hourly rate? Did the tool’s output reach the client (or the public deliverable) without rework that exceeds the time saved? Can the operator describe the workflow improvement in plain English to another team member the following month and have that team member adopt the workflow?

Three measures, each checkable inside a 30-to-60-day window using the operator’s existing operations. No 12-to-18-month evaluation cycle. No steering committee. No data-readiness program. The three-question test in the how-to section above is the operationalisation of these measures.
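As a rough illustration, the three measures can be written down as a checklist that is filled in at day 30 and again at day 60. This is a minimal sketch under stated assumptions: the `PilotCheck` fields, the thresholds (value saved at least equal to spend to date; rework hours no greater than hours saved), and the sample figures are hypothetical operationalisations, not a procedure defined in this piece.

```python
from dataclasses import dataclass


@dataclass
class PilotCheck:
    day: int                  # 30 or 60
    spend_to_date: float      # subscription cost paid so far
    hours_saved: float        # operator's estimate so far
    hourly_rate: float        # operator's actual hourly rate
    rework_hours: float       # time spent fixing AI output for delivery
    routine_documented: bool  # could a team member follow the written workflow?


def three_question_test(check):
    """Return one pass/fail answer per measure (assumed thresholds)."""
    return {
        "payback": check.hours_saved * check.hourly_rate >= check.spend_to_date,
        "deliverable_quality": check.rework_hours <= check.hours_saved,
        "routine_fit": check.routine_documented,
    }


# Hypothetical pilot: a $30-a-month tool, operator billing $60 an hour.
day_30 = three_question_test(PilotCheck(30, 30.0, 2.0, 60.0, 0.5, False))
day_60 = three_question_test(PilotCheck(60, 60.0, 5.0, 60.0, 1.0, True))

print(day_30)                # routine_fit is False at day 30
print(all(day_30.values()))  # False: not yet a pass
print(all(day_60.values()))  # True: passes on all three measures
```

Returning the three answers separately, rather than a single score, keeps the failing measure visible so it can be matched to a specific failure mode.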

A pilot that passes the three-question test at 30 days and 60 days is a success at the operator’s scale. The 95-percent enterprise number does not change that. A pilot that fails the test should be diagnosed against the five small-firm failure modes, and the appropriate remediation (reassignment, task reselection, tighter pre-delivery review, per-tool subscription rationalisation, routine construction) should be tried before the pilot is abandoned.
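The diagnosis step amounts to a lookup from failure mode to remediation. The pairs below are taken from this piece; the `REMEDIATIONS` table and the `remediation_for` helper are illustrative names only.

```python
# Failure mode -> remediation to try before abandoning the pilot.
REMEDIATIONS = {
    "tool-assigned-to-wrong-person": "reassignment",
    "rewrite-cost-exceeds-savings": "task reselection",
    "client-rework-from-AI-deliverable": "tighter pre-delivery review",
    "line-item-stack-compounded-and-cancelled": "per-tool subscription rationalisation",
    "sporadic-use-no-routine": "routine construction",
}


def remediation_for(mode):
    """Look up the remediation for one of the five small-firm failure modes."""
    if mode not in REMEDIATIONS:
        raise ValueError(f"not a small-firm failure mode: {mode!r}")
    return REMEDIATIONS[mode]


print(remediation_for("sporadic-use-no-routine"))  # routine construction
```

An enterprise-list diagnosis such as a data-readiness gap deliberately raises an error here, since those remediations do not fit the small-firm operating environment.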

What the operator should do with the 95-percent number when it appears

When a vendor pitch, an analyst commentary, or a trade-press article cites the 95-percent failure number to the operator, the operator should treat it as background context, not as a decision input. The vendor pitch typically uses the number to argue that the operator needs a more sophisticated solution to beat the failure rate; the analyst commentary typically uses it to argue that adoption is overrated; the trade-press article typically uses it to argue caution. All three uses are reasonable framings for the enterprise audience and unreasonable framings for the small-firm audience. The operator who treats the number as a small-firm signal will either pull back too far on adoption (missing genuine productivity gains the three-question test would have surfaced) or invest in enterprise-style remediations (governance, change management, data programs) that do not fit the operating environment.

The right operator-side response is to ask three things. Does the source population include 1-to-50-person firms? If it does, what are the cohort-specific findings? If it does not, what cohort-specific evidence on small-firm AI adoption is available from the publishers that actually hold small-firm data (Stripe Atlas, Brex, Ramp, and the small-business surveys run for trade-press operator audiences)? The cohort-specific evidence is unlikely to reproduce the 95-percent number, because both the failure modes and the success definition are different.

For the seat-economics companion to this piece, see OPS-066 on break-even seat math for 5-to-40-person services firms. For the spend-side companion, see OPS-068 on solopreneur AI stack consolidation. For the task-selection companion at the 1-to-5-person scale, see OPS-061 on what to delegate to AI. For the contract-side companion on AI work delivery to clients, see OPS-065 on AI work delivery contract addenda. For the security-side companion on AI-IDE supply-chain discipline, see OPS-067 on the May 2026 Windsurf and MCP advisories.

For the enterprise analogue on accuracy-claim misreads, see AM-146 on three questions for CIOs about agentic AI accuracy claims. For the enterprise pilot-to-production gap that the 95-percent number actually addresses, see AM-140 on the agentic AI pilot-to-production gap.

OPS-069 · holding since 17 May 2026 · Sibling: AM-146 · Register: Reporting
