Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter. How this works →
AM-061 · published 27 Jul 2025 · revised 27 Apr 2026 · 9 min read · in Business Case & ROI

Production agentic AI cost: the layered optimisation playbook for enterprise CFOs

Production agentic-AI bills routinely run several times the POC forecast. The mechanism is structural — token economics, orchestration overhead, context drift, observability — and so is the optimisation.

Partial · reviewed 27 Apr 2026 · next review in 59 days
Rewrite in progress

This piece predates the current editorial standard and is in the rewrite queue. The body below is retained for link integrity while the new analysis is prepared. When the rewrite ships, the claim (AM-061) moves from Partial to Holding and the update is dated in the correction log.

Most enterprise agentic-AI programmes land their first production cost surprise somewhere between month four and month nine. The pilot ran inside a signed-off budget envelope. The general-availability rollout, with real adoption, real query volume, and real context loads, runs at a multiple of that envelope. Andreessen Horowitz’s analysis of the economics of large language model inference and the same firm’s 16 changes to the way enterprises build and buy generative AI document the pattern: token costs collapsing on a per-million-token basis while aggregate enterprise spend keeps rising, because the workload is expanding faster than unit economics are improving.

The 2026 read is that the cost-overrun story is not a vendor-pricing story. Per-token prices have fallen substantially across providers since 2023 and continue to fall. The story is a workload-architecture story. Production agentic systems consume tokens, orchestrate calls, and carry observability overhead differently from the pilot. CFOs evaluating agentic-AI business cases need a layered optimisation framework for the same reason they need a layered TCO model: the cost is not in one line item, and neither is the saving.

What production agentic AI actually costs at scale

The structural reason production costs run hot is that the unit of work changes between pilot and production. A pilot typically measures cost per call. Production measures cost per workflow, and a workflow under an agentic architecture is rarely one call.

Anthropic’s engineering write-up on Claude’s multi-agent research system reports that agents typically use about 4× more tokens than chat interactions, and multi-agent systems about 15× more than the equivalent single-turn task. At those ratios, an agentic deployment functionally indistinguishable from the chatbot it replaces is, on a token-economics basis, an order of magnitude more expensive.

McKinsey’s State of AI tracking shows that enterprises moving from pilot to production discover cost categories that were absorbed into experimental budgets becoming real line items at scale: integration engineering, evaluation infrastructure, observability tooling, prompt and policy maintenance, human oversight during the first phase of production. None shrink with model price drops; some grow. Gartner’s April 2026 finding that only 28% of enterprise AI projects in infrastructure and operations fully pay off, with 57% of leaders who experienced failures citing “expected too much, too fast,” is the aggregate read-out of that gap. The capability is real; the unit economics under production load were not what the business case assumed.

The four cost-driver categories

Across published practitioner write-ups and platform vendor documentation, four categories recur. They are useful as a budgeting structure because each maps to a distinct optimisation lever. A CFO reviewing a production agentic-AI cost line should be able to attribute spend across all four.

Model-tier selection. The default error pattern is to route every call to the highest-capability model because that is what the demo used. In production, a meaningful share of agent steps — classification, extraction, deterministic transformation, simple branching — does not require frontier-model capability. Anthropic, OpenAI, and Google all publish tiered model pricing precisely because workloads are heterogeneous. Programmes that route tier-by-step rather than tier-by-deployment recover meaningful spend without functional regression. The work is in measurement: which step categories actually exhibit accuracy degradation under a smaller model, and which do not.

Request volume and caching. Agentic systems re-process the same context, retrieved documents, and instruction blocks across many turns and many agents. Anthropic’s prompt caching documentation offers up to 90% cost reduction on cached input tokens; OpenAI’s prompt caching and Google’s equivalent on Vertex offer similar mechanics. These are vendor-supplied levers that pilot architectures rarely engage because pilot volumes do not justify the engineering, and that production architectures must engage because volumes do. The same applies to vendor batch APIs — typically ~50% off list for asynchronous workloads with hours-not-seconds latency tolerance. A non-trivial slice of agentic workloads (overnight reconciliation, document processing, reporting) is batch-tolerant and is being run online by default.
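
To make the compounding concrete, a back-of-envelope sketch in Python. The monthly volume, list price, cacheable share, and batch-tolerant share are illustrative assumptions, not benchmarks; the 90% cached-input and 50% batch discounts are the vendor-published figures cited above.

```python
# Back-of-envelope: how caching and batch discounts compound on a hypothetical workload.
# All volumes, prices, and splits are illustrative assumptions, not benchmarks.

INPUT_TOKENS_PER_MONTH = 2_000_000_000   # 2B input tokens/month (assumption)
PRICE_PER_M_INPUT = 3.00                 # $/1M input tokens at synchronous list (assumption)

CACHEABLE_SHARE = 0.70    # share of input tokens stable across calls (assumption)
CACHED_DISCOUNT = 0.90    # vendor-published: up to 90% off cached input tokens
BATCH_SHARE = 0.30        # share of workload that is latency-tolerant (assumption)
BATCH_DISCOUNT = 0.50     # vendor-published: ~50% off list for batch APIs

baseline = INPUT_TOKENS_PER_MONTH / 1_000_000 * PRICE_PER_M_INPUT

# Caching applies to the stable share of input tokens.
after_caching = baseline * (1 - CACHEABLE_SHARE * CACHED_DISCOUNT)

# The batch discount then applies to the latency-tolerant slice of what remains.
after_batch = after_caching * (1 - BATCH_SHARE * BATCH_DISCOUNT)

print(f"baseline       ${baseline:,.0f}/month")
print(f"with caching   ${after_caching:,.0f}/month")
print(f"with batching  ${after_batch:,.0f}/month")
```

A real model would separate input and output tokens and price per model tier; the point of the sketch is only that the two levers multiply rather than add.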

Context-window drift. This is the cost driver least visible in the pilot phase and most visible in production. As agents iterate, accumulate retrieved context, retain conversation history, and load tools, the average input-token count per call drifts upward across deployment lifetime. The pilot measured cost per call against pilot-era prompts; six months in, the same call carries multiples of the input tokens. The Anthropic multi-agent token-ratio finding above is partly this: multi-agent systems carry more context state, and the context grows as the system does more. Programmes that measure average input tokens per call as a leading indicator catch this drift before it becomes a budget event.

Orchestration and observability overhead. Production agentic systems require trace logging, evaluation harnesses, tool-use audit, anomaly detection, and human-in-the-loop review queues. Vendor agent observability tooling and third-party platforms (LangSmith, Arize, Datadog’s LLM observability) are real line items, not infrastructure absorbed into existing IT budgets. McKinsey’s State of AI puts the change-management and operational-tooling category at a meaningful share of true AI TCO; under EU AI Act high-risk obligations and NIS2 incident-reporting requirements, it is rising rather than falling.

The four are not equally weighted across deployments. Customer-service agents tend to be dominated by volume and caching; document-processing by model-tier and batch economics; multi-step research agents by context drift; high-stakes financial or clinical agents by observability overhead.

What this means for CFOs evaluating the business case

Three implications follow for the finance side of an agentic-AI investment committee, as a complement to the three-document business case framework.

Cost-per-workflow is the only metric that survives the pilot-to-production transition. Cost per call, cost per token, and cost per user are useful operational signals. None is the right unit for the business case. A workflow that looks economical at $0.04 per call but consumes 12 calls per completion is a $0.48 workflow, before orchestration overhead. The pilot business case should be re-stated in cost-per-workflow terms before GA approval; the production business case should be tracked the same way.
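
For the re-statement itself, a minimal sketch of the arithmetic. The per-workflow allocation of fixed overhead is a simplifying assumption, and the observability figure in the example is hypothetical.

```python
def cost_per_workflow(cost_per_call: float,
                      calls_per_completion: float,
                      monthly_overhead: float,
                      completions_per_month: float) -> float:
    """Re-state a per-call figure as a per-workflow figure.

    Model and orchestration spend scales with calls; observability and tooling
    overhead is allocated per completed workflow (a simplifying assumption).
    """
    model_spend = cost_per_call * calls_per_completion
    overhead_per_workflow = monthly_overhead / completions_per_month
    return model_spend + overhead_per_workflow

# The $0.04-per-call example from the text: 12 calls per completion, plus a
# hypothetical $20k/month observability line spread over 100k completions.
print(cost_per_workflow(0.04, 12, 20_000, 100_000))  # 0.48 + 0.20 = 0.68
```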

The pilot-to-production cost gap is a default, not an outlier. Pilots run on developer-supervised flows with tight context windows and skipped observability. Production runs on real users with real load and full operational tooling. A business case that does not assume a multiplier between the two is implicitly assuming the production workload will look like the demo — which the McKinsey and Gartner data above both suggest is the failure pattern. The defensible base case is that production unit costs run several times the pilot, and that the optimisation work is what closes the gap.

The optimisation lever set is finite and vendor-published. Caching, tiered routing, batching, context discipline, observability budgeting — none of these are novel techniques requiring proprietary tooling. They are documented in the platform vendors’ own engineering blogs and pricing pages. A programme that reports “we don’t yet have the engineering bandwidth to engage prompt caching” or “we route everything to the largest model because it’s simpler” is choosing to spend, not failing to optimise. The CFO question is whether that choice is being made consciously, with the spend modelled, or implicitly.

The five-step optimisation sequence

For CIOs and platform leads carrying a production agentic-AI line, a sequence that reflects the order in which the levers actually pay back:

Step 1 — Instrument cost per workflow before optimising anything. Tag every agent call with the workflow it belongs to, the step within that workflow, and the model used. Compute weekly cost-per-workflow at the cohort level. Without this telemetry, every optimisation claim is a story; with it, every claim is a measurement. The first 2–4 weeks of a cost programme should produce no optimisation deltas, only baselines. Programmes that skip this step report savings that don’t reconcile to the cloud bill.
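
A minimal sketch of the tagging-and-aggregation step. The field names, pricing table, and call-log shape are assumptions for illustration, not a vendor schema.

```python
from collections import defaultdict
from datetime import date

# Illustrative per-million-token prices by tier (assumptions, not current list prices).
PRICE_PER_M = {
    "frontier": {"input": 3.00, "output": 15.00},
    "small":    {"input": 0.25, "output": 1.25},
}

def call_cost(model_tier: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICE_PER_M[model_tier]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def weekly_cost_per_workflow(call_log: list[dict]) -> dict:
    """Aggregate tagged agent calls into (iso week, workflow type) -> cost per completed workflow.

    Each call record is assumed to carry: workflow_id, workflow_type, step,
    model_tier, input_tokens, output_tokens, and a timestamp (a date).
    """
    spend = defaultdict(float)
    completions = defaultdict(set)
    for call in call_log:
        week = call["timestamp"].isocalendar()[:2]          # (year, week)
        key = (week, call["workflow_type"])
        spend[key] += call_cost(call["model_tier"], call["input_tokens"], call["output_tokens"])
        completions[key].add(call["workflow_id"])
    return {k: spend[k] / len(completions[k]) for k in spend}

# Example: two calls belonging to one invoice-processing workflow.
log = [
    {"workflow_id": "wf-1", "workflow_type": "invoice", "step": "extract",
     "model_tier": "small", "input_tokens": 4_000, "output_tokens": 500,
     "timestamp": date(2026, 4, 20)},
    {"workflow_id": "wf-1", "workflow_type": "invoice", "step": "reconcile",
     "model_tier": "frontier", "input_tokens": 12_000, "output_tokens": 1_200,
     "timestamp": date(2026, 4, 20)},
]
print(weekly_cost_per_workflow(log))
```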

Step 2 — Engage vendor prompt caching for stable context. For agents that share system prompts, instructions, retrieved documents, or tool schemas across calls — which is most agents — the cached-input discount is the highest-leverage optimisation available without architectural change. Anthropic’s prompt caching, OpenAI’s prompt caching, and the equivalent Vertex AI feature are vendor-published and well-documented. The work is identifying which tokens are actually stable across calls and structuring the prompt so the stable region is cacheable. Typically one engineer, one-to-two weeks per agent, paying back inside the first month at any non-trivial volume.
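
A minimal sketch against Anthropic's Messages API prompt-caching mechanism as publicly documented at the time of writing; the model identifier, file name, and prompt contents are placeholders, and the OpenAI and Vertex equivalents differ in mechanics.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The stable region: system instructions, policy text, tool schemas, reference docs.
# Structure the prompt so this block is byte-identical across calls; only then is it cacheable.
STABLE_CONTEXT = open("agent_policy_and_reference.md").read()  # hypothetical file

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; use the model the deployment actually runs
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_CONTEXT,
            # Marks the prompt prefix up to this block as cacheable; subsequent calls
            # that reuse the identical prefix are billed at the cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Classify this ticket: ..."}],
)

# The usage block reports how much input was written to or served from cache,
# which is the number to reconcile against the expected discount.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```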

Step 3 — Tier the model selection per step, not per deployment. Audit the workflow for steps that do not require frontier-model capability — classification, extraction, formatting, deterministic checks, retrieval re-ranking. Route those steps to a smaller model in the same provider’s family. Maintain the frontier model for the genuinely difficult reasoning step. Public evaluation benchmarks (HELM, Vellum’s leaderboard, LMSys arena) support step-level tiering for most enterprise workloads, though the right cut needs to be measured against the deployment’s own evaluation set, not assumed from a leaderboard.
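
One way to express tier-by-step routing, as a sketch. The step categories, model identifiers, and accuracy floor are assumptions to be replaced by the deployment's own evaluation results.

```python
# Step-level routing table: which step categories have demonstrated, against the
# deployment's own eval set, that a smaller model holds accuracy, and which stay
# on the frontier model. Model identifiers are placeholders.
STEP_MODEL = {
    "classify":       "small-model",
    "extract":        "small-model",
    "format":         "small-model",
    "rerank":         "small-model",
    "plan":           "frontier-model",
    "reason_and_act": "frontier-model",
}

FALLBACK = "frontier-model"

def model_for_step(step: str, eval_pass_rate: float | None = None,
                   accuracy_floor: float = 0.97) -> str:
    """Route a workflow step to a model tier.

    A step keeps its small-model assignment only if its measured eval pass rate
    clears the accuracy floor; otherwise it falls back to the frontier model.
    Both thresholds are illustrative assumptions.
    """
    candidate = STEP_MODEL.get(step, FALLBACK)
    if candidate != FALLBACK and eval_pass_rate is not None and eval_pass_rate < accuracy_floor:
        return FALLBACK
    return candidate

print(model_for_step("extract", eval_pass_rate=0.99))   # small-model
print(model_for_step("extract", eval_pass_rate=0.91))   # falls back to frontier-model
print(model_for_step("reason_and_act"))                  # frontier-model
```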

Step 4 — Move latency-tolerant workloads to batch. Overnight reconciliation, bulk document classification, periodic summarisation, evaluation runs against historical data — none of these need the synchronous API. The OpenAI batch API and Anthropic’s Message Batches API typically price at 50% of synchronous list, with completion within 24 hours. The engineering work is queue-and-poll plumbing; the savings on workloads that fit are immediate.
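
A minimal queue-and-poll sketch against Anthropic's Message Batches API as publicly documented at the time of writing; the request contents, model identifier, and polling interval are illustrative.

```python
import time
import anthropic

client = anthropic.Anthropic()

# Queue a latency-tolerant workload (e.g. overnight bulk classification) as a batch.
# custom_id ties each result back to the source record; contents are illustrative.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i:04d}",
            "params": {
                "model": "claude-sonnet-4-5",   # placeholder model identifier
                "max_tokens": 512,
                "messages": [{"role": "user", "content": f"Classify document {i}: ..."}],
            },
        }
        for i in range(3)
    ]
)

# Poll until the batch finishes; completion is within 24 hours, so the poll
# interval can be generous.
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(600)

# Stream results and reconcile by custom_id.
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text[:80])
    else:
        print(entry.custom_id, "did not succeed:", entry.result.type)
```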

Step 5 — Treat context-window growth as a budget metric. Set a target average input-token count per call per agent, monitor it weekly, and treat sustained drift as a cost incident. Retrieval layer, conversation-history retention policy, and tool-schema compactness are all controllable. The pattern that drives runaway production costs over twelve months is the slow expansion of average context, not a discrete pricing event.
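
A minimal sketch of the weekly check, assuming the telemetry from step 1 already yields an average-input-tokens-per-call series per agent; the target, tolerance, and window are illustrative budget parameters, not recommendations.

```python
from statistics import mean

def input_token_drift(weekly_averages: list[float],
                      target: float,
                      tolerance: float = 0.15,
                      sustained_weeks: int = 3) -> bool:
    """Flag sustained context-window drift as a cost incident.

    weekly_averages: average input tokens per call for one agent, oldest first.
    Returns True when the last `sustained_weeks` weeks all exceed the target by
    more than the tolerance — i.e. genuine drift, not a one-week spike.
    """
    recent = weekly_averages[-sustained_weeks:]
    return len(recent) == sustained_weeks and all(w > target * (1 + tolerance) for w in recent)

# Illustrative: an agent budgeted at 8k average input tokens per call.
history = [7_900, 8_100, 8_400, 9_600, 10_200, 11_050]
if input_token_drift(history, target=8_000):
    print(f"context drift: {mean(history[-3:]):,.0f} avg input tokens vs 8,000 target")
```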

A sixth lever sits outside the cost-engineering frame: observability spend should be budgeted as a deliberate line item, not absorbed. The temptation in a cost-reduction programme is to cut observability because it reads as overhead. Programmes that do this lose the ability to measure the optimisation work they are doing.

Holding-up note

The primary claim of this piece — that production agentic-AI costs at scale routinely run multiples of POC projections, and that a layered optimisation programme covering model tiering, caching, batching, context discipline, and observability budgeting closes most of the gap — is on a 90-day review cadence. Three kinds of evidence would move the verdict:

  • A second-half-2026 vendor pricing or capability shift that materially changes the cached-input or batch economics. A 50% change in either lever’s discount would re-weight the optimisation sequence.
  • Aggregate practitioner data from McKinsey’s late-2026 State of AI cycle or Stanford’s 2027 AI Index showing the pilot-to-production cost multiplier compressing to under 2× across enterprise deployments. Would weaken the structural-multiplier framing.
  • A published peer-reviewed study contradicting the Anthropic 4×/15× token-ratio finding for agentic vs single-turn workloads. Would force re-statement of the context-drift category’s weight.

If any land, the Holding-up record for AM-061 captures what changed, dated. The original sentence stays visible, annotated. Nothing is quietly removed.


Correction log

  1. 27 Apr 2026 · Rewritten from the 27 Jul 2025 WordPress-migrated original. Original used a fictional CTO scene (Marcus Chen, $4.2B logistics company, 9:47 AM Tuesday Seattle), fabricated case figures ($2.1M to $187K monthly, named-company before/after teardowns), fabricated expert quotes (Patricia Williams VP of Engineering at Walmart; David Park Principal at Goldman Sachs), and banned phrases (plot twist, the dirty secret, revolutionary, emoji subheads). Rewrite extracts the verifiable cost-driver categories with primary-source citations from Anthropic's published multi-agent token-ratio research, vendor prompt caching and batch-API pricing pages, McKinsey State of AI, Andreessen Horowitz on LLM inference economics, and Gartner's April 2026 I&O finding. REVIEW: Peter.

Spotted an error? See corrections policy →

Disagree with this piece?

Reasoned disagreement is a first-class signal here. Every review cycle weighs documented dissent; material dissent becomes part of the article's change history. This is not a corrections form — use /corrections/ for factual errors.

Part of the pillar

Enterprise AI cost and ROI

Verifying, tracking, and challenging the ROI claims vendors and analysts make about enterprise agentic AI. 13 other pieces in this pillar.

Vigil · 53 reviewed