Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter. How this works →
AM-142 · published 7 May 2026 · revised 7 May 2026 · 13 min read · in AI Implementation

AI agent vs AI assistant vs LLM: the 2026 enterprise distinction

AI agent, AI assistant, and LLM are three structurally different categories in 2026. Procurement that conflates them buys the wrong governance shape, the wrong cost structure, and the wrong identity model.

Holding · reviewed 7 May 2026 · next review +59d

Bottom line. AI agent, AI assistant, and large language model are three structurally different categories in 2026, even when the underlying weights are identical. The LLM reasons over text; the assistant adds tool-use with a human approving each step; the agent adds autonomy across a multi-step workflow with persistent state. The marketing layer collapses the three. Procurement that buys on the marketing layer ends up with the wrong governance shape, the wrong cost curve, and the wrong identity-and-access posture. The distinction is not pedantic. It is the variable that determines whether the deployment lands in the 23% scaling cohort or the 39% experimenting cohort McKinsey documented in State of AI 2025. Tracked at AM-142 on a 60-day cadence, currently Holding.

Why the distinction matters in 2026

Every CIO deck in Q2 2026 has the same shape. A vendor slide shows “agentic AI” written across a roadmap; the supporting product is a chat interface with retrieval. A different vendor labels a single-shot summariser an “agent” because it calls one tool. A third sells a genuine multi-step orchestrator on the same line item as a chat assistant. The labels have collapsed. The procurement decisions have not.

The three categories (LLM, AI assistant, AI agent) sit on top of each other as additive capability tiers. Each tier inherits everything from the tier below and adds one structural property. The LLM reasons over text and produces text. The assistant adds the ability to call tools and produce structured output, with a human in the loop approving each consequential action. The agent adds autonomy: it decomposes a goal into a sequence of steps, executes them across a multi-step workflow, and maintains state between steps without a human re-confirming each one.
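
To make the additive structure concrete, here is a minimal sketch of the three tiers as capability flags. The field names are illustrative, not any vendor's or standard's schema:

```python
# Illustrative capability flags for the three tiers; each tier keeps the
# properties of the tier below and adds one structural property.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    reasons_over_text: bool        # all three tiers
    calls_tools: bool              # assistant and agent
    human_gates_each_action: bool  # assistant only; moot for the LLM, dropped by the agent
    autonomous_multi_step: bool    # agent only
    persistent_state: bool         # agent only

LLM       = Tier(True, False, False, False, False)
ASSISTANT = Tier(True, True,  True,  False, False)
AGENT     = Tier(True, True,  False, True,  True)
```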

The same underlying model (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro) can sit at any of the three tiers depending on the wrapper. The model’s weights do not change. The product around the model does. That is the source of the marketing collapse: vendors describe the floor (the weights) when they are differentiating, and the ceiling (the agent positioning) when they are selling. The procurement team has to read the gap between the two.

This piece names the three categories, the structural property each adds, the realistic capability ceiling each can reach in 2026, and the different procurement decisions each implies. The reader is the IT leader writing the deck that justifies the 2026 spend, not the engineer choosing a model SDK. The distinction matters because the non-human identity model, the evaluation framework, and the cost forecast all change shape between the three tiers.

LLM: what it is and what it isn’t

A large language model is a foundation model exposed at the API layer with no tool-use, no autonomy, and no persistent state between requests. The product is text-in, text-out. Each call is independent. The reasoning happens inside the model’s forward pass. The output is constrained by the prompt and whatever response format the API accepts.

In 2026 the canonical examples at this tier are GPT-5 and GPT-4o (OpenAI), Claude Sonnet 4 and Claude Opus (Anthropic), and Gemini 2.5 Pro (Google), accessed at the raw API layer. The same weights power the consumer products one tier up; at the raw API the wrapper is absent. Developers integrate them via SDK, prompt them with a single message or a chat history, and receive a single response. The model can produce JSON if prompted to, but the JSON is text the model emitted, not the result of any external action.
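
The statelessness is easiest to see at the API layer. A minimal sketch using the OpenAI Python SDK (the model name and prompts are placeholders; any raw chat-completions API has the same shape):

```python
# The LLM tier: text in, text out, no state retained between calls.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

history = [{"role": "user", "content": "Summarise our Q2 churn drivers."}]
first = client.chat.completions.create(model="gpt-4o", messages=history)
print(first.choices[0].message.content)

# The API retains nothing. A follow-up only "remembers" the exchange
# because the developer re-sends it explicitly.
history.append({"role": "assistant", "content": first.choices[0].message.content})
history.append({"role": "user", "content": "Now rank them by revenue impact."})
second = client.chat.completions.create(model="gpt-4o", messages=history)
```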

What the LLM tier explicitly does not do:

  • It does not call tools. It can describe what tool to call, but cannot execute the call.
  • It does not retain memory between requests beyond what the developer manually passes in.
  • It does not act autonomously. There is no loop, no goal-pursuit, no multi-step plan that runs without a human in the loop.
  • It does not access live data unless the developer fetches the data and passes it into the prompt.

The capability ceiling at this tier is bounded by what reasoning can be done in a single forward pass plus the context the developer supplies. That ceiling is high. The LLM can write code, draft contracts, synthesise research summaries, translate, classify. It is also bounded. The LLM does not, on its own, do anything that requires touching another system.

Procurement at the LLM tier is API-shaped. The cost is per-token. The identity model is API-key. The governance question is “what data flows into and out of the API call?” The evaluation question is “is the model’s output good enough on the inputs we send it?” There is no autonomy axis to govern, because there is no autonomy.

The mistake at this tier is buying an LLM and writing the business case as if it were an agent. The LLM cannot, on its own, complete a multi-step workflow; the procurement team has to build the wrapper, and the wrapper is where the cost and the governance work actually live.

AI assistant: LLM + tool-use, human in the loop

An AI assistant is an LLM wrapped in a tool-use harness with structured output and a human-in-the-loop approval gate on consequential actions. The harness adds function calling: the assistant can describe what action to take, the runtime executes the action, and the result feeds back into the next reasoning step. The human still approves each consequential step. State is maintained per conversation; the assistant does not run goals to completion across many steps without human confirmation.
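
Reduced to a control-flow sketch, the tier looks like this. propose_action() is a hypothetical stand-in for the model's function-calling step and TOOLS an illustrative registry; neither is any vendor's actual runtime. The structural point is the approval gate between proposal and execution:

```python
# The assistant tier: the model proposes, the human approves, the runtime executes.
TOOLS = {
    "read_calendar": lambda day: f"3 meetings on {day}",
    "draft_email":   lambda to, subject: f"draft to {to}: {subject}",
}

def propose_action(conversation):
    """Stub standing in for the model deciding whether a tool call is needed."""
    return {"tool": "read_calendar", "args": {"day": "today"}}

def assistant_turn(conversation):
    proposal = propose_action(conversation)
    if proposal is None:                       # plain text reply, no action needed
        return conversation
    # The tier's structural property: a human approves each consequential
    # action before the runtime executes it.
    prompt = f"Run {proposal['tool']}({proposal['args']})? [y/N] "
    if input(prompt).strip().lower() != "y":
        conversation.append({"role": "system", "content": "action declined"})
        return conversation
    result = TOOLS[proposal["tool"]](**proposal["args"])
    conversation.append({"role": "tool", "content": result})
    return conversation
```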

The canonical 2026 examples at this tier are ChatGPT (consumer and Team), Claude.ai with computer-use, Microsoft Copilot for productivity (the Word, Excel, Teams, and Outlook integrations), Google Gemini in Workspace, and the various retrieval-augmented chatbots enterprises have shipped on top of LLM APIs. The product is a conversation. The user types a request; the assistant reasons; if a tool call is required, the assistant proposes it; the user approves (explicitly or implicitly through the UI flow); the action executes; the result returns to the conversation.

The structural property the assistant adds over the LLM is bounded action. The assistant can do things in the world, but each thing is gated by the human turn. The action surface expands considerably (read a calendar, draft an email, run a SQL query against an approved database, summarise a document the user uploaded), but the human remains the orchestrator. The assistant proposes; the human disposes.

Procurement at the assistant tier is application-shaped. The cost is per-seat or per-message, with the LLM token cost rolled in. The identity model is application-token: the assistant authenticates as the user, inheriting the user’s permissions for the duration of the session. The governance question is “what tools does the assistant have access to, and what data flows through them?” The evaluation question is “does the assistant’s behaviour stay within the user’s intent and the user’s permission scope?”

The capability ceiling at this tier is higher than the LLM ceiling because tool-use compounds reasoning with action, but it is still bounded by the human turn. An assistant cannot, by definition, run a five-hour goal without checking in. Vendors who claim otherwise are describing the next tier and labelling it down.

The mistake at this tier is the inverse of the LLM mistake. Procurement buys an assistant (Microsoft Copilot, ChatGPT Team, Claude.ai with computer-use) and writes the governance plan as if no autonomy is involved. The assistant inherits the user’s permissions on the user’s behalf. When the assistant has access to a tool the user has not consciously authorised it to use, the gap is in the procurement, not the product.

AI agent: assistant + autonomy across multi-step workflows

An AI agent adds autonomy on top of the assistant tier. The agent decomposes a high-level goal into a sequence of sub-tasks, executes them across multiple steps without a human re-confirming each one, maintains persistent state between steps, and pursues the goal until a stopping condition (success, failure, budget exhaustion, or human interrupt) is met. The human sets the goal at the start and reviews the outcome at the end; the steps in between run on the agent’s plan.
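
The outer loop, as a sketch. plan_next_step() and execute() are hypothetical stubs, and a human-interrupt path is omitted for brevity; the shape is what matters: a loop, persistent state, no per-step human gate, and explicit stopping conditions:

```python
# The agent tier: goal in, loop runs, report out when a stopping condition hits.
def plan_next_step(state):
    """Stub: returns the next sub-task, or None when the goal is met."""
    remaining = state["plan"][len(state["completed"]):]
    return remaining[0] if remaining else None

def execute(step):
    """Stub: runs one sub-task, returns (result, token_cost)."""
    return f"done: {step}", 4_000

def run_agent(goal, plan, budget_tokens=100_000, max_steps=50):
    state = {"goal": goal, "plan": plan, "completed": [], "spent": 0}
    for _ in range(max_steps):
        step = plan_next_step(state)
        if step is None:
            return "success", state                 # stopping condition: goal met
        result, cost = execute(step)                # no human confirmation per step
        state["completed"].append((step, result))   # state persists across steps
        state["spent"] += cost
        if state["spent"] > budget_tokens:
            return "budget_exhausted", state        # stopping condition: budget
    return "step_limit_reached", state              # stopping condition: failure
```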

The 2026 examples at this tier are Microsoft 365 Copilot Agent Mode (the multi-step orchestration product), Cursor Agent Mode (the IDE coding agent), GitHub Copilot Coding Agent (the autonomous PR-writing product), Anthropic’s Managed Agents in public beta from April 2026, and the deployments built on the OpenAI Agents SDK. The product is a runtime: the user states a goal, the agent runs, the agent reports back when it is done or stuck.

The structural property the agent adds over the assistant is autonomy across a multi-step workflow. That autonomy is the variable that changes everything downstream. The cost curve becomes opaque (token spend depends on how the agent plans, not how the user types). The identity model breaks (per-action permissions cannot be inherited from the user when the user is not in the loop for each action). The evaluation problem inverts (the user cannot judge the output without re-doing the work). The governance shape is the four-layer extension covered in the non-human identity playbook.

The capability ceiling at this tier is the most-cited and most-misread number in the 2026 procurement conversation. Salesforce AI Research’s CRMArena-Pro benchmark lands top frontier models at approximately 35% multi-step task completion on enterprise CRM workflows. CMU’s TheAgentCompany benchmark lands them near 30% on broader enterprise agent tasks. Both numbers measure whether the agent completes the full task end-to-end, not whether it produces something useful along the way. The capability is real, and it is also bounded.

A procurement team that internalises the 30-35% number has a different conversation with the vendor than one that does not. The conversation moves from “can the agent do this?” to “what fraction of the time does the agent complete this end-to-end, and what does the failure mode cost us?” The second question is the one that determines whether the deployment scales.

Procurement at the agent tier is platform-shaped. The cost is platform-fee plus consumption (token spend, tool-call spend, observability spend, eval spend). The identity model is per-agent, time-bounded, with action-level approval gates. The four-layer extension is the operational fix, not optional. The governance question is “what is the agent allowed to do without checking in, and what is the blast radius if it does the wrong thing?” The evaluation question is “what fraction of goals does the agent complete to a quality bar that survives audit?” The agent eval frameworks coverage walks through the tooling that operationalises the second question.
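
What per-agent, time-bounded, action-gated looks like as a policy object, sketched with illustrative field names rather than any IAM product's schema:

```python
# A sketch of an agent-tier identity-and-budget policy; every field name here
# is illustrative, not a real product's configuration surface.
AGENT_POLICY = {
    "identity": "agent://finance/invoice-reconciler",  # per-agent, not per-user
    "credential_ttl_minutes": 60,                       # time-bounded credentials
    "autonomous_actions": ["read_ledger", "draft_report"],
    "gated_actions": {"post_journal_entry": "require_human_approval"},
    "budget": {"tokens_per_run": 200_000, "tool_calls_per_run": 300},
}
```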

Why the boundaries blur in marketing copy

Every vendor has a commercial incentive to label up the stack. Agent sells better than assistant; assistant sells better than LLM. The label compresses upward regardless of the product’s actual capability tier. A retrieval-augmented chat interface gets “agentic” marketing. A single-tool function-calling wrapper gets “agent” positioning. A genuine multi-step orchestrator and a single-shot prompt template share a category page on the same vendor site.

The compression is not unique to AI. The same pattern repeats with “AI” itself (every search box was AI-powered for two years), “intelligent” (the early-2010s wave of “intelligent” automation), and “smart” (every appliance sold in the last decade). The underlying mechanism is the same: the term that signals capability gets stretched to cover anything in the same general space, until the term loses its discrimination power and the next term emerges.

For agentic AI, the discrimination test is the multi-step capability test. Three questions a procurement team can ask the vendor and verify (a sorting sketch follows the list):

  1. Does the product run a goal to completion across multiple steps without a human re-confirming each one? If the answer is “the user approves each action,” the product is an assistant, not an agent. The vendor’s positioning may still be useful; the label is just inflated.
  2. Does the product maintain state between steps? If each step is a fresh conversation and the agent has to be re-briefed, the product is a multi-call assistant pattern, not an agent. The distinction shows up in the cost curve and in the failure modes.
  3. Does the product expose a benchmark number on a multi-step enterprise task suite? Vendors building genuine agents have started publishing CRMArena-Pro, TheAgentCompany, SWE-bench, or equivalent benchmark results. Vendors selling assistants under agent labels usually do not, because the benchmark would expose the gap.
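
The three questions compress into a sorting function. A sketch, with booleans mapping one-to-one to the questions above; the labels are illustrative:

```python
# Sorting a shortlist line item by the multi-step capability test.
def classify(runs_goal_unattended: bool,
             keeps_state_between_steps: bool,
             publishes_multistep_benchmark: bool) -> str:
    if runs_goal_unattended and keeps_state_between_steps:
        return ("agent" if publishes_multistep_benchmark
                else "agent claim, unverified: ask for benchmark results")
    if runs_goal_unattended:
        return "multi-call assistant pattern, agent-labelled"
    return "assistant"
```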

The procurement team that runs every vendor on the shortlist through the three questions gets a much sharper sense of which products are actually at the agent tier. The exercise also surfaces the products that are at the assistant tier and are cheaper, lower-risk, and arguably better fits for the use case the team is actually trying to solve. Buying an assistant to do an assistant's job and labelling the procurement honestly is not a step backwards. It is the difference between deployments that scale and deployments that stall.

The procurement implication

Three categories, three different procurement decisions. The shorthand:

LLM tier. API contract, per-token billing, governance scoped to the data flowing in and out of the API, identity model on API keys, evaluation on input-output quality. Procurement is straightforward; the work happens in the wrapper the team builds around the API. Cost forecast is bounded by request volume.

Assistant tier. SaaS contract, per-seat billing, governance scoped to the tools the assistant can access on the user’s behalf, identity model on application tokens inheriting user permissions, evaluation on session-level intent fidelity. Procurement is mid-complexity. The work happens in scoping which tools and which data the assistant gets access to per user role. Cost forecast scales with seat count.

Agent tier. Platform contract plus consumption, platform-fee plus token-and-tool spend, governance scoped to the four-layer extension model in the non-human identity playbook, identity model on per-agent identities with time-bounded credentials and action-level approval gates, evaluation on goal-completion rate measured against an enterprise-task benchmark. Procurement is high-complexity. The work spans IAM, security, finance, and the eval-framework selection covered in the agent evaluation piece. Cost forecast is opaque without observability and budget caps wired in from day one.

A CIO procuring across all three tiers in 2026 (the realistic case for any enterprise above mid-market) needs three different governance shapes, three different cost models, and three different review cadences. Treating them as one bucket produces the conflation hidden inside McKinsey's aggregate 23% scaling figure: deployments that look like they are working, billed in the same line item as deployments that are not, and governed under a model that fits one of the three tiers but not the other two.

The framework is not novel. Practitioners have been drawing the LLM-assistant-agent diagram in slide decks for two years. What is novel is treating the three as three separate procurement workstreams with three separate governance shapes, instead of one workstream with three subcategories. The practitioners who land in the 6% AI-high-performer cohort McKinsey identifies appear to be the ones who run the three as three. The practitioners who land in the 39% experimenting cohort appear to be the ones who run them as one.

The next 90 days, for a CIO mid-procurement: take the current shortlist and sort each line item into LLM, assistant, or agent based on the multi-step capability test. The line items that move tier as a result are the ones where the procurement was being written against the wrong governance shape. That is where the remediation effort produces the highest return.

The primary claim of this piece (that LLM, assistant, and agent are three structurally different procurement categories in 2026, distinguished by reasoning, tool-use, and autonomy respectively) is logged at AM-142 on a 60-day review cadence. Three kinds of evidence would move the verdict: a vendor product genuinely collapsing two tiers (autonomous tool-use without per-action human approval, but without persistent multi-step state), a benchmark methodology change that moves the 30-35% multi-step ceiling materially upward across vendors, or a regulatory framework that defines the three tiers differently than the engineering literature does. The next review of this claim is scheduled for early July 2026.



