Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter.
AM-212 · pub 10 Jun 2026 · rev 10 Jun 2026 · read 5 min · in Understanding AI

What is AI observability, and why your APM cannot do it

Gartner predicts 40% of AI-deploying organisations will run dedicated AI observability tools by 2028. The reason it needs its own tooling: AI fails semantically — drift, bias, opaque reasoning — while classic monitoring watches infrastructure health.

Holding · reviewed 10 Jun 2026 · next +90d

Bottom line. Gartner predicts that 40% of organisations deploying AI will implement dedicated AI observability tools by 2028, against a nascent base today. The reason it needs its own tooling: AI fails semantically, through drift, bias and opaque decision logic, while the monitoring estate you already own watches infrastructure and application health. A model can be up, fast and error-free, and wrong.

Report. Gartner put a number and a date on the discipline on 12 May 2026: 40% of organisations deploying AI will implement dedicated AI observability tools by 2028, to monitor model performance, bias and outputs. Its definition comes in two parts, and the second is where the new discipline lives. Observability in general is “the characteristic of software and systems that enables them to be understood based on their outputs and enables questions about their behaviour to be answered.” AI observability extends it: it “requires dedicated tools that manage and assess the behaviour, decision-making and risks of an AI solution, such as model drift, bias and LLM logic.”

The analyst behind the release named the gap directly:

“AI is everywhere, but most organisations are still figuring out how to monitor and trust these systems. That visibility gap makes scaling risky, and that’s why observability matters. Unlike traditional software, AI’s decision making is often hidden, making it hard to explain or trust, yet errors can cause substantial financial loss, reputational damage and regulatory scrutiny.”

— Padraig Byrne, VP Analyst, Gartner, with the 12 May 2026 prediction.

| Question | Classic monitoring (APM) | AI observability |
| --- | --- | --- |
| Is the service up and fast? | yes, its core job | assumed, not the point |
| Did the model’s behaviour drift this month? | invisible | core telemetry |
| Are outputs biased across segments? | invisible | core telemetry |
| Why did the agent decide this? | invisible | the reasoning trail |
| What did this outcome cost in tokens? | invisible | spend-per-outcome |

The distinction operationalises Gartner’s two-part definition; the row set is ours.

Why this is a separate discipline

Observe. The structural reason AI observability is not an APM feature is that the failure modes live in a different layer. Classic monitoring answers operational questions, and most estates today watch exactly that: infrastructure and application health. An AI system can pass every one of those checks while failing semantically: the model drifted, the outputs skew, the reasoning is wrong in ways nobody can reconstruct. Gartner’s release puts the operational consequence plainly: without standardised model telemetry, infrastructure and operations teams face prolonged incident resolution requiring “complex manual efforts to trace and debug the behaviors of opaque deep learning models.” The outage pages someone at 3 a.m.; the wrong answer compounds silently for a quarter.
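
To make "the model drifted" concrete, here is a minimal sketch of one common drift signal, the Population Stability Index (PSI), computed over a model's output labels. The choice of PSI, the label names and the sample counts are our assumptions for illustration; Gartner's release does not prescribe a metric.

```python
import math
from collections import Counter


def psi(baseline, current, eps=1e-6):
    """Population Stability Index between two samples of categorical labels.

    0.0 means identical distributions; larger values mean a larger shift.
    eps guards against log(0) when a category is absent from one sample.
    """
    base_counts = Counter(baseline)
    cur_counts = Counter(current)
    total = 0.0
    for category in set(baseline) | set(current):
        p = max(base_counts[category] / len(baseline), eps)
        q = max(cur_counts[category] / len(current), eps)
        total += (q - p) * math.log(q / p)
    return total


# Hypothetical decision logs from the same model, one month apart.
last_month = ["approve"] * 70 + ["refer"] * 20 + ["decline"] * 10
this_month = ["approve"] * 50 + ["refer"] * 20 + ["decline"] * 30

drift_score = psi(last_month, this_month)
print(f"PSI = {drift_score:.3f}")
```

In this invented sample the service would look perfectly healthy to an APM (every request succeeded), while the PSI lands near 0.29. A frequently quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that threshold is a convention, not a standard, and should be tuned per system.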

Agentic systems add a cost dimension to the same telemetry problem. Gartner’s Hype Cycle for Agentic AI (2 Apr 2026, full document on subscription) warns that without rigorous financial guardrails, attribution and observability, agentic systems “can spiral into unpredictable token spend and API charges” with little insight into actual ROI. That is the same uninstrumented-behaviour failure wearing a FinOps costume, and it is the gap our agentic cost-governance read maps in detail.
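
The spend-per-outcome idea can be sketched in a few lines. Everything named here is hypothetical: the `AgentRun` record, the task names and the per-token prices are placeholders, not any vendor's schema or price list. The point is only that cost is attributed to successful outcomes rather than to calls.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_IN_PER_1K = 0.003
PRICE_OUT_PER_1K = 0.015


@dataclass
class AgentRun:
    task: str
    input_tokens: int
    output_tokens: int
    succeeded: bool


def run_cost(run):
    """Token cost of a single agent run in USD."""
    return (run.input_tokens / 1000 * PRICE_IN_PER_1K
            + run.output_tokens / 1000 * PRICE_OUT_PER_1K)


def spend_per_outcome(runs):
    """Total spend per task divided by that task's successful outcomes.

    Failed runs still cost money, so they raise the cost of each success.
    A task with no successes maps to None (cost per outcome is undefined).
    """
    totals = {}
    for run in runs:
        cost, wins = totals.get(run.task, (0.0, 0))
        totals[run.task] = (cost + run_cost(run), wins + int(run.succeeded))
    return {task: (cost / wins if wins else None)
            for task, (cost, wins) in totals.items()}


runs = [
    AgentRun("refund", 10_000, 2_000, True),
    AgentRun("refund", 30_000, 4_000, False),  # retried and failed, still billed
    AgentRun("refund", 8_000, 1_000, True),
    AgentRun("triage", 5_000, 500, False),
]
result = spend_per_outcome(runs)
print(result)
```

Dividing by successes rather than by runs is the design choice that matters: a loop of failed retries shows up as a rising cost per outcome instead of disappearing into an average cost per call.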

What it means for the monitoring estate

Reflect. For a CIO, the 40%-by-2028 figure reads less like a market forecast and more like a deadline shaped by regulation and incident exposure. The EU AI Act’s transparency obligations and the audit posture of frameworks like NIST’s AI RMF both presume you can answer questions about your AI’s behaviour, which is, literally, Gartner’s definition of observability. The engineering layer below this decision, namely which tracing and evaluation stack to run, is covered in our production observability read and the tooling comparison; this piece is the layer above: what the discipline is, and why the estate you own does not already do it.

The honest caveat cuts the other way too: a dedicated tool does not install the discipline. Buying an AI-observability platform without defining what a wrong outcome looks like for your systems reproduces the agent-washing pattern on the buyer’s side: capability theatre, this time in the monitoring rack. That is the failure mode the agent-washing test exists to catch.

Share thoughts. Start with definitions, not procurement. For each AI system in production, write down what a wrong outcome is, then instrument the time it takes you to detect exactly that. Detection time for agent misbehaviour is a measurable quantity, and MTTD-for-Agents is our framework for it. A system without a measured detection time has an unmeasured one, which is how AI incidents run for weeks before anyone notices. Tools then get evaluated against your definitions, in the order the production stack read walks. One unhedged line: if you deployed agents in Q1 and cannot say today what your detection time for a bad outcome is, that, not tooling selection, is your observability project.
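
As a starting point for measuring detection time, the sketch below computes a plain mean time to detect from incident records. It is a generic calculation under our own assumptions, not the MTTD-for-Agents framework itself; the record fields, system names and timestamps are invented for illustration.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: when the wrong behaviour began, when it was noticed.
incidents = [
    {"system": "refund-agent",
     "started_at": datetime(2026, 3, 2, 9, 0),
     "detected_at": datetime(2026, 3, 2, 15, 0)},
    {"system": "refund-agent",
     "started_at": datetime(2026, 4, 10, 8, 0),
     "detected_at": datetime(2026, 4, 12, 8, 0)},
    {"system": "triage-agent",
     "started_at": datetime(2026, 5, 1, 0, 0),
     "detected_at": None},  # still undetected by any monitor
]


def detection_hours(records):
    """Hours from onset to detection, for detected incidents only."""
    return [
        (r["detected_at"] - r["started_at"]).total_seconds() / 3600
        for r in records
        if r["detected_at"] is not None
    ]


hours = detection_hours(incidents)
mttd_hours = mean(hours)
undetected = [r["system"] for r in incidents if r["detected_at"] is None]

print(f"MTTD: {mttd_hours:.1f} h over {len(hours)} detected incidents")
print(f"Systems with undetected incidents: {undetected}")
```

Undetected incidents are reported separately rather than dropped silently, because averaging only the incidents you caught flatters the number. The onset timestamp is the hard part in practice: it usually has to be reconstructed after the fact, which is only possible if the wrong outcome was defined beforehand.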

Holding-up note

The primary claim of this piece (that AI observability per Gartner’s definition is a distinct discipline from classic monitoring because AI fails semantically rather than operationally, that Gartner predicts 40% of AI-deploying organisations will run dedicated tooling by 2028 from a nascent base, and that defining wrong-outcome metrics precedes tooling purchase) is on a 90-day review cadence. Three kinds of evidence would move the verdict: APM incumbents absorbing semantic AI telemetry convincingly enough that the dedicated-tool framing weakens; a later Gartner wave revising the 40% trajectory materially; or incident data showing organisations with dedicated AI observability detecting model failures no faster than those without, which would falsify the discipline’s premise. The Holding-up record for AM-212 captures what changes, dated. Figures are from Gartner’s published research as of 10 Jun 2026.
