Agent observability stack: the four layers production agentic-AI actually needs (and what each one misses)
Production agentic-AI in 2026 needs four observability layers: infrastructure, LLM-call, trace, and output. Most enterprise deployments instrument only the cheaper subset. The failure modes the missing layers catch are the ones that produce the next regulatory enforcement headline.
Partial · reviewed 29 Apr 2026 · next review +60 days. This piece predates the current editorial standard and is in the rewrite queue. The body below is retained for link integrity while the new analysis is prepared. When the rewrite ships, the claim (AM-114) moves from Partial to Holding and the update is dated in the correction log.
Most 2026 enterprise agentic-AI deployments are instrumented for two of the four observability layers production work requires. The cheaper subset (infrastructure plus LLM-call telemetry) is already in place because Datadog, New Relic, Helicone, and the like sit on top of existing investments. The more expensive subset (multi-step trace observability plus output-distribution drift detection) is often deferred, tagged “Phase 2,” or scoped out entirely on the assumption that it can be added when something breaks.
The failure modes the missing layers catch are the ones that produce the next regulatory enforcement headline. This piece is the instrumentation companion to the agent SLA architecture (AM-110): the SLA piece argues for four metrics; this piece argues for the four observability layers that produce them. A procurement team signs the SLA; an engineering team instruments the layers. Both sides have to be in place for the agreement to be operationally real.
Report: the four-layer observability pattern
Each layer has a settled tool category in 2026. The category structure has converged across vendors in a way the SLA contract surface has not.
Layer 1: infrastructure observability. CPU, memory, network, disk, container health, GPU utilisation, queue depth on whatever orchestrates the agent runtime. The existing enterprise platform covers this layer regardless of vendor: Datadog, New Relic, Dynatrace, Grafana plus Prometheus, Splunk. Layer 1 is solved as a category; the question is whether the agent runtime is captured alongside the rest of the application stack.
Layer 2: LLM-call observability. Per-call data: which model was invoked, which version, prompt and completion content (subject to PII handling), input and output token counts, latency, cost, error or refusal status, retry behaviour. Specialist tools: Helicone, LangSmith, Langfuse, Arize. The infrastructure platforms have shipped extensions that cover the same surface natively: Datadog AI Observability and New Relic AI Monitoring both ingest OpenTelemetry GenAI spans directly. Specialist or absorbed-into-platform is a procurement question turning on existing licence posture and data-retention requirements, not on capability.
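The per-call record described above can be sketched as a plain data structure. The field names loosely follow OpenTelemetry GenAI attribute naming (`gen_ai.request.model` and so on), but the class, the input-token figure, and the helper method are illustrative assumptions, not any vendor's schema:

```python
from dataclasses import dataclass, field, asdict
import time

@dataclass
class LLMCallRecord:
    """One Layer 2 telemetry record per model invocation (illustrative).

    Field names loosely follow OpenTelemetry GenAI attribute naming;
    this is a sketch, not a vendor or semconv schema.
    """
    request_model: str           # model requested, e.g. "claude-sonnet"
    response_model: str          # model version that actually answered
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float
    status: str = "ok"           # "ok" | "error" | "refusal"
    retries: int = 0
    timestamp: float = field(default_factory=time.time)

    def cost_per_output_token(self) -> float:
        return self.cost_usd / self.output_tokens if self.output_tokens else 0.0

# The worked per-call figures from the text; the 1203 input tokens
# are a hypothetical filler value.
rec = LLMCallRecord("claude-sonnet", "claude-sonnet", 1203, 1847, 4.2, 0.029)
print(asdict(rec)["output_tokens"])   # 1847
```

A record like this is what a specialist tool or a platform-native GenAI extension would emit per call; the procurement question in the text is about who stores and queries it, not about the shape of the record.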
Layer 3: trace observability. The multi-step view: how an agent’s task decomposed into tool calls, which agents called which others, what intermediate state the reasoning chain produced, where in the chain a failure entered. The protocol layer is OpenTelemetry GenAI semantic conventions, promoted to stable for a subset of GenAI spans on 13 Mar 2026. Ingestion targets: Honeycomb for the trace-and-derived-fields workflow, AWS X-Ray with GenAI extensions, Google Cloud Vertex AI monitoring, and Arize Phoenix (managed or open-source self-host). The CNCF OpenTelemetry GenAI working group is where the convergence runs across Anthropic, Microsoft, Google, AWS, Datadog, New Relic, Honeycomb, Arize, and Galileo.
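A minimal sketch of what the Layer 3 chain view buys: given flat spans linked by parent IDs (a deliberately simplified, assumed shape, not the OTel GenAI span format), reconstruct the step order of a linear chain and locate where a failure entered it:

```python
# Layer 3 sketch: rebuild a task's step chain from flat spans and find
# the earliest failing step. The span dict shape is illustrative.

def order_chain(spans):
    """Return spans root-first by following parent_id links (linear chain)."""
    by_parent = {s["parent_id"]: s for s in spans}
    chain, cursor = [], None          # the root span has parent_id None
    while cursor in by_parent:
        step = by_parent[cursor]
        chain.append(step)
        cursor = step["span_id"]
    return chain

def first_failure(spans):
    """Name of the earliest step whose status is not 'ok', else None."""
    for step in order_chain(spans):
        if step["status"] != "ok":
            return step["name"]
    return None

trace = [
    {"span_id": "s3", "parent_id": "s2", "name": "summarise",   "status": "ok"},
    {"span_id": "s1", "parent_id": None, "name": "plan",        "status": "ok"},
    {"span_id": "s2", "parent_id": "s1", "name": "tool:search", "status": "error"},
]
print(first_failure(trace))   # tool:search
```

This is exactly the question Layer 2 cannot answer from per-call records alone: the failing call and the call that compensated for it only relate to each other through the chain structure.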
Layer 4: output observability. The population-level view: has the distribution of agent output characteristics shifted over the rolling window against the calibration baseline? Dimensions are task-dependent: citation counts and source diversity for a research agent, approved vendor counts for a procurement agent, message length and tone scores for a communication agent. Category tools: Galileo, Arize Phoenix, and the open-source Evidently AI library. Statistical-drift detectors compute population stability index, Jensen-Shannon divergence, or comparable measures. Datadog and New Relic have model-monitor capabilities at this layer in preview; depth of coverage in 2026 still favours the specialists.
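The two drift measures named above fit in a few lines of pure Python. These versions assume both distributions are already binned over matched bins; the epsilon smoothing and the 0.2 rule-of-thumb threshold mentioned in the comment are our choices, not library defaults:

```python
import math

def _kl(p, q):
    """Kullback-Leibler divergence over matched bins (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    (same bins, each summing to 1). Symmetric, bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def psi(expected, actual, eps=1e-6):
    """Population stability index over matched bins; eps guards empty
    bins. A common rule of thumb treats PSI > 0.2 as meaningful drift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.5, 0.3, 0.2]    # e.g. binned citation counts at calibration
current  = [0.2, 0.3, 0.5]    # same bins over the rolling window
print(round(psi(baseline, current), 3))   # 0.55
```

The hard part in practice is not this arithmetic but everything around it: choosing the dimensions, maintaining the calibration baseline, and tuning thresholds per deployment, which is where the specialist tools earn their licence cost.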
The four layers are peers, not rungs on a Maslow's hierarchy. Each has its own failure surface; each must be instrumented independently for the deployment to be observed completely.
Observe: what each layer sees vs misses
Most deployments are under-instrumented because visible failures point downward (Layer 1, where alerts are loud and on-call teams exist) and silent failures point upward (Layers 3 and 4, where the deployment learns about a failure from a customer or a regulator).
Layer 1 sees pipe health, misses everything model-shaped. A flat CPU graph tells the team the runtime is up; it cannot tell the team the agent is producing wrong actions. A 99.95% infrastructure-availability quarter and a high-incident agent-layer quarter coexist routinely.
Layer 2 sees per-call cost and latency, misses the chain. Helicone or LangSmith will show a call to Claude Sonnet returned in 4.2 seconds and produced 1,847 output tokens for $0.029. It will not show that the call was the third in a four-step chain where the second step’s output was malformed and the third compensated by hallucinating. Layer 2 catches per-call regression; it misses the cross-call reasoning failure most associated with multi-agent and tool-using deployments.
Layer 3 sees the full chain, misses the population question. Honeycomb on OpenTelemetry GenAI traces shows that this specific agent on this specific request called these tools in this order with these intermediate values. It does not, on its own, tell the deployment that the population of all such traces over the rolling 30-day window has shifted in a statistically meaningful way. Trace observability is per-incident explainability; the population view requires Layer 4.
Layer 4 sees the population, misses individual-trace explainability. Galileo, Arize Phoenix, or Evidently fires when the output distribution drifts past threshold. It does not tell the deployment which specific trace caused the drift to cross — that requires joining the alert to Layer 3 trace data. Layer 4 is the leading indicator; Layer 3 is the diagnostic.
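The alert-to-trace join can be sketched directly. The trace-record shape and the distance-from-baseline ranking are illustrative assumptions, not a vendor feature; real tools would do this join over indexed trace stores:

```python
# Joining a Layer 4 alert back to Layer 3: given the traces inside the
# drifted window, surface the ones whose drifted dimension sits furthest
# from the calibration baseline mean. Record shape is illustrative.

def drift_suspects(traces, dimension, baseline_mean, top_n=5):
    """Trace IDs ranked by distance from the baseline mean on the
    drifted output dimension; candidates for manual diagnosis."""
    ranked = sorted(
        traces,
        key=lambda t: abs(t[dimension] - baseline_mean),
        reverse=True,
    )
    return [t["trace_id"] for t in ranked[:top_n]]

window_traces = [
    {"trace_id": "t1", "citations": 6},
    {"trace_id": "t2", "citations": 1},
    {"trace_id": "t3", "citations": 5},
]
print(drift_suspects(window_traces, "citations", baseline_mean=5.0, top_n=2))
# ['t2', 't1']
```

The ranking only narrows the search; the actual diagnosis still happens inside the Layer 3 trace view, which is the leading-indicator/diagnostic split the paragraph above describes.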
Removing any layer removes a failure-mode class from the detectable surface. The 88% incident rate in the Gravitee State of AI Agent Security report for 2026 enterprise deployments is largely a Layer 3 and Layer 4 detection-gap story: the platform layer was operating well, the agent layer was producing the incidents, and the deployment had no instrument pointed at where the failure occurred.
Reflect: the under-instrumented common case
The pattern across 2026 enterprise agentic deployments is recognisable. The IT organisation has a mature Layer 1 platform from an earlier observability cycle. It added Layer 2 since 2024 because cost-per-token visibility became a finance-side requirement. It has partial Layer 3 because the platform vendor shipped GenAI tracing and the team enabled it. Layer 4 is unbudgeted because the category did not exist when the original roadmap was drafted. The configuration measures what was easy to measure; it does not measure the failure surface specific to autonomous agents.
The reflexive reading at IT-leader level is that this is a tooling-roadmap problem solvable through procurement. It is, in part. The deeper observation is that procurement is buying tools without reference to a coherent four-layer model, which produces overlap and gaps simultaneously: a Layer 2 specialist and platform-native Layer 2 may both be live while Layer 4 is missing. A deployment running Helicone alongside Datadog AI Observability is paying twice for Layer 2 and zero times for Layer 4.
The regulatory observation is sharper. EU AI Act Article 12 obligations on automated event logging traceable to specific outputs are unmeetable without Layer 3 trace observability against per-agent identity (the AM-037 piece on non-human identity covers the identity precondition). Article 9 obligations on continuous risk management implicitly require Layer 4 drift detection. Market-surveillance authorities operating the Article 17 quality-management evidence base from 2 Aug 2026 onward will look at the evidence the deployment can produce; a Layers-1-and-2-only deployment cannot produce evidence on obligations that turn on Layers 3 and 4.
Share thoughts: the four-line procurement-or-build template
The template below maps each layer to a deliverable, a cost band, and a tool shortlist. Cost bands are order-of-magnitude annual ranges for a mid-to-large enterprise running production agentic workloads.
Line 1, infrastructure observability (Layer 1). Deliverable: agent runtime, orchestrator, and inference infrastructure visible on the existing observability dashboard with the same alerting and on-call routing as the rest of the application stack. Cost band: typically zero incremental cost on top of the existing platform licence; the work is configuration. Tool shortlist: whatever the deployment already runs (Datadog, New Relic, Dynatrace, Grafana plus Prometheus, Splunk). Build option: existing platform plus an OpenTelemetry collector wired to the agent runtime.
Line 2, LLM-call observability (Layer 2). Deliverable: per-call telemetry on every model invocation with model version, token counts, latency, cost, content (subject to PII handling), and OpenTelemetry GenAI span emission. Cost band: $20K to $150K per year for a specialist (Helicone, LangSmith, Langfuse, Arize), or absorbed into the existing platform cost if Datadog AI Observability or New Relic AI Monitoring is already active. Tool shortlist: Helicone or Langfuse for cost-conscious or self-host preference; LangSmith for LangChain-native deployments; Arize for the unified-vendor case with Layers 3 and 4. Build option: feasible but rarely worth it; the protocol layer is settled and build cost exceeds licence cost at most volumes.
Line 3, trace observability (Layer 3). Deliverable: full multi-step reasoning chain visible per agent task, with tool calls, intermediate state, agent-to-agent delegation, and OpenTelemetry GenAI span format throughout. Cost band: $50K to $300K per year as a specialist (Honeycomb, Arize Phoenix managed), or lower marginal cost when absorbed into a cloud-platform AI-monitoring extension (AWS X-Ray with GenAI extensions, Google Cloud Vertex AI monitoring, Datadog AI Observability tracing). Tool shortlist: Honeycomb for derived-fields workflow, Arize Phoenix for agent-trace specialisation (open-source self-host or managed), AWS X-Ray for AWS-native, Vertex AI monitoring for GCP-native. Build option: feasible on OpenTelemetry GenAI plus a self-hosted Phoenix or Tempo backend; moderate build cost, justified for high-volume or sensitive-data deployments.
Line 4, output observability (Layer 4). Deliverable: rolling-window output-distribution drift detection against a deployment-specific calibration baseline, with population stability index, Jensen-Shannon divergence, or comparable measures, and alerting wired to the incident-response runbook. Cost band: $30K to $200K per year for a specialist (Galileo, Arize Phoenix evals); the open-source Evidently library is no-cost on licence but carries operational engineering cost. Tool shortlist: Galileo for purpose-built agent-output evaluation, Arize Phoenix for unified Layers 3 and 4, Evidently for open-source self-host with engineering capacity. Build option: feasible on Evidently as the drift-detector library plus deployment-side baseline maintenance; build cost is high because the calibration prompt set and threshold tuning are the labour-intensive parts.
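The Line 4 build option can be sketched library-agnostically (this is not Evidently's API): keep a rolling window of one output characteristic, bin it against the calibration baseline, and fire the runbook hook when PSI crosses a threshold. Bin edges, window size, and the 0.2 threshold here are all illustrative choices:

```python
import math
from collections import deque

# Illustrative calibration artefacts: citation-count bin edges and the
# bin shares recorded at calibration time. Both are deployment-specific.
BINS = [0, 2, 5, 10]            # bins [0,2), [2,5), [5,10)
BASELINE = [0.2, 0.5, 0.3]      # calibration-time bin shares

def bin_shares(values, edges=BINS, eps=1e-6):
    """Bin raw values and return per-bin shares, floored at eps."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    n = max(sum(counts), 1)
    return [max(c / n, eps) for c in counts]

def psi(expected, actual):
    """Population stability index over matched bins."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

window = deque(maxlen=500)      # rolling window of recent outputs

def observe(citations, alert=print, threshold=0.2):
    """Record one output, return current PSI, fire alert past threshold."""
    window.append(citations)
    score = psi([max(e, 1e-6) for e in BASELINE], bin_shares(window))
    if score > threshold:
        alert(f"Layer 4 drift: PSI={score:.2f}")   # wire to runbook here
    return score
```

Even in this toy form, the labour-intensive parts the text flags are visible: `BINS` and `BASELINE` are the calibration artefacts that must be built and maintained per deployment, and `threshold` is the tuning knob that separates a useful alert from pager noise.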
The four lines together are the instrumentation contract that supports the four-metric SLA contract. A vendor signs the SLA; an engineering team ships the four layers for the SLA to be operationally real. Agent-platform contracts signed in 2026 without a parallel four-layer instrumentation programme are formally covered and operationally undefended.
How the four layers compose into the four SLA metrics
| SLA metric (AM-110) | Primary layer | Supporting layers |
|---|---|---|
| Action-bounded availability | Layer 3 with action-class label | Layer 2 for per-call latency; Layer 1 for infrastructure baseline |
| MTTD-for-Agents | Layers 2, 3, 4 (signal aggregation) | Layer 2 action volume; Layer 3 tool-use distribution; Layer 4 output distribution |
| Output-distribution drift | Layer 4 | Layer 3 for trace-level drill-down on drift incidents |
| Per-class action error budget | Layer 3 plus Layer 4 correctness eval | Layer 2 for per-call cost contribution |
A deployment without Layer 4 cannot compute MTTD-for-Agents or output-distribution drift. A deployment without Layer 3 cannot compute action-bounded availability or per-class action error budget. The instrumentation gap is the SLA gap, expressed in engineering rather than procurement language.
The companion pieces: the agent SLA architecture (AM-110) is the procurement and contract surface; the non-human identity playbook (AM-037) is the per-agent identity precondition for action-class labelling at Layer 3; the MTTD-for-Agents framework is the detection-time discipline the four layers feed. Together they compose the operational definition of “production-ready” for agentic-AI in enterprise environments.
Holding-up note
The primary claim of this piece is logged at AM-114 on the Holding-up ledger on a 60-day review cadence. Three kinds of evidence would move the verdict:
- A platform consolidation in which a single vendor ships credible coverage across all four layers, reducing the “common case” gap by removing the procurement coordination problem. Datadog and New Relic both moved closer to this profile in Q1 2026; whether the depth at Layer 4 reaches specialist parity is the open question.
- A standards-body publication that defines agent observability primitives explicitly at the population and drift layer, equivalent to what OpenTelemetry GenAI semantic conventions did for the trace layer. The CNCF OpenTelemetry GenAI working group output in late 2026 is the candidate.
- A regulator enforcement action in which an inadequate Layer 3 or Layer 4 observability posture was the in-scope finding. EU AI Act Article 9 and Article 17 obligations after 2 Aug 2026 will reveal whether market-surveillance authorities treat Layer 4 drift detection as part of the quality-management evidence base.
Next review: 28 Jun 2026.
Correction log
- 29 Apr 2026: Initial publication. Initial verdict 'Partial': the four-layer model is observable from current 2026 tool categories and OpenTelemetry GenAI convergence, but the procurement-or-build cost bands are publication estimates and have not been tested across a representative sample of enterprise deployments.