Agent observability stack: the four layers production agentic-AI actually needs (and what each one misses)
Production agentic-AI in 2026 needs four observability layers: infrastructure, LLM-call, trace, and output. Most enterprise deployments instrument only the cheaper subset. The failure modes the missing layers catch are the ones that produce the next regulatory enforcement headline.
Partial · reviewed 29 Apr 2026 · next review +60 days. This piece predates the current editorial standard and is in the rewrite queue. The body below is retained for link integrity while the new analysis is prepared. When the rewrite ships, the claim (AM-114) moves from Partial to Holding and the update is dated in the correction log.
Most 2026 enterprise agentic-AI deployments are instrumented for two of the four observability layers production work requires. The cheaper subset (infrastructure plus LLM-call telemetry) is already in place because Datadog, New Relic, Helicone, and the like sit on top of existing investments. The more expensive subset (multi-step trace observability plus output-distribution drift detection) is often deferred, tagged “Phase 2,” or scoped out entirely on the assumption that it can be added when something breaks.
The failure modes the missing layers catch are the ones that produce the next regulatory enforcement headline. This piece is the instrumentation companion to the agent SLA architecture (AM-110): the SLA piece argues for four metrics; this piece argues for the four observability layers that produce them. A procurement team signs the SLA; an engineering team instruments the layers. Both sides have to be in place for the agreement to be operationally real.
Report: the four-layer observability pattern
Each layer has a settled tool category in 2026. The category structure has converged across vendors in a way the SLA contract surface has not.
Layer 1: infrastructure observability. CPU, memory, network, disk, container health, GPU utilisation, queue depth on whatever orchestrates the agent runtime. The existing enterprise platform covers this layer regardless of vendor: Datadog, New Relic, Dynatrace, Grafana plus Prometheus, Splunk. Layer 1 is solved as a category; the question is whether the agent runtime is captured alongside the rest of the application stack.
Layer 2: LLM-call observability. Per-call data: which model was invoked, which version, prompt and completion content (subject to PII handling), input and output token counts, latency, cost, error or refusal status, retry behaviour. Specialist tools: Helicone, LangSmith, Langfuse, Arize. The infrastructure platforms have shipped extensions that cover the same surface natively: Datadog AI Observability and New Relic AI Monitoring both ingest OpenTelemetry GenAI spans directly. Specialist or absorbed-into-platform is a procurement question turning on existing licence posture and data-retention requirements, not on capability.
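The per-call record described above can be sketched as a plain data structure. The field names loosely follow OpenTelemetry GenAI attribute naming (`gen_ai.request.model` and so on), but the class, the input-token figure, and the helper method are illustrative assumptions, not any vendor's schema:

```python
from dataclasses import dataclass, field, asdict
import time

@dataclass
class LLMCallRecord:
    """One Layer 2 telemetry record per model invocation (illustrative).

    Field names loosely follow OpenTelemetry GenAI attribute naming;
    this is a sketch, not a vendor or semconv schema.
    """
    request_model: str           # model requested, e.g. "claude-sonnet"
    response_model: str          # model version that actually answered
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float
    status: str = "ok"           # "ok" | "error" | "refusal"
    retries: int = 0
    timestamp: float = field(default_factory=time.time)

    def cost_per_output_token(self) -> float:
        return self.cost_usd / self.output_tokens if self.output_tokens else 0.0

# The worked per-call figures from the text; the 1203 input tokens
# are a hypothetical filler value.
rec = LLMCallRecord("claude-sonnet", "claude-sonnet", 1203, 1847, 4.2, 0.029)
print(asdict(rec)["output_tokens"])   # 1847
```

A record like this is what a specialist tool or a platform-native GenAI extension would emit per call; the procurement question in the text is about who stores and queries it, not about the shape of the record.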
Layer 3: trace observability. The multi-step view: how an agent’s task decomposed into tool calls, which agents called which others, what intermediate state the reasoning chain produced, where in the chain a failure entered. The protocol layer is OpenTelemetry GenAI semantic conventions, promoted to stable for a subset of GenAI spans on 13 Mar 2026. Ingestion targets: Honeycomb for the trace-and-derived-fields workflow, AWS X-Ray with GenAI extensions, Google Cloud Vertex AI monitoring, and Arize Phoenix (managed or open-source self-host). The CNCF OpenTelemetry GenAI working group is where the convergence runs across Anthropic, Microsoft, Google, AWS, Datadog, New Relic, Honeycomb, Arize, and Galileo.
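A minimal sketch of what the Layer 3 chain view buys: given flat spans linked by parent IDs (a deliberately simplified, assumed shape, not the OTel GenAI span format), reconstruct the step order of a linear chain and locate where a failure entered it:

```python
# Layer 3 sketch: rebuild a task's step chain from flat spans and find
# the earliest failing step. The span dict shape is illustrative.

def order_chain(spans):
    """Return spans root-first by following parent_id links (linear chain)."""
    by_parent = {s["parent_id"]: s for s in spans}
    chain, cursor = [], None          # the root span has parent_id None
    while cursor in by_parent:
        step = by_parent[cursor]
        chain.append(step)
        cursor = step["span_id"]
    return chain

def first_failure(spans):
    """Name of the earliest step whose status is not 'ok', else None."""
    for step in order_chain(spans):
        if step["status"] != "ok":
            return step["name"]
    return None

trace = [
    {"span_id": "s3", "parent_id": "s2", "name": "summarise",   "status": "ok"},
    {"span_id": "s1", "parent_id": None, "name": "plan",        "status": "ok"},
    {"span_id": "s2", "parent_id": "s1", "name": "tool:search", "status": "error"},
]
print(first_failure(trace))   # tool:search
```

This is exactly the question Layer 2 cannot answer from per-call records alone: the failing call and the call that compensated for it only relate to each other through the chain structure.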
Layer 4: output observability. The population-level view: has the distribution of agent output characteristics shifted over the rolling window against the calibration baseline? Dimensions are task-dependent: citation counts and source diversity for a research agent, approved vendor counts for a procurement agent, message length and tone scores for a communication agent. Category tools: Galileo, Arize Phoenix, and the open-source Evidently AI library. Statistical-drift detectors compute population stability index, Jensen-Shannon divergence, or comparable measures. Datadog and New Relic have model-monitor capabilities at this layer in preview; depth of coverage in 2026 still favours the specialists.
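The two drift measures named above fit in a few lines of pure Python. These versions assume both distributions are already binned over matched bins; the epsilon smoothing and the 0.2 rule-of-thumb threshold mentioned in the comment are our choices, not library defaults:

```python
import math

def _kl(p, q):
    """Kullback-Leibler divergence over matched bins (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    (same bins, each summing to 1). Symmetric, bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def psi(expected, actual, eps=1e-6):
    """Population stability index over matched bins; eps guards empty
    bins. A common rule of thumb treats PSI > 0.2 as meaningful drift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.5, 0.3, 0.2]    # e.g. binned citation counts at calibration
current  = [0.2, 0.3, 0.5]    # same bins over the rolling window
print(round(psi(baseline, current), 3))   # 0.55
```

The hard part in practice is not this arithmetic but everything around it: choosing the dimensions, maintaining the calibration baseline, and tuning thresholds per deployment, which is where the specialist tools earn their licence cost.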
The four layers are peers, not rungs on a Maslow's hierarchy. Each has its own failure surface; each must be instrumented independently for the deployment to be observed completely.
Observe: what each layer sees vs misses
Most deployments are under-instrumented because visible failures point downward (Layer 1, where alerts are loud and on-call teams exist) and silent failures point upward (Layers 3 and 4, where the deployment learns about a failure from a customer or a regulator).
Layer 1 sees pipe health, misses everything model-shaped. A flat CPU graph tells the team the runtime is up; it cannot tell the team the agent is producing wrong actions. A 99.95% infrastructure-availability quarter and a high-incident agent-layer quarter coexist routinely.
Layer 2 sees per-call cost and latency, misses the chain. Helicone or LangSmith will show a call to Claude Sonnet returned in 4.2 seconds and produced 1,847 output tokens for $0.029. It will not show that the call was the third in a four-step chain where the second step’s output was malformed and the third compensated by hallucinating. Layer 2 catches per-call regression; it misses the cross-call reasoning failure most associated with multi-agent and tool-using deployments.
Layer 3 sees the full chain, misses the population question. Honeycomb on OpenTelemetry GenAI traces shows that this specific agent on this specific request called these tools in this order with these intermediate values. It does not, on its own, tell the deployment that the population of all such traces over the rolling 30-day window has shifted in a statistically meaningful way. Trace observability is per-incident explainability; the population view requires Layer 4.
Layer 4 sees the population, misses individual-trace explainability. Galileo, Arize Phoenix, or Evidently fires when the output distribution drifts past threshold. It does not tell the deployment which specific trace caused the drift to cross — that requires joining the alert to Layer 3 trace data. Layer 4 is the leading indicator; Layer 3 is the diagnostic.
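The alert-to-trace join can be sketched directly. The trace-record shape and the distance-from-baseline ranking are illustrative assumptions, not a vendor feature; real tools would do this join over indexed trace stores:

```python
# Joining a Layer 4 alert back to Layer 3: given the traces inside the
# drifted window, surface the ones whose drifted dimension sits furthest
# from the calibration baseline mean. Record shape is illustrative.

def drift_suspects(traces, dimension, baseline_mean, top_n=5):
    """Trace IDs ranked by distance from the baseline mean on the
    drifted output dimension; candidates for manual diagnosis."""
    ranked = sorted(
        traces,
        key=lambda t: abs(t[dimension] - baseline_mean),
        reverse=True,
    )
    return [t["trace_id"] for t in ranked[:top_n]]

window_traces = [
    {"trace_id": "t1", "citations": 6},
    {"trace_id": "t2", "citations": 1},
    {"trace_id": "t3", "citations": 5},
]
print(drift_suspects(window_traces, "citations", baseline_mean=5.0, top_n=2))
# ['t2', 't1']
```

The ranking only narrows the search; the actual diagnosis still happens inside the Layer 3 trace view, which is the leading-indicator/diagnostic split the paragraph above describes.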
Removing any layer removes a failure-mode class from the detectable surface. The 88% incident rate in the Gravitee State of AI Agent Security report for 2026 enterprise deployments is largely a Layer 3 and Layer 4 detection-gap story: the platform layer was operating well, the agent layer was producing the incidents, and the deployment had no instrument pointed at where the failure occurred.
Reflect: the under-instrumented common case
The pattern across 2026 enterprise agentic deployments is recognisable. The IT organisation has a mature Layer 1 platform from an earlier observability cycle. It added Layer 2 since 2024 because cost-per-token visibility became a finance-side requirement. It has partial Layer 3 because the platform vendor shipped GenAI tracing and the team enabled it. Layer 4 is unbudgeted because the category did not exist when the original roadmap was drafted. The configuration measures what was easy to measure; it does not measure the failure surface specific to autonomous agents.
The reflexive reading at IT-leader level is that this is a tooling-roadmap problem solvable through procurement. It is, in part. The deeper observation is that procurement is buying tools without reference to a coherent four-layer model, which produces overlap and gaps simultaneously: a Layer 2 specialist and platform-native Layer 2 may both be live while Layer 4 is missing. A deployment running Helicone alongside Datadog AI Observability is paying twice for Layer 2 and zero times for Layer 4.
The regulatory observation is sharper. EU AI Act Article 12 obligations on automated event logging traceable to specific outputs are unmeetable without Layer 3 trace observability against per-agent identity (the AM-037 piece on non-human identity covers the identity precondition). Article 9 obligations on continuous risk management implicitly require Layer 4 drift detection. Market-surveillance authorities operating the Article 17 quality-management evidence base from 2 Aug 2026 onward will look at the evidence the deployment can produce; a Layers-1-and-2-only deployment cannot produce evidence on obligations that turn on Layers 3 and 4.
Share thoughts: the four-line procurement-or-build template
The template below maps each layer to a deliverable, a cost band, and a tool shortlist. Cost bands are order-of-magnitude annual ranges for a mid-to-large enterprise running production agentic workloads.
Line 1, infrastructure observability (Layer 1). Deliverable: agent runtime, orchestrator, and inference infrastructure visible on the existing observability dashboard with the same alerting and on-call routing as the rest of the application stack. Cost band: typically zero incremental cost on top of the existing platform licence; the work is configuration. Tool shortlist: whatever the deployment already runs (Datadog, New Relic, Dynatrace, Grafana plus Prometheus, Splunk). Build option: existing platform plus an OpenTelemetry collector wired to the agent runtime.
Line 2, LLM-call observability (Layer 2). Deliverable: per-call telemetry on every model invocation with model version, token counts, latency, cost, content (subject to PII handling), and OpenTelemetry GenAI span emission. Cost band: $20K to $150K per year for a specialist (Helicone, LangSmith, Langfuse, Arize), or absorbed into the existing platform cost if Datadog AI Observability or New Relic AI Monitoring is already active. Tool shortlist: Helicone or Langfuse for cost-conscious or self-host preference; LangSmith for LangChain-native deployments; Arize for the unified-vendor case with Layers 3 and 4. Build option: feasible but rarely worth it; the protocol layer is settled and build cost exceeds licence cost at most volumes.
Line 3, trace observability (Layer 3). Deliverable: full multi-step reasoning chain visible per agent task, with tool calls, intermediate state, agent-to-agent delegation, and OpenTelemetry GenAI span format throughout. Cost band: $50K to $300K per year as a specialist (Honeycomb, Arize Phoenix managed), or lower marginal cost when absorbed into a cloud-platform AI-monitoring extension (AWS X-Ray with GenAI extensions, Google Cloud Vertex AI monitoring, Datadog AI Observability tracing). Tool shortlist: Honeycomb for derived-fields workflow, Arize Phoenix for agent-trace specialisation (open-source self-host or managed), AWS X-Ray for AWS-native, Vertex AI monitoring for GCP-native. Build option: feasible on OpenTelemetry GenAI plus a self-hosted Phoenix or Tempo backend; moderate build cost, justified for high-volume or sensitive-data deployments.
Line 4, output observability (Layer 4). Deliverable: rolling-window output-distribution drift detection against a deployment-specific calibration baseline, with population stability index, Jensen-Shannon divergence, or comparable measures, and alerting wired to the incident-response runbook. Cost band: $30K to $200K per year for a specialist (Galileo, Arize Phoenix evals); the open-source Evidently library is no-cost on licence but carries operational engineering cost. Tool shortlist: Galileo for purpose-built agent-output evaluation, Arize Phoenix for unified Layers 3 and 4, Evidently for open-source self-host with engineering capacity. Build option: feasible on Evidently as the drift-detector library plus deployment-side baseline maintenance; build cost is high because the calibration prompt set and threshold tuning are the labour-intensive parts.
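The Line 4 build option can be sketched library-agnostically (this is not Evidently's API): keep a rolling window of one output characteristic, bin it against the calibration baseline, and fire the runbook hook when PSI crosses a threshold. Bin edges, window size, and the 0.2 threshold here are all illustrative choices:

```python
import math
from collections import deque

# Illustrative calibration artefacts: citation-count bin edges and the
# bin shares recorded at calibration time. Both are deployment-specific.
BINS = [0, 2, 5, 10]            # bins [0,2), [2,5), [5,10)
BASELINE = [0.2, 0.5, 0.3]      # calibration-time bin shares

def bin_shares(values, edges=BINS, eps=1e-6):
    """Bin raw values and return per-bin shares, floored at eps."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    n = max(sum(counts), 1)
    return [max(c / n, eps) for c in counts]

def psi(expected, actual):
    """Population stability index over matched bins."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

window = deque(maxlen=500)      # rolling window of recent outputs

def observe(citations, alert=print, threshold=0.2):
    """Record one output, return current PSI, fire alert past threshold."""
    window.append(citations)
    score = psi([max(e, 1e-6) for e in BASELINE], bin_shares(window))
    if score > threshold:
        alert(f"Layer 4 drift: PSI={score:.2f}")   # wire to runbook here
    return score
```

Even in this toy form, the labour-intensive parts the text flags are visible: `BINS` and `BASELINE` are the calibration artefacts that must be built and maintained per deployment, and `threshold` is the tuning knob that separates a useful alert from pager noise.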
The four lines together are the instrumentation contract that supports the four-metric SLA contract. A vendor signs the SLA; an engineering team ships the four layers for the SLA to be operationally real. Agent-platform contracts signed in 2026 without a parallel four-layer instrumentation programme are formally covered and operationally undefended.
How the four layers compose into the four SLA metrics
| SLA metric (AM-110) | Primary layer | Supporting layers |
|---|---|---|
| Action-bounded availability | Layer 3 with action-class label | Layer 2 for per-call latency; Layer 1 for infrastructure baseline |
| MTTD-for-Agents | Layers 2, 3, 4 (signal aggregation) | Layer 2 action volume; Layer 3 tool-use distribution; Layer 4 output distribution |
| Output-distribution drift | Layer 4 | Layer 3 for trace-level drill-down on drift incidents |
| Per-class action error budget | Layer 3 plus Layer 4 correctness eval | Layer 2 for per-call cost contribution |
A deployment without Layer 4 cannot compute MTTD-for-Agents or output-distribution drift. A deployment without Layer 3 cannot compute action-bounded availability or per-class action error budget. The instrumentation gap is the SLA gap, expressed in engineering rather than procurement language.
The companion pieces: the agent SLA architecture (AM-110) is the procurement and contract surface; the non-human identity playbook (AM-037) is the per-agent identity precondition for action-class labelling at Layer 3; the MTTD-for-Agents framework is the detection-time discipline the four layers feed. Together they compose the operational definition of “production-ready” for agentic-AI in enterprise environments.
Holding-up note
The primary claim of this piece is logged at AM-114 on the Holding-up ledger on a 60-day review cadence. Three kinds of evidence would move the verdict:
- A platform consolidation in which a single vendor ships credible coverage across all four layers, reducing the “common case” gap by removing the procurement coordination problem. Datadog and New Relic both moved closer to this profile in Q1 2026; whether the depth at Layer 4 reaches specialist parity is the open question.
- A standards-body publication that defines agent observability primitives explicitly at the population and drift layer, equivalent to what OpenTelemetry GenAI semantic conventions did for the trace layer. The CNCF OpenTelemetry GenAI working group output in late 2026 is the candidate.
- A regulator enforcement action in which an inadequate Layer 3 or Layer 4 observability posture was the in-scope finding. EU AI Act Article 9 and Article 17 obligations after 2 Aug 2026 will reveal whether market-surveillance authorities treat Layer 4 drift detection as part of the quality-management evidence base.
Next review: 28 Jun 2026.
Correction log
- 29 Apr 2026: Initial publication. Initial verdict 'Partial': the four-layer model is observable from current 2026 tool categories and OpenTelemetry GenAI convergence, but the procurement-or-build cost bands are publication estimates and have not been tested across a representative sample of enterprise deployments.