Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter. How this works →
AM-110 · published 29 Apr 2026 · revised 29 Apr 2026 · 9 min read · in AI Implementation

Agent SLA architecture: what 'production-ready' actually means for autonomous, non-deterministic actors

Traditional SLAs were drafted against deterministic systems. Autonomous agents produce variable outputs by design. The four metrics that actually work for agents are action-bounded availability, MTTD-for-Agents, output-distribution drift, and per-class action error budget. Vendors that cannot expose these are not yet production-ready.

Holding · reviewed 29 Apr 2026 · next review +60d

A working definition of “production-ready” for autonomous agents is overdue. Enterprises are signing agent-platform contracts in 2026 against SLA language drafted for a class of system the agent does not belong to. The gap produces two failure modes: the contract demands a guarantee the vendor cannot meaningfully sign (99.9% correctness on a non-deterministic system, where correctness is not even defined), or it demands a guarantee the customer cannot meaningfully verify (an SLO on a metric the platform does not expose at the action layer). Both leave the deployment formally covered and operationally undefended.

This piece sets out the four metrics that map onto autonomous agent behaviour the way uptime and latency mapped onto deterministic systems, plus the four-line SLA template that puts those metrics inside an enforceable contract. The detection-time companion (the publication’s MTTD-for-Agents framework) covers the latency-of-detection metric in depth; this piece situates MTTD inside the broader four-metric architecture.

Report: how SRE drafted the metrics that no longer fit

The Google Site Reliability Engineering handbook and the follow-on SRE Workbook are the reference texts for service-level objectives. The SRE definition is precise: a service-level indicator is a measured property of the service, a service-level objective is a target value over a time window, and an error budget is the objective's complement, the fraction of failures the target tolerates. The canonical examples are availability, latency, and error rate.

The pattern works because the underlying system is deterministic. A request that succeeds for one customer succeeds for the next, given the same input and code path. Twenty years of operational experience inside Google and the wider SRE community refined this model to its current shape.

Autonomous agents break the precondition. The OpenTelemetry project recognised this with the GenAI semantic conventions, promoted to stable for a subset of GenAI spans on 13 Mar 2026 after two years of vendor convergence. Anthropic, Microsoft, Google, AWS, Datadog, New Relic, Honeycomb, Arize, and Galileo all participate; the protocol is not contested. What the conventions deliberately do not do is define an SLO. Span shape is necessary for measurement; metric definitions and the contractual surface are downstream questions the procurement and SRE communities still own. The vacuum is what this piece is about.

Observe: the structural reason uptime and p95 latency miss the agent failure mode

Three observations from 2025 and Q1 2026 deployments converge.

Vendor uptime is decoupled from agent correctness. A vendor can hit 99.95% API availability for a quarter while the deployment is producing wrong actions. The Cloud Security Alliance MAESTRO threat-modelling guidance describes the same gap from the security side: “the agent took an action” and “the agent took the right action” are different propositions, and only the first is observable from the request layer. The 88% incident rate across 2026 enterprise agentic deployments documented in the Gravitee State of AI Agent Security report describes deployments where the platform layer was operating well within SLA while the agent layer was producing the incidents.

P95 latency is calibrated to the wrong unit. Request-layer latency captures time to first token or time to a complete response. For an agent that calls four tools in sequence, the meaningful latency is time to action completion across all four plus any human-in-loop confirmation gates. A deployment can hit excellent p95 token latency while action-completion p95 sits hours past budget. The mismatch shows up most often in financial-workflow deployments where the contractually relevant latency is end-to-end approval-to-settlement.

Error rate undercounts agent failure. A request that returns a 200 response and a syntactically valid output is, by the SRE definition, successful. The agent-layer question (was the output the right one) sits outside the indicator. Enterprises report request-layer error rates well under 1% while action-layer error rates against ground truth land in the 5% to 25% range depending on task class.
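The divergence is easy to make concrete. A toy scoring of the same traffic at the request layer and the action layer, with the call outcomes invented purely for illustration:

```python
# Each call scored at two layers. A request that returns 200 with a
# syntactically valid output is a request-layer success even when the
# action it took was wrong against ground truth.
calls = [
    # (http_ok, output_valid, action_correct)
    (True, True, True),
    (True, True, False),   # 200 + valid output, but the wrong action
    (True, True, True),
    (True, True, False),
]

request_errors = sum(1 for ok, valid, _ in calls if not (ok and valid)) / len(calls)
action_errors = sum(1 for *_, right in calls if not right) / len(calls)
# request_errors == 0.0 while action_errors == 0.5
```

The request-layer indicator reports a flawless service while half the actions are wrong, which is exactly the undercounting the paragraph above describes.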

Existing SLO primitives are not wrong; they are incomplete. The platform layer needs them. The agent layer needs four additional primitives the SRE handbook did not have to define.

Reflect: the four metrics that map onto autonomous agent behaviour

Each metric has a precedent inside the SRE, observability, or evaluation communities. The editorial value is the bundling and the discipline that all four ship together.

Action-bounded availability. The fraction of attempted actions inside a defined action class that complete inside the per-class budget, including all tool calls, approval gates, and bounded retries. The unit is the action class, not the request. A class is a customer-defined grouping (procurement-write, knowledge-base-read, financial-system-write, customer-communication) with its own latency budget and success criteria. The metric is computed on top of OpenTelemetry GenAI spans inside the deployment’s observability platform.
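A minimal sketch of the computation in Python. The action-class names and budgets are illustrative assumptions, and a real deployment would derive the records from OpenTelemetry GenAI spans rather than construct them by hand:

```python
from dataclasses import dataclass

# Hypothetical per-class latency budgets in seconds; the taxonomy and
# the numbers are customer-defined, not prescribed by any standard.
BUDGETS_S = {"procurement-write": 300.0, "knowledge-base-read": 30.0}

@dataclass
class ActionRecord:
    action_class: str
    duration_s: float   # tool calls + approval gates + bounded retries
    succeeded: bool

def action_bounded_availability(records, budgets):
    """Fraction of attempted actions per class that completed
    successfully inside that class's latency budget."""
    out = {}
    for cls, budget in budgets.items():
        attempts = [r for r in records if r.action_class == cls]
        if not attempts:
            continue
        ok = sum(1 for r in attempts if r.succeeded and r.duration_s <= budget)
        out[cls] = ok / len(attempts)
    return out
```

Note that an action that succeeds but overruns its budget counts against availability, which is what distinguishes this metric from a plain success rate.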

MTTD-for-Agents. The median hours between an anomalous agent behaviour occurring and the deploying organisation detecting it, with the publication’s targets of 4 hours for large enterprise and 24 hours for mid-market. The four tripwires (scope drift, permission creep, data-exfiltration pattern, cross-agent privilege echo) and the five-phase detection chain are at the MTTD-for-Agents framework page. MTTD pairs with action-bounded availability: the first measures whether the agent is acting inside its envelope, the second how fast the deployment notices when it is not.
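Once anomaly onset and detection timestamps are logged, the metric itself is a one-liner; the incident log below is invented for illustration:

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical incident log: (anomaly_onset, detected_at) pairs.
incidents = [
    (datetime(2026, 4, 1, 9, 0), datetime(2026, 4, 1, 12, 0)),   # 3 h
    (datetime(2026, 4, 3, 2, 0), datetime(2026, 4, 3, 8, 0)),    # 6 h
    (datetime(2026, 4, 7, 14, 0), datetime(2026, 4, 7, 16, 0)),  # 2 h
]

def mttd_hours(incidents):
    """Median hours from anomalous behaviour occurring to detection."""
    return median((detected - onset) / timedelta(hours=1)
                  for onset, detected in incidents)

# mttd_hours(incidents) == 3.0, inside the 4-hour large-enterprise target
```

The hard part is not the arithmetic but the onset timestamp, which only exists if the tripwire instrumentation can reconstruct when the anomalous behaviour began.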

Output-distribution drift. The statistical shift in agent output characteristics over a rolling window, measured against a baseline established during calibration. The dimensions are task-dependent: for a procurement agent, the distribution of approved vendor counts per request; for a research agent, citation counts and source diversity; for a communication agent, message length and tone classifier scores. Drift detection ships in Arize Phoenix, Galileo Evaluate, and the open-source Evidently library, computing population stability index, Jensen-Shannon divergence, or comparable measures. Drift is the leading indicator for silent vendor model updates and for deployment-side prompt or tool-surface changes that produce unintended shifts.
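As one example of the measures named above, population stability index can be computed directly from binned counts. This is a generic PSI sketch, not the specific implementation any of the named tools ship:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population stability index between two binned distributions.

    `expected` are raw counts per bin from the calibration baseline,
    `actual` from the rolling window; a common rule of thumb reads
    PSI > 0.2 as material drift. `eps` guards empty bins.
    """
    e_tot, a_tot = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        e_p = max(e / e_tot, eps)
        a_p = max(a / a_tot, eps)
        score += (a_p - e_p) * math.log(a_p / e_p)
    return score
```

For the procurement-agent example, the bins would be approved-vendor counts per request; a PSI near zero means the rolling window matches calibration, and a large PSI flags a shift worth triaging against known model or prompt changes.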

Per-class action error budget. A borrowed SRE primitive applied at the action class layer rather than the request layer. For each class, the deployment defines a target action-correctness rate (typically 95% to 99% depending on blast radius), an error budget that is the inverse, and an exhaustion response in the runbook. The exhaustion response is the discipline-bearing part: when the budget is exhausted inside the rolling window, the documented response is to halt new deployments to that action class until the budget is restored, identical to the SRE practice of halting feature work when an SLO budget is burnt. Without the exhaustion response, the metric is decorative.
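A sketch of the budget arithmetic, with the 99% target and the halt response as illustrative assumptions:

```python
def exhaustion_check(correct, attempted, target=0.99):
    """Per-class action error budget over a rolling window.

    `target` is the action-correctness objective for the class; the
    budget is its complement. Returns the remaining budget fraction and
    whether the documented exhaustion response (halt new deployments to
    the class) should fire. Names and thresholds are illustrative.
    """
    budget = 1.0 - target
    error_rate = 1.0 - correct / attempted
    remaining = budget - error_rate
    return remaining, remaining < 0.0

# e.g. 978 correct of 1000 attempted against a 99% target:
# error rate 2.2% exceeds the 1% budget, so the halt response fires.
```

The return value feeding a runtime gate, rather than a dashboard, is what makes the exhaustion response load-bearing rather than decorative.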

Together the four cover the surface uptime + latency leave open: platform-to-action delivery, detection latency, behavioural stability, and action-correctness. None is sufficient alone; together they form the operational definition of “production-ready” the procurement question demands.

Share thoughts: the four-line SLA template for an agent-vendor contract

Procurement teams negotiating an agent-platform contract in 2026 should require the vendor to support all four metrics in the contract surface. The template below is four lines a procurement team can paste into the standard MSA addendum.

Line 1, telemetry for action-bounded availability. The vendor exposes per-action-class completion telemetry, with action-class taxonomy defined by the customer, at sub-minute granularity, in OpenTelemetry GenAI span format or a documented equivalent. The target is set by the deployment; the vendor commits to the telemetry.

Line 2, signals for MTTD-for-Agents. The vendor exposes the four detection signals (action volume, tool-use distribution, cost-per-action, output distribution) at sub-hourly granularity in a queryable form, with retention sufficient to support the deployment’s incident-response window. The deployment owns its detection target; the vendor commits to the signals.

Line 3, output-distribution baseline notification. The vendor publishes a baseline for output-distribution characteristics on a customer-supplied calibration prompt set, and notifies the customer in advance of any model update that internal evaluation indicates will invalidate the baseline. The deployment owns re-calibration; the vendor commits to notification.

Line 4, per-class error budget framework support. The vendor supports per-action-class error budgeting in the runtime (the ability to halt new deployments to an action class without halting the platform) and documents the exhaustion-response primitives (action-class disable, traffic shaping, rollback to prior agent version). The customer owns the budget definition and runbook; the vendor commits to the runtime primitives.

The template is deliberately not an availability or correctness commitment. A vendor cannot meaningfully sign a 99.9% correctness SLA on a non-deterministic system because correctness is not a property the model alone determines; it is a property of the prompt, the tool surface, the data scope, and the vendor’s model behaviour together. The template commits the vendor to the four telemetry and runtime surfaces the deployment needs to instrument the metrics itself. This is the correct division of responsibility: the platform layer is the vendor’s, the agent layer is the deployment’s, and the SLA is the contract that lets the deployment own its layer without the vendor’s being opaque.

A vendor that meets all four lines is production-ready against this procurement bar. A vendor that meets fewer than three is still useful for proof-of-concept and non-production workloads, but procurement should be explicit in the business case about the metrics the deployment cannot measure under that vendor’s surface and the residual risk that opacity carries.

How the metrics map to existing observability platforms

| Metric | OpenTelemetry input | Datadog | New Relic | Honeycomb | Arize / Galileo / Evidently | Grafana |
| --- | --- | --- | --- | --- | --- | --- |
| Action-bounded availability | GenAI spans + action-class label | LLM Observability custom metrics | AI Monitoring custom dashboards | Trace-derived metrics + boards | Action-trace integration | Prometheus exporter + dashboard |
| MTTD-for-Agents | GenAI spans + tool-call attributes | Anomaly detection on the four signals | Anomaly detection + alert routing | BubbleUp on signal anomalies | Tripwire instrumentation native | Alertmanager on signal queries |
| Output-distribution drift | GenAI output attribute + eval set | Model monitor (preview) | Custom drift dashboard | Derived field on output attribute | Native drift detectors | Statistical-drift panel |
| Per-class action error budget | GenAI span + correctness eval | SLO with custom indicator | SLM with custom queries | SLO dashboard | Eval-derived correctness | SLO panel + Alertmanager |

None of the four metrics requires tooling that does not exist. Each is an instrumentation discipline applied to the existing platform; the precondition is that the agent runtime emits OpenTelemetry GenAI spans and per-action-class labels.

The companion piece on the identity layer at /non-human-identity-ai-agents/ (AM-037) and the action-authority control bundle at /your-ai-agents-just-approved-2-7m-in-vendor-payments-and-other-nightmares-keeping-cisos-awake/ (AM-063) sit one layer below this SLA architecture: identity is the precondition for per-agent action labels; the action-authority controls are the precondition for the per-class blast-radius taxonomy. The three pieces compose into the operational stack a procurement team needs the vendor to support.

Holding-up note

The primary claim of this piece is logged at AM-110 on the Holding-up ledger on a 60-day review cadence. Three kinds of evidence would move the verdict:

  1. A platform-vendor SLA release committing to action-correctness or output-distribution stability as a contractual surface, not just a telemetry one. A commitment from any major vendor would shift the responsibility allocation this piece argues for.
  2. A standards-body publication that defines agent SLA primitives explicitly. NIST AI RMF revisions, ISO/IEC AI quality standards, or an SRE community publication would qualify.
  3. A regulator enforcement action whose in-scope finding was an inadequate agent-layer SLA on a deployment producing harm. EU AI Act Article 9 and Article 17 obligations after 2 Aug 2026 will reveal whether market-surveillance authorities treat the four-metric surface as part of the quality-management evidence base.

Next review: 28 Jun 2026.


Correction log

  1. 29 Apr 2026 — Initial publication. Initial verdict 'Partial': the four metrics are observable from current SRE/OTel practice but had not yet been tested as a procurement bar against 2026 vendor SLAs.

Spotted an error? See corrections policy →

Disagree with this piece?

Reasoned disagreement is a first-class signal here. Every review cycle weighs documented dissent; material dissent becomes part of the article's change history. This is not a corrections form — use /corrections/ for factual errors.

Part of the pillar

Agentic AI governance

Governance frameworks, oversight patterns, and compliance postures for enterprise agentic-AI deployment. 35 other pieces in this pillar.
