Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter. How this works →
AM-110 · published 29 Apr 2026 · revised 29 Apr 2026 · 9 min read · in AI Implementation

Agent SLA architecture: what 'production-ready' actually means for autonomous, non-deterministic actors

Traditional SLAs were drafted against deterministic systems. Autonomous agents produce variable outputs by design. The four metrics that actually work for agents are action-bounded availability, MTTD-for-Agents, output-distribution drift, and per-class action error budget. Vendors that cannot expose these are not yet production-ready.

Holding · reviewed 29 Apr 2026 · next review +60d

A working definition of “production-ready” for autonomous agents is overdue. Enterprises are signing agent-platform contracts in 2026 against SLA language drafted for a class of system the agent does not belong to. The gap produces two failure modes: the contract demands a guarantee the vendor cannot meaningfully sign (99.9% correctness on a non-deterministic system, where correctness is not even defined), or it demands a guarantee the customer cannot meaningfully verify (an SLO on a metric the platform does not expose at the action layer). Both leave the deployment formally covered and operationally undefended.

This piece sets out the four metrics that map onto autonomous agent behaviour the way uptime and latency mapped onto deterministic systems, plus the four-line SLA template that puts those metrics inside an enforceable contract. The detection-time companion (the publication’s MTTD-for-Agents framework) covers the latency-of-detection metric in depth; this piece situates MTTD inside the broader four-metric architecture.

Report: how SRE drafted the metrics that no longer fit

The Google Site Reliability Engineering handbook and the follow-on SRE Workbook are the reference texts for service-level objectives. The SRE definition is precise: a service-level indicator is a measured property of the service, a service-level objective is a target value over a time window, and an error budget is the objective's complement, the fraction of failures the target tolerates. The canonical examples are availability, latency, and error rate.

The pattern works because the underlying system is deterministic. A request that succeeds for one customer succeeds for the next, given the same input and code path. Twenty years of operational experience inside Google and the wider SRE community refined this model to its current shape.

Autonomous agents break the precondition. The OpenTelemetry project recognised this with the GenAI semantic conventions, promoted to stable for a subset of GenAI spans on 13 Mar 2026 after two years of vendor convergence. Anthropic, Microsoft, Google, AWS, Datadog, New Relic, Honeycomb, Arize, and Galileo all participate; the protocol is not contested. What the conventions deliberately do not do is define an SLO. Span shape is necessary for measurement; metric definitions and the contractual surface are downstream questions the procurement and SRE communities still own. The vacuum is what this piece is about.

Observe: the structural reason uptime and p95 latency miss the agent failure mode

Three observations from 2025 and Q1 2026 deployments converge.

Vendor uptime is decoupled from agent correctness. A vendor can hit 99.95% API availability for a quarter while the deployment is producing wrong actions. The Cloud Security Alliance MAESTRO threat-modelling guidance describes the same gap from the security side: “the agent took an action” and “the agent took the right action” are different propositions, and only the first is observable from the request layer. The 88% incident rate across 2026 enterprise agentic deployments documented in the Gravitee State of AI Agent Security report describes deployments where the platform layer was operating well within SLA while the agent layer was producing the incidents.

P95 latency is calibrated to the wrong unit. Request-layer latency captures time to first token or time to a complete response. For an agent that calls four tools in sequence, the meaningful latency is time to action completion across all four plus any human-in-loop confirmation gates. A deployment can hit excellent p95 token latency while action-completion p95 sits hours past budget. The mismatch shows up most often in financial-workflow deployments where the contractually relevant latency is end-to-end approval-to-settlement.

Error rate undercounts agent failure. A request that returns a 200 response and a syntactically valid output is, by the SRE definition, successful. The agent-layer question (was the output the right one) sits outside the indicator. Enterprises report request-layer error rates well under 1% while action-layer error rates against ground truth land in the 5% to 25% range depending on task class.
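The divergence is easy to make concrete. A toy scoring of the same traffic at the request layer and the action layer, with the call outcomes invented purely for illustration:

```python
# Each call scored at two layers. A request that returns 200 with a
# syntactically valid output is a request-layer success even when the
# action it took was wrong against ground truth.
calls = [
    # (http_ok, output_valid, action_correct)
    (True, True, True),
    (True, True, False),   # 200 + valid output, but the wrong action
    (True, True, True),
    (True, True, False),
]

request_errors = sum(1 for ok, valid, _ in calls if not (ok and valid)) / len(calls)
action_errors = sum(1 for *_, right in calls if not right) / len(calls)
# request_errors == 0.0 while action_errors == 0.5
```

The request-layer indicator reports a flawless service while half the actions are wrong, which is exactly the undercounting the paragraph above describes.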

Existing SLO primitives are not wrong; they are incomplete. The platform layer needs them. The agent layer needs four additional primitives the SRE handbook did not have to define.

Reflect: the four metrics that map onto autonomous agent behaviour

Each metric has a precedent inside the SRE, observability, or evaluation communities. The editorial value is the bundling and the discipline that all four ship together.

Action-bounded availability. The fraction of attempted actions inside a defined action class that complete inside the per-class budget, including all tool calls, approval gates, and bounded retries. The unit is the action class, not the request. A class is a customer-defined grouping (procurement-write, knowledge-base-read, financial-system-write, customer-communication) with its own latency budget and success criteria. The metric is computed on top of OpenTelemetry GenAI spans inside the deployment’s observability platform.
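A minimal sketch of the computation in Python. The action-class names and budgets are illustrative assumptions, and a real deployment would derive the records from OpenTelemetry GenAI spans rather than construct them by hand:

```python
from dataclasses import dataclass

# Hypothetical per-class latency budgets in seconds; the taxonomy and
# the numbers are customer-defined, not prescribed by any standard.
BUDGETS_S = {"procurement-write": 300.0, "knowledge-base-read": 30.0}

@dataclass
class ActionRecord:
    action_class: str
    duration_s: float   # tool calls + approval gates + bounded retries
    succeeded: bool

def action_bounded_availability(records, budgets):
    """Fraction of attempted actions per class that completed
    successfully inside that class's latency budget."""
    out = {}
    for cls, budget in budgets.items():
        attempts = [r for r in records if r.action_class == cls]
        if not attempts:
            continue
        ok = sum(1 for r in attempts if r.succeeded and r.duration_s <= budget)
        out[cls] = ok / len(attempts)
    return out
```

Note that an action that succeeds but overruns its budget counts against availability, which is what distinguishes this metric from a plain success rate.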

MTTD-for-Agents. The median hours between an anomalous agent behaviour occurring and the deploying organisation detecting it, with the publication’s targets of 4 hours for large enterprise and 24 hours for mid-market. The four tripwires (scope drift, permission creep, data-exfiltration pattern, cross-agent privilege echo) and the five-phase detection chain are at the MTTD-for-Agents framework page. MTTD pairs with action-bounded availability: the first measures whether the agent is acting inside its envelope, the second how fast the deployment notices when it is not.
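Once anomaly onset and detection timestamps are logged, the metric itself is a one-liner; the incident log below is invented for illustration:

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical incident log: (anomaly_onset, detected_at) pairs.
incidents = [
    (datetime(2026, 4, 1, 9, 0), datetime(2026, 4, 1, 12, 0)),   # 3 h
    (datetime(2026, 4, 3, 2, 0), datetime(2026, 4, 3, 8, 0)),    # 6 h
    (datetime(2026, 4, 7, 14, 0), datetime(2026, 4, 7, 16, 0)),  # 2 h
]

def mttd_hours(incidents):
    """Median hours from anomalous behaviour occurring to detection."""
    return median((detected - onset) / timedelta(hours=1)
                  for onset, detected in incidents)

# mttd_hours(incidents) == 3.0, inside the 4-hour large-enterprise target
```

The hard part is not the arithmetic but the onset timestamp, which only exists if the tripwire instrumentation can reconstruct when the anomalous behaviour began.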

Output-distribution drift. The statistical shift in agent output characteristics over a rolling window, measured against a baseline established during calibration. The dimensions are task-dependent: for a procurement agent, the distribution of approved vendor counts per request; for a research agent, citation counts and source diversity; for a communication agent, message length and tone classifier scores. Drift detection ships in Arize Phoenix, Galileo Evaluate, and the open-source Evidently library, computing population stability index, Jensen-Shannon divergence, or comparable measures. Drift is the leading indicator for silent vendor model updates and for deployment-side prompt or tool-surface changes that produce unintended shifts.
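As one example of the measures named above, population stability index can be computed directly from binned counts. This is a generic PSI sketch, not the specific implementation any of the named tools ship:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population stability index between two binned distributions.

    `expected` are raw counts per bin from the calibration baseline,
    `actual` from the rolling window; a common rule of thumb reads
    PSI > 0.2 as material drift. `eps` guards empty bins.
    """
    e_tot, a_tot = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        e_p = max(e / e_tot, eps)
        a_p = max(a / a_tot, eps)
        score += (a_p - e_p) * math.log(a_p / e_p)
    return score
```

For the procurement-agent example, the bins would be approved-vendor counts per request; a PSI near zero means the rolling window matches calibration, and a large PSI flags a shift worth triaging against known model or prompt changes.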

Per-class action error budget. A borrowed SRE primitive applied at the action class layer rather than the request layer. For each class, the deployment defines a target action-correctness rate (typically 95% to 99% depending on blast radius), an error budget that is the inverse, and an exhaustion response in the runbook. The exhaustion response is the discipline-bearing part: when the budget is exhausted inside the rolling window, the documented response is to halt new deployments to that action class until the budget is restored, identical to the SRE practice of halting feature work when an SLO budget is burnt. Without the exhaustion response, the metric is decorative.
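A sketch of the budget arithmetic, with the 99% target and the halt response as illustrative assumptions:

```python
def exhaustion_check(correct, attempted, target=0.99):
    """Per-class action error budget over a rolling window.

    `target` is the action-correctness objective for the class; the
    budget is its complement. Returns the remaining budget fraction and
    whether the documented exhaustion response (halt new deployments to
    the class) should fire. Names and thresholds are illustrative.
    """
    budget = 1.0 - target
    error_rate = 1.0 - correct / attempted
    remaining = budget - error_rate
    return remaining, remaining < 0.0

# e.g. 978 correct of 1000 attempted against a 99% target:
# error rate 2.2% exceeds the 1% budget, so the halt response fires.
```

The return value feeding a runtime gate, rather than a dashboard, is what makes the exhaustion response load-bearing rather than decorative.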

Together the four cover the surface uptime + latency leave open: platform-to-action delivery, detection latency, behavioural stability, and action-correctness. None is sufficient alone; together they form the operational definition of “production-ready” the procurement question demands.

Share thoughts: the four-line SLA template for an agent-vendor contract

Procurement teams negotiating an agent-platform contract in 2026 should require the vendor to support all four metrics in the contract surface. The template below is four lines a procurement team can paste into the standard MSA addendum.

Line 1, telemetry for action-bounded availability. The vendor exposes per-action-class completion telemetry, with action-class taxonomy defined by the customer, at sub-minute granularity, in OpenTelemetry GenAI span format or a documented equivalent. The target is set by the deployment; the vendor commits to the telemetry.

Line 2, signals for MTTD-for-Agents. The vendor exposes the four detection signals (action volume, tool-use distribution, cost-per-action, output distribution) at sub-hourly granularity in a queryable form, with retention sufficient to support the deployment’s incident-response window. The deployment owns its detection target; the vendor commits to the signals.

Line 3, output-distribution baseline notification. The vendor publishes a baseline for output-distribution characteristics on a customer-supplied calibration prompt set, and notifies the customer in advance of any model update that internal evaluation indicates will invalidate the baseline. The deployment owns re-calibration; the vendor commits to notification.

Line 4, per-class error budget framework support. The vendor supports per-action-class error budgeting in the runtime (the ability to halt new deployments to an action class without halting the platform) and documents the exhaustion-response primitives (action-class disable, traffic shaping, rollback to prior agent version). The customer owns the budget definition and runbook; the vendor commits to the runtime primitives.

The template is deliberately not an availability or correctness commitment. A vendor cannot meaningfully sign a 99.9% correctness SLA on a non-deterministic system because correctness is not a property the model alone determines; it is a property of the prompt, the tool surface, the data scope, and the vendor’s model behaviour together. The template commits the vendor to the four telemetry and runtime surfaces the deployment needs to instrument the metrics itself. This is the correct division of responsibility: the platform layer is the vendor’s, the agent layer is the deployment’s, and the SLA is the contract that lets the deployment own its layer without the vendor’s being opaque.

A vendor that meets all four lines is production-ready against this procurement bar. A vendor that meets fewer than three is still useful for proof-of-concept and non-production workloads, but procurement should be explicit in the business case about the metrics the deployment cannot measure under that vendor’s surface and the residual risk that opacity carries.

How the metrics map to existing observability platforms

| Metric | OpenTelemetry input | Datadog | New Relic | Honeycomb | Arize / Galileo / Evidently | Grafana |
| --- | --- | --- | --- | --- | --- | --- |
| Action-bounded availability | GenAI spans + action-class label | LLM Observability custom metrics | AI Monitoring custom dashboards | Trace-derived metrics + boards | Action-trace integration | Prometheus exporter + dashboard |
| MTTD-for-Agents | GenAI spans + tool-call attributes | Anomaly detection on the four signals | Anomaly detection + alert routing | BubbleUp on signal anomalies | Tripwire instrumentation native | Alertmanager on signal queries |
| Output-distribution drift | GenAI output attribute + eval set | Model monitor (preview) | Custom drift dashboard | Derived field on output attribute | Native drift detectors | Statistical-drift panel |
| Per-class action error budget | GenAI span + correctness eval | SLO with custom indicator | SLM with custom queries | SLO dashboard | Eval-derived correctness | SLO panel + Alertmanager |

None of the four metrics requires tooling that does not exist. Each is an instrumentation discipline applied to the existing platform; the precondition is that the agent runtime emits OpenTelemetry GenAI spans and per-action-class labels.

The companion piece on the identity layer at /non-human-identity-ai-agents/ (AM-037) and the action-authority control bundle at /your-ai-agents-just-approved-2-7m-in-vendor-payments-and-other-nightmares-keeping-cisos-awake/ (AM-063) sit one layer below this SLA architecture: identity is the precondition for per-agent action labels; the action-authority controls are the precondition for the per-class blast-radius taxonomy. The three pieces compose into the operational stack a procurement team needs the vendor to support.

Holding-up note

The primary claim of this piece is logged at AM-110 on the Holding-up ledger on a 60-day review cadence. Three kinds of evidence would move the verdict:

  1. A platform-vendor SLA release committing to action-correctness or output-distribution stability as a contractual surface, not just a telemetry one. A commitment from any major vendor would shift the responsibility allocation this piece argues for.
  2. A standards-body publication that defines agent SLA primitives explicitly. NIST AI RMF revisions, ISO/IEC AI quality standards, or an SRE community publication would qualify.
  3. A regulator enforcement action whose in-scope finding was an inadequate agent-layer SLA on a deployment producing harm. EU AI Act Article 9 and Article 17 obligations after 2 Aug 2026 will reveal whether market-surveillance authorities treat the four-metric surface as part of the quality-management evidence base.

Next review: 28 Jun 2026.


Correction log

  1. 29 Apr 2026 — Initial publication. Initial verdict 'Partial': the four metrics are observable from current SRE/OTel practice but had not yet been tested as a procurement bar against 2026 vendor SLAs.

Spotted an error? See corrections policy →

Disagree with this piece?

Reasoned disagreement is a first-class signal here. Every review cycle weighs documented dissent; material dissent becomes part of the article's change history. This is not a corrections form — use /corrections/ for factual errors.

Part of the pillar

Agentic AI governance

Governance frameworks, oversight patterns, and compliance postures for enterprise agentic-AI deployment. 35 other pieces in this pillar.
