Podcast · Episode 8 · 13:50

Which LLM provider actually stays up?

Twenty-four months of operational record across the five major foundation-model providers. AM-136 documents Anthropic, OpenAI, Google, AWS Bedrock, and Azure OpenAI. Every provider had at least one multi-hour outage that exceeded its own SLA-credit threshold. Single-provider dependency is the 2026 procurement risk; multi-provider routing is the mitigation; LiteLLM, OpenRouter, and Portkey are the gateway-abstraction reference implementations.

Claims walked in this episode
  • AM-136 · Foundation-model uptime in 2026: the 24-month outage record across Anthropic, OpenAI, Google, AWS Bedrock, and Azure OpenAI (Holding)

ABBY

This is Agent Mode AI. I'm Abby. Today we're walking AM-136. Twenty-four months of operational record across the five major foundation-model providers. Anthropic, OpenAI, Google, AWS Bedrock, Azure OpenAI. The procurement question the record answers is whether single-provider dependency is a viable architecture for production agentic AI in 2026.

AVERY

I'm Avery. Frame the answer first.

ABBY

The procurement-defensible answer is no. Across the twenty-four-month window May 2024 to April 2026, every major provider experienced at least one multi-hour outage that exceeded the SLA-credit threshold defined in its published terms. Multi-provider routing with documented failover and hard-dollar incident liability above the standard SLA-credit cap is the mitigation. Single-provider preferred-vendor architecture is not the procurement-defensible default in 2026.

AVERY

Why the status pages do not tell procurement what they need to know.

ABBY

A vendor status page reports the operational state of the API gateway and the regional infrastructure surrounding the model. It does not report the operational state of the specific model the customer integrated against. A customer running production traffic through Claude 3.7 Sonnet during a Sonnet-specific degradation can be told by the status page that the API is operational while their production agent is returning errors or degraded responses. This is not a vendor failure. The status page reports what it reports. The procurement gap is that the customer's operational metric is not the gateway's uptime; it is the specific model's uptime against the specific prompts the customer's workload sends.

AVERY

So the customer's own observability is the leading indicator.

ABBY

The customer's own observability is the leading indicator. Per-model latency, per-model error rate, per-model degradation patterns, all tracked separately from the gateway's published metrics. The agent observability comparison at AM-123 walks the four-platform decision across Langfuse, Arize, Helicone, and LangSmith. The relevant procurement signal there is that the observability platform must capture per-model signals, not just gateway signals. A customer that relies on the vendor's status page as its operational signal has no leading indicator of model-specific issues until production traffic itself reveals them.
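As a rough illustration of what "per-model signals" can mean in practice, the sketch below keeps a rolling window of recent calls per model identifier and derives an error rate and mean latency from it. The class name, the window size, and the five-percent degradation threshold are all assumptions made for the example; they are not taken from AM-123 or from any of the four platforms named above.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class CallRecord:
    latency_ms: float
    ok: bool


class PerModelMonitor:
    """Rolling per-model error rate and latency, kept apart from gateway metrics.

    Hypothetical helper for illustration only.
    """

    def __init__(self, window: int = 100):
        # One bounded window of recent calls per model identifier.
        self._calls = defaultdict(lambda: deque(maxlen=window))

    def record(self, model: str, latency_ms: float, ok: bool) -> None:
        self._calls[model].append(CallRecord(latency_ms, ok))

    def error_rate(self, model: str) -> float:
        calls = self._calls[model]
        if not calls:
            return 0.0
        return sum(1 for c in calls if not c.ok) / len(calls)

    def mean_latency_ms(self, model: str) -> float:
        calls = self._calls[model]
        if not calls:
            return 0.0
        return sum(c.latency_ms for c in calls) / len(calls)

    def degraded(self, model: str, max_error_rate: float = 0.05) -> bool:
        # The threshold is an assumed value; tune it to the workload.
        return self.error_rate(model) > max_error_rate
```

The point of keying everything by model identifier is that one model can be flagged as degraded while another model behind the same gateway stays healthy, which is exactly the case a status page does not surface.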

AVERY

The SLA-credit gap.

ABBY

The standard published SLA across major foundation-model providers commits to ninety-nine-point-nine percent monthly availability with credits capped at twenty-five to fifty percent of monthly fees on the affected service. The arithmetic on ninety-nine-point-nine is roughly forty-three minutes of allowed monthly downtime. The arithmetic on the credit cap, for an enterprise paying an indicative fifty thousand dollars per month against a single provider, is a maximum of twenty-five thousand returned in the worst month, and that is at the fifty-percent end of the cap range. That is a fraction of the operational impact a multi-hour outage produces on a customer-facing workload. Those figures are illustrative, not surveyed.
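The arithmetic in this answer can be reproduced in a few lines. This is a minimal sketch that assumes a thirty-day month; the function names are invented for the example, and the fee and cap values are the illustrative figures quoted above, not surveyed data.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime a monthly availability target permits."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


def max_credit(monthly_fee: float, cap_fraction: float) -> float:
    """Largest service credit payable when credits are capped at a share of fees."""
    return monthly_fee * cap_fraction


budget = allowed_downtime_minutes(99.9)   # about 43 minutes in a 30-day month
credit_low = max_credit(50_000, 0.25)     # cap at 25 percent of monthly fees
credit_high = max_credit(50_000, 0.50)    # cap at 50 percent of monthly fees

print(f"Downtime budget at 99.9%: {budget:.1f} minutes")
print(f"Maximum credit range: {credit_low:,.0f} to {credit_high:,.0f} USD")
```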

AVERY

What do enterprise workloads actually need?

ABBY

Customer-facing agentic deployments typically need ninety-nine-point-nine-five availability with hard-dollar incident liability above the SLA-credit cap, not service credits as the only remedy. The point-zero-five gap is roughly twenty-two minutes per month: ninety-nine-point-nine tolerates about forty-three minutes of monthly downtime, ninety-nine-point-nine-five only about twenty-two. That is material on a workflow where every minute of downtime translates to a measurable customer impact. Negotiating up from ninety-nine-point-nine to ninety-nine-point-nine-five is contractually possible at the Enterprise tier of every major provider. The negotiation requires three procurement instruments: a documented business impact analysis showing per-minute cost of downtime, a named incident-severity-tier framework with response-time commitments, and an explicit hard-dollar liability ceiling that exceeds the SLA-credit cap.
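One way to ground the business impact analysis is to compare what an outage costs with what the SLA credit can return. The sketch below does that under loudly stated assumptions: the per-minute downtime cost of 1,000 USD and the 180-minute outage are hypothetical inputs chosen for the example, and the 25,000 USD credit cap is the illustrative figure used earlier in the episode.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def downtime_budget(availability_pct: float) -> float:
    """Minutes of monthly downtime an availability target tolerates."""
    return MINUTES_PER_MONTH * (100 - availability_pct) / 100


def liability_multiplier(outage_minutes: float,
                         cost_per_minute: float,
                         credit_cap: float) -> float:
    """How many times the SLA-credit cap a single outage costs the customer."""
    return outage_minutes * cost_per_minute / credit_cap


# Downtime the standard SLA tolerates beyond what a 99.95 target would.
gap_minutes = downtime_budget(99.9) - downtime_budget(99.95)

# Hypothetical inputs: a 3-hour outage at 1,000 USD per minute,
# measured against the illustrative 25,000 USD credit cap.
multiplier = liability_multiplier(180, 1_000, 25_000)

print(f"Gap between 99.9% and 99.95%: {gap_minutes:.1f} minutes per month")
print(f"Outage cost as a multiple of the credit cap: {multiplier:.1f}x")
```

A multiplier well above one is the number that supports asking for a hard-dollar liability ceiling above the credit cap; the real inputs have to come from the customer's own impact analysis.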

AVERY

Walk the five providers briefly.

ABBY

Anthropic publishes incidents at status.anthropic.com. The operational pattern across 2024-2026 is a small number of multi-hour incidents per year, typically affecting a specific model version or region, with detailed postmortems published within seven to fourteen days. OpenAI publishes at status.openai.com. The operational pattern includes the broader incident class of credential-system or quota-system failures that affect the entire API surface, alongside model-specific incidents. Credential-system incidents are particularly procurement-relevant because they cannot be mitigated by switching models within the same provider.

AVERY

Google.

ABBY

Google Gemini API publishes at status.cloud.google.com within the broader Google Cloud Platform status surface. The operational pattern inherits the GCP regional fault-domain structure. Gemini-specific incidents are reported but require the customer to filter the broader status feed.

AVERY

The hyperscalers.

ABBY

AWS Bedrock publishes at the AWS Service Health Dashboard. Operational pattern inherits AWS regional infrastructure plus Bedrock-specific model availability across regions. Bedrock customers benefit from cross-region inference patterns that mitigate single-region failures within the provider but do not mitigate Bedrock-wide incidents. Azure OpenAI publishes at the Azure Status Dashboard. Operational pattern inherits Azure regional infrastructure plus the Azure-OpenAI integration layer. Azure OpenAI customers can deploy across multiple Azure regions for failover; that pattern mitigates regional incidents but does not mitigate the OpenAI-side issues that propagate through the Azure integration.

AVERY

The procurement-relevant read across all five.

ABBY

No provider is a clean choice for single-provider production deployment in 2026. The architecture has to assume failure across the provider graph regardless of which provider is chosen. The relative ranking depends on which workload class the customer cares about. Hyperscaler-backed offerings inherit the underlying cloud's regional fault domains plus model-specific failure modes. Frontier-lab APIs have a shorter operational history but more direct upgrade-path optionality on model versions. Both classes have material incidents in the twenty-four-month window.

AVERY

Three architectural patterns for multi-provider routing.

ABBY

Pattern one. Gateway abstraction. A gateway sits between the application and the foundation-model providers and routes traffic based on configurable rules. LiteLLM is the open-source reference implementation. OpenRouter and Portkey are commercial alternatives. The gateway provides a unified API surface that the application integrates against, with the underlying provider selection happening at routing time. Failover is observable, configurable per route, and testable in production. The procurement-defensible benefit is that the customer owns the routing logic and is not dependent on the vendor for failover capability.
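The routing idea behind pattern one can be shown without any particular gateway product. The sketch below is plain Python, not the LiteLLM, OpenRouter, or Portkey API: the provider callables, the exception types, and the function name are all hypothetical stand-ins. It tries each provider on a route in order and falls through to the next one on failure.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is modelled as a callable that takes a prompt and returns text.
Provider = Callable[[str], str]


class ProviderError(Exception):
    """Raised by a provider callable when a request fails."""


class AllProvidersFailed(Exception):
    """Raised when every provider on the route has failed."""


def route_with_failover(prompt: str,
                        route: Sequence[Tuple[str, Provider]]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    failures: List[str] = []
    for name, call in route:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Record the failure so the failover stays observable.
            failures.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(failures))


# Stub providers standing in for real SDK calls.
def primary(prompt: str) -> str:
    raise ProviderError("503 from primary")


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


served_by, answer = route_with_failover(
    "hello", [("primary", primary), ("secondary", secondary)]
)
print(served_by, answer)
```

Keeping the route as an ordered list that the customer owns is what makes failover configurable per route and testable: a test can inject a failing stub, as above, and assert which provider served the request.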

AVERY

The cost of pattern one.

ABBY

The gateway becomes part of the customer's operational substrate. It brings its own observability requirements, its own latency overhead, typically fifty to one hundred and fifty milliseconds added per request, and its own change-management discipline. Enterprises running this pattern typically deploy the gateway in their own VPC with redundancy across availability zones, treating it as production-critical infrastructure equivalent to an API gateway or service mesh.

AVERY

Pattern two.

ABBY

Provider-side regional failover. Within a single vendor, the customer deploys across multiple regions and handles failover at the regional level. AWS Bedrock cross-region inference and Azure OpenAI multi-region deployment are the two production-grade implementations. Partial mitigation. It addresses regional failures within the vendor's fault domain but does not address vendor-wide incidents. A Bedrock-wide outage affects all regional deployments simultaneously. Appropriate for workloads where vendor-wide incidents are tolerable but regional incidents are not.

AVERY

Pattern three.

ABBY

Explicit multi-provider provisioning at the application layer. The application is built to support two or more model families with prompt-tested compatibility maintained as a deployment requirement. The customer maintains active provisioning with a primary and secondary provider; failover is application-driven. Most expensive to maintain because every prompt change requires testing across the supported providers. Most resilient because it shares no failure mode with a gateway or with any single vendor's regional architecture. Appropriate for high-availability customer-facing workloads where the per-incident cost exceeds the maintenance overhead of multi-provider support.
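"Prompt-tested compatibility" can be pictured as a regression check that runs the same prompt through every supported provider and applies the same acceptance test to each response. The sketch below is a minimal version under assumptions: the provider stubs, the JSON acceptance check, and the function names are hypothetical, and a real suite would cover many prompts and call real SDKs.

```python
import json
from typing import Callable, Dict, List

Provider = Callable[[str], str]
Check = Callable[[str], bool]


def returns_json_with_status(text: str) -> bool:
    """Example acceptance check: the response parses as JSON with a 'status' key."""
    try:
        return "status" in json.loads(text)
    except (ValueError, TypeError):
        return False


def compatibility_report(prompt: str, check: Check,
                         providers: Dict[str, Provider]) -> Dict[str, bool]:
    """Run one prompt against every provider and apply the same check to each."""
    report: Dict[str, bool] = {}
    for name, call in providers.items():
        try:
            report[name] = check(call(prompt))
        except Exception:
            # A provider that errors out counts as incompatible for this prompt.
            report[name] = False
    return report


def failing_providers(report: Dict[str, bool]) -> List[str]:
    return sorted(name for name, ok in report.items() if not ok)


# Stub providers: one follows the output contract, one drifts from it.
stubs: Dict[str, Provider] = {
    "primary": lambda prompt: '{"status": "ok"}',
    "secondary": lambda prompt: "Sure! The status is ok.",
}

report = compatibility_report("Return the status as JSON.",
                              returns_json_with_status, stubs)
print(failing_providers(report))
```

Running a check like this on every prompt change is the maintenance cost the pattern carries; a non-empty failure list blocks the deployment until the secondary provider is compatible again.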

AVERY

The default for 2026 enterprise procurement.

ABBY

Pattern one, gateway abstraction, is the procurement default for most enterprise workloads in 2026. Pattern three is the higher-rigour option for the workloads that justify it. Pattern two is a partial mitigation appropriate for narrower workload classes than most enterprises initially assume.

AVERY

Three contract language additions for the 2026 AI MSA red-team checklist.

ABBY

First, hard-dollar incident liability above the SLA-credit cap. Named incident severity tiers, typically Sev-1 customer-impacting, Sev-2 degraded service, Sev-3 minor impact, Sev-4 informational, with per-tier response-time commitments and an explicit hard-dollar liability ceiling that exceeds the SLA-credit cap by a multiplier reflecting the customer's per-minute downtime cost. Achievable at the Enterprise tier of every major provider.

AVERY

Second.

ABBY

Non-degradation clauses covering model-deprecation events. The vendor's right to deprecate or update models is the customer's procurement risk. The 2026 standard is a contractually defined transition window, typically ninety days for major version changes, thirty days for minor, during which the customer's prior model version remains available alongside the new version. The clause is required because forced cutover with no transition window produces a regression-test sprint the customer's engineering organisation cannot consistently absorb on the vendor's schedule.

AVERY

Third.

ABBY

Right to multi-provider routing without contract penalty. Some 2024-vintage MSAs included vendor-exclusivity clauses that prohibited the customer from routing identical traffic through a competitor's model. Those clauses do not survive 2026 procurement diligence because they make pattern one and pattern three contractually infeasible. The procurement-defensible language explicitly preserves the customer's right to deploy multi-provider architectures and prohibits the vendor from using identical-traffic-routing as a basis for contract enforcement action.

AVERY

What does the AM-136 verdict look like?

ABBY

AM-136 is Holding on a thirty-day cadence. The reliability record is the kind of evidence that ages monthly, not quarterly. The procurement-defensible posture has to refresh against the most recent operational data the providers publish. Three triggers would shift the analysis. A foundation-model provider publishing a sustained ninety-nine-point-nine-nine operational record across twelve consecutive months would close the gap between the standard ninety-nine-point-nine and the enterprise-required ninety-nine-point-nine-five. A regulatory development requiring multi-provider provisioning for high-risk AI deployments under the EU AI Act would shift the procurement question from architectural choice to compliance requirement. A landmark vendor outage producing material customer harm and follow-on litigation would shift the SLA-credit-versus-hard-dollar-liability negotiation.

AVERY

The honest read on what to do Monday.

ABBY

Three actions. First, audit the current foundation-model architecture against the three patterns. If the deployment is single-provider with no routing layer, prioritise pattern one in the next quarterly engineering plan. Second, audit the current vendor MSA against the three contract additions. The renewal window before 2 August 2026 is the natural moment to negotiate them. Third, instrument the customer's own observability stack against per-model signals, not just gateway signals. The lead time on detecting model-specific degradation is the difference between a bounded incident and a customer-impact incident.

AVERY

Final word.

ABBY

AM-136 and the five vendor status pages are linked at agentmodeai dot com slash holding. The Sunday brief ships every week with what moved on the ledger.

AVERY

Holding-up. See you next Sunday.
