Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter.
AM-123 · published 3 May 2026 · revised 3 May 2026 · 15 min read · AI Implementation

Agent observability in 2026: Langfuse, Arize, Helicone, and LangSmith — and the procurement decision that is not the eval decision

Evaluation tells you whether the agent is right. Observability tells you what the agent did. Production deployments need both, the procurement decisions are different, and conflating them produces SLA architecture that fails its first incident. The four credible 2026 observability platforms (Langfuse, Arize, Helicone, LangSmith) split cleanly on one structural axis: open-source-first vs SaaS-first. Helicone has just gone into maintenance mode.

Holding · reviewed 3 May 2026 · next review +59d

The companion piece to this one (on agent evaluation frameworks) closes with the structural lesson that evaluation and observability are different products with different procurement decisions. This piece walks the observability side of that split. The eval/observability split is not a vendor-marketing convenience. It is the load-bearing distinction that determines whether a production agent’s first incident produces a defensible audit trail or a procurement post-mortem.

Most enterprise IT leaders read the LangChain State of AI Agents 2025 finding that 89 percent of organisations have implemented some form of observability and conclude the category is solved. The harder question, and the one this piece tries to answer, is which observability the 89 percent have actually implemented and whether it answers the question their first production incident is going to ask.

The standard 2024 observability question is “what did the application do.” The 2026 agent-observability question is the same question with three additional layers underneath it. What did the LLM call (and at what cost). What tools did the agent call (and in what order, and for what input). What did the orchestration framework decide (and against what state). Standard application-monitoring stacks (Datadog, Splunk, Dynatrace) capture the outer layer well. The three inner layers are what the four 2026 specialist platforms were built for, and they answer differently because they assume different things about who owns the audit substrate.
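To make the three inner layers concrete, here is a minimal sketch of an agent trace as nested OpenTelemetry spans. The OTel SDK calls are real; the span names, attribute values, and model name are illustrative, and the gen_ai.* attributes follow the development-status GenAI conventions discussed later in this piece.

```python
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")

# Outer layer: the request a Datadog-class APM already sees.
with tracer.start_as_current_span("handle_customer_request"):
    # Inner layer: what the orchestration framework decided, against what state.
    with tracer.start_as_current_span("agent.plan") as plan_span:
        plan_span.set_attribute("agent.state", "needs_order_lookup")
    # Inner layer: which tool the agent called, in what order, with what input.
    with tracer.start_as_current_span("tool.order_api") as tool_span:
        tool_span.set_attribute("tool.input", '{"order_id": "A-1001"}')
    # Inner layer: the LLM call, with model identity and cost-relevant token counts.
    with tracer.start_as_current_span("llm.chat") as llm_span:
        llm_span.set_attribute("gen_ai.request.model", "claude-sonnet-4")
        llm_span.set_attribute("gen_ai.usage.input_tokens", 1200)
        llm_span.set_attribute("gen_ai.usage.output_tokens", 180)
```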

Why the eval/observability split matters

Evaluation answers “is the agent’s output right.” It is run against a fixed dataset, on a fixed cadence, with a fixed scoring rubric, and the result is a number you can regression-test. The eval substrate is engineering’s job because the question is structurally a test-suite question.

Observability answers “what did the agent do, in what order, with what cost, and what did each step return.” It is run continuously, on production traffic, with no fixed rubric, and the result is a stream of traces an SRE can query during an incident. The observability substrate is shared between engineering, SRE, and finance because the question is structurally a runtime question.
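A schematic contrast makes the structural difference visible. Every helper name below is hypothetical; this sketches the shape of the two questions, not any platform’s API.

```python
# Hypothetical helpers throughout -- a shape sketch, not a real SDK.

# Evaluation: fixed dataset, fixed rubric, a number you can regression-test.
def run_eval(agent, dataset, rubric) -> float:
    scores = [rubric.score(agent.run(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores)  # e.g. assert run_eval(...) >= 0.85 in CI

# Observability: continuous production traffic, no rubric, a queryable stream.
def incident_query(trace_store):
    return trace_store.query(
        "SELECT trace_id, tool_calls, cost_usd FROM traces "
        "WHERE status = 'error' AND ts > now() - interval '1 hour'"
    )
```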

The two share infrastructure. Both ingest spans. Both store traces. Both expose dashboards. The conflation that breaks production deployments is the assumption that one tool can do both well at the price tier the buyer is willing to underwrite. The truth is that every one of the four platforms in this piece does some subset of evaluation, and every one of the four eval platforms in the companion piece does some subset of observability. Buying the wrong product for either job is what produces SLA architecture that fails its first incident.

The structural rule for buyers in 2026 is that the procurement decision is two decisions, not one. The platform that answers “is my agent right” is rarely the same platform that answers “what did my agent do at 3:14 AM when the customer’s API rate-limited.” A buyer that picks one and tries to extend it into the other typically ends up with a feature-matrix winner that does the wrong job at scale.

The four 2026 platforms

Langfuse, the open-source observability anchor

Langfuse (at langfuse.com and github.com/langfuse/langfuse) describes itself as “an open source LLM engineering platform” with “LLM Observability, metrics, evals, prompt management, playground, datasets.” The repository shows v3.172.1 as of 1 May 2026, 26.5k stars, 2.7k forks, MIT-licensed (with enterprise features in the /ee folder under separate terms). Self-hosting is documented for Docker Compose, Kubernetes with Helm, and Terraform templates for AWS, Azure, and GCP.

Pricing on the public pricing page is structured in four tiers. Hobby at $0/month with 50k units per month included, 30 days of data access, two users, and community support. Core at $29/month with 100k units per month, 90 days of data access, unlimited users, and in-app support. Pro at $199/month with the same 100k units, three years of data access, unlimited annotation queues, high rate limits, and SOC2 plus ISO27001 reports. Enterprise at $2,499/month with audit logs, SCIM API, and a dedicated support engineer. Overage pricing is graduated: $8/100k units up to 1M, $7/100k units 1M to 10M, $6.50/100k units 10M to 50M, $6/100k units beyond 50M.
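The graduated tiers compound in a way flat $-per-month comparisons hide. A back-of-envelope sketch of the overage maths using the tiers quoted above (illustrative only; it ignores the plan’s included units and retention overage, and the live pricing page governs):

```python
# Langfuse Cloud graduated overage, per the published tiers quoted above.
# Each tier is (units up to this ceiling, $ per 100k units in the tier).
TIERS = [
    (1_000_000, 8.00),      # first 1M units
    (10_000_000, 7.00),     # 1M -> 10M
    (50_000_000, 6.50),     # 10M -> 50M
    (float("inf"), 6.00),   # beyond 50M
]

def monthly_overage(units: int) -> float:
    cost, floor = 0.0, 0
    for ceiling, rate_per_100k in TIERS:
        in_tier = max(0, min(units, ceiling) - floor)
        cost += in_tier / 100_000 * rate_per_100k
        floor = ceiling
        if units <= ceiling:
            break
    return cost

print(monthly_overage(50_000_000))  # ~$3,310/month, ~ $40k/year before the base plan
```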

Data residency is the procurement-relevant differentiator: “Data is stored in the US, EU, Japan, or the HIPAA-compliant US region depending on your selection.” That is the broadest residency posture of the four platforms in this piece.

Where Langfuse fits structurally: organisations that want an open-source-first observability stack with a well-documented self-host upgrade path, backed by visible project traction on GitHub (26.5k stars, 92 watchers, two and a half years of release activity). The MIT licence covers the core. The enterprise features sit behind a separate licence in the /ee directory, which is editorially material: a buyer self-hosting Langfuse is on the MIT core; the SCIM API and audit-log features require either Cloud Enterprise or an enterprise self-host licence agreement.

Arize, the platform with the open-source / commercial split

Arize is the only one of the four platforms in this piece that ships two distinct products. Phoenix is the open-source observability framework, free, self-hosted, with user-managed resources. AX is the commercial platform, SaaS-first, with a free tier and paid tiers on top.

Arize AX is described in its own documentation as an “AI Engineering Platform” that lets engineers and product managers “observe, improve, and evaluate their AI agents and AI applications with confidence.” The platform organises functionality around four pillars: Observe (tracing across 30+ providers and frameworks, AI-powered search to identify problematic traces), Evaluate (continuous assessments against production traces, output guardrails), Improve (test datasets, structured experiment runs, CI/CD integration, versioned prompts), and Alyx (an AI assistant integrated throughout the platform for debugging and dashboard creation without query languages).

Pricing is structured at four tiers. Phoenix at $0 (open-source, self-host, user-managed). AX Free at $0 with 25k spans per month, 1 GB per month ingestion, 15-day retention. AX Pro at $50/month with 50k spans per month, 10 GB per month ingestion, 30-day retention. AX Enterprise at custom pricing with custom limits, dedicated support, uptime SLA, SOC2 reports plus HIPAA, training sessions, and the “adb Data Fabric.” Enterprise deployments offer geographic flexibility: “US or EU or CA” data regions, with data residency and multi-region deployments as self-hosting add-ons.

Where Arize fits structurally: organisations that want the option to start free on Phoenix (open-source, self-host) and migrate to AX (commercial, SaaS or self-host) as scale demands. The dual-product structure is unusual in the category and is editorially the most flexible procurement path of the four. The flip side is that Phoenix and AX are not feature-parity products; the migration is a real engineering pass, not a runtime flip.
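The “start free on Phoenix” path is, in practice, a pip install. A minimal sketch, assuming current arize-phoenix packaging; treat the exact module paths and helper signatures as assumptions to verify against the live docs.

```python
# pip install arize-phoenix
import phoenix as px

# Launch a local, self-hosted Phoenix instance (user-managed, no SaaS account).
session = px.launch_app()
print(session.url)  # local UI for traces, datasets, and evals

# Instrument the app to send OTel traces to the local instance; phoenix.otel.register
# is the documented entry point at time of writing (assumption: verify signature).
from phoenix.otel import register
tracer_provider = register(project_name="agent-pilot")
```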

Helicone, in maintenance mode since March 2026

Helicone (at helicone.ai) describes itself as an “open-source LLM observability and monitoring platform.” The architecture is proxy-based: API calls to OpenAI, Anthropic, Azure, LiteLLM, Anyscale, Together AI, OpenRouter, and others are routed through a Helicone proxy that captures the call metadata, cost, latency, and response.
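Proxy-based means the integration is a base-URL swap rather than a tracing SDK. A minimal sketch against the OpenAI Python client; the gateway URL and header name follow Helicone’s commonly documented pattern and should be verified against its docs.

```python
# pip install openai
import os
from openai import OpenAI

# Route OpenAI traffic through the Helicone proxy: one base-URL change,
# one auth header. No instrumentation code in the application itself.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # assumption: confirm against Helicone docs
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
# Helicone captures cost, latency, and response metadata at the proxy hop.
```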

Pricing sits at four tiers: Hobby at $0 with 10,000 free requests, 1 GB storage, one seat, one organisation; Pro at $79/month with unlimited seats, alerts, reports, the HQL query language, and 10k free requests plus usage-based charges; Team at $799/month (marked “Best Value”) with five organisations, SOC-2 and HIPAA compliance, dedicated Slack channel support; Enterprise at custom pricing with custom MSA, SAML SSO, on-prem deployment, and bulk discounts.

The procurement-material fact for any 2026 buyer is the maintenance-mode status. Helicone announced on 3 March 2026 that the founding team was joining Mintlify. Verbatim from the announcement: “Helicone’s services will remain live for the foreseeable future in maintenance mode. This means security updates, new models, bug & performance fixes all keep shipping.” The founders’ rationale named Mintlify’s “exceptional product-market fit” and the thesis that “the value of up-to-date knowledge will be massive in the agentic future.” There is no specific migration timeline or product discontinuation date in the announcement.

The procurement implication is unambiguous. A 2026 buyer should not select Helicone for greenfield deployments. Existing Helicone customers face an open-ended maintenance horizon and should plan for migration to one of the actively developed alternatives within the next 6 to 12 months. The product is not abandoned, but it is not under active product development, and the absence of a published end-of-life date is itself a procurement risk a 2026 buyer should not absorb.

LangSmith, the LangChain-stack-native observability + eval bundle

LangSmith was covered as a procurement option in the eval companion piece. The observability cut is parallel: framework-agnostic at the API level, framework-native (LangChain, LangGraph) at the integration density that actually matters in production. Pricing sits at three tiers with the same structure described in the eval piece: Developer at $0 per seat with up to 5k base traces per month, Plus at $39 per seat per month with up to 10k base traces, Enterprise custom-priced with hybrid deployment, custom SSO and RBAC, and an SLA. Deployment costs are itemised separately at $0.0007/min for dev and $0.0036/min for production deployments.

The observability-specific strengths are the LangGraph-native trace model (each LangGraph node is a traceable span without bespoke instrumentation), the visual agent-design surfaces (Fleet, Studio), and the compliance posture (SOC 2 Type 2, HIPAA, GDPR, US or EU cloud regions on Plus, hybrid and self-host on Enterprise).

Where LangSmith fits structurally: organisations that have committed to the LangChain stack and want the observability and eval surfaces to share one vendor relationship. The same lock-in caveat applies: the framework-agnostic SDK exists and works, but the integration density with LangGraph is the load-bearing reason teams pick LangSmith over the framework-agnostic alternatives.
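The framework-agnostic path is a decorator-based SDK. A minimal sketch using the langsmith package’s @traceable decorator (LangGraph users get the equivalent span tree without the decorators); the env-var names reflect the documented setup at time of writing.

```python
# pip install langsmith
# export LANGCHAIN_TRACING_V2=true   (newer SDKs also accept LANGSMITH_TRACING)
# export LANGSMITH_API_KEY=...
from langsmith import traceable

@traceable(run_type="tool")
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

@traceable  # parent run; nested calls appear as child spans in the trace tree
def handle_request(question: str) -> str:
    order = lookup_order("A-1001")
    return f"Order {order['order_id']} is {order['status']}"

handle_request("Where is my order?")
```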

OpenTelemetry GenAI as the standardisation pressure

The structural shift the category is in the middle of, and the one most procurement teams have not yet absorbed, is that observability for AI agents is moving to a standard. The OpenTelemetry GenAI semantic conventions define how generative-AI operations are instrumented and observed across model spans, agent spans, events for inputs and outputs, exceptions, and metrics. Platform-specific conventions exist for Anthropic, Azure AI Inference, AWS Bedrock, OpenAI, and the Model Context Protocol.

The status is the editorially material fact. The conventions are in “Development” status, not yet stable. Backward compatibility is enforced for existing implementations: organisations using v1.36.0 or earlier “SHOULD NOT change the version of the GenAI conventions that they emit by default.” Adoption of newer experimental versions requires setting OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental.
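What the opt-in looks like in practice: the flag is read by instrumentation libraries at startup, so it must be set before they initialise. A sketch; the attribute names are the development-status GenAI conventions and may change before stabilisation.

```python
import os

# Must be set before any GenAI instrumentation initialises; otherwise the
# library keeps emitting the conventions version it already defaults to.
os.environ["OTEL_SEMCONV_STABILITY_OPT_IN"] = "gen_ai_latest_experimental"

# Development-status attribute names from the GenAI semantic conventions:
SPAN_ATTRS = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 180,
}
```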

The procurement implication is that picking a 2026 observability platform requires asking whether the platform implements the OpenTelemetry GenAI semantic conventions today, whether at the development or stable cut, and what the platform’s commitment is to following the spec when it stabilises. Langfuse documents OpenTelemetry compatibility at the integration level. Arize AX is OpenTelemetry-compatible at the trace ingestion layer. LangSmith accepts OpenTelemetry traces. Helicone’s proxy architecture predates the GenAI conventions and is not built around them.

A buyer that does not ask the OTel-compliance question signs a vendor lock-in to a non-standard trace model. When the spec stabilises (likely 2026 or 2027), that buyer faces a re-instrumentation pass that is materially more expensive than the upfront cost of picking an OTel-compliant platform from the start.

The capability matrix

Pricing rows below cite the public pages: Langfuse, Arize, Helicone, LangSmith. Verified 3 May 2026.

| Dimension | Langfuse | Arize (AX + Phoenix) | Helicone | LangSmith |
|---|---|---|---|---|
| Posture | Open source MIT (core); enterprise features in /ee | Open source (Phoenix); SaaS commercial (AX) | Open source proxy; SaaS commercial | SaaS-first; Enterprise self-host |
| Active development | Yes (v3.172.1, 1 May 2026) | Yes (Phoenix + AX both shipping) | Maintenance mode since 3 Mar 2026 | Yes |
| Free tier | Hobby $0, 50k units/mo, 30d retention | Phoenix free OSS; AX Free $0, 25k spans/mo | Hobby $0, 10k requests | Developer $0, 5k traces/mo |
| Mid tier | Core $29/mo or Pro $199/mo | AX Pro $50/mo, 50k spans/mo | Pro $79/mo or Team $799/mo | Plus $39/seat/mo |
| Enterprise tier | $2,499/mo, audit logs, SCIM, dedicated SE | Custom, SaaS or self-host, SOC2 + HIPAA | Custom, on-prem, SAML SSO | Custom, hybrid or self-host, SLA |
| Self-host | Docker, Kubernetes/Helm, Terraform AWS/Azure/GCP | Phoenix free; AX Enterprise self-host | Enterprise on-prem | Enterprise self-host |
| Trace model | Spans + sessions + traces | OTel-native + 30+ providers | Proxy intercepts + replays | LangGraph-native + framework-agnostic SDK |
| Cost tracking | Per-call cost, model price book | Per-span cost in AX | Per-request cost (proxy-native) | Per-trace cost |
| Prompt management | Versioned prompts | Versioned prompts in AX | Prompt registry | Prompt Hub |
| Drift detection | Eval cadence + score deltas | Continuous evaluation, alerts | Limited (proxy logs) | Eval cadence + experiment compare |
| OTel GenAI compliance | Compatible | OTel-native | Pre-OTel proxy | Compatible (accepts OTel traces) |
| Compliance posture | SOC2 + ISO27001 (Pro+); HIPAA at Cloud Enterprise | SOC2 + HIPAA (AX Enterprise) | SOC-2 + HIPAA (Team+) | SOC 2 Type 2, HIPAA, GDPR (Plus+) |
| Data residency | US, EU, Japan, HIPAA-US region | US, EU, CA (Enterprise); residency add-ons | Not detailed publicly | US or EU (Plus); hybrid/self-host (Enterprise) |

The matrix is informative but it is not the procurement decision. The procurement decision is one structural axis: open-source-first vs SaaS-first, with maintenance-mode vendors disqualified for greenfield.

The procurement decision that is not the eval decision

Three buyer cohorts emerge from the four platforms.

Open-source-first buyers with engineering capacity to self-host should default to Langfuse for greenfield deployments. Langfuse’s MIT core is the broadest open-source posture, its self-host documentation is the most mature, and its data-residency story (US, EU, Japan, HIPAA-US) is the strongest. Phoenix is the right choice instead when the team expects to migrate to Arize AX commercial as scale demands: it is the no-cost entry point onto that commercial path.

SaaS-first buyers committed to the LangChain stack should default to LangSmith for the LangGraph-native trace model and the bundled-eval surface; the lock-in is real and is reasonably named in procurement diligence. SaaS-first buyers not committed to LangChain should evaluate Arize AX for the broadest provider integration coverage (30+ providers) and the AI assistant (Alyx) that the framework-agnostic UX has been built around.

Existing Helicone customers should plan a migration within the next 6 to 12 months. The maintenance-mode status is not a sunset announcement but it is not a roadmap commitment either, and a procurement team renewing a multi-year contract against a maintenance-mode product is underwriting a class of risk the published procurement guidance has not yet caught up with.
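The three cohorts compress into a small decision procedure. A sketch of the recommendation logic above, nothing more; it is an illustration of this piece’s guidance, not procurement advice.

```python
def shortlist(posture: str, on_langchain: bool, incumbent: str | None = None) -> list[str]:
    """Encodes the cohort guidance above; illustrative only."""
    if incumbent == "helicone":
        # Maintenance mode: plan migration within 6-12 months, then re-run the axis.
        return ["plan-migration", "langfuse", "arize-ax", "langsmith"]
    if posture == "open-source-first":
        # Langfuse for greenfield; Phoenix when an Arize AX migration is plausible.
        return ["langfuse", "phoenix"]
    # SaaS-first: stack commitment decides the default.
    return ["langsmith"] if on_langchain else ["arize-ax"]
```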

The cost model that breaks at scale, and the question most procurement teams underestimate, is the trace-volume one. At 50 million spans per month (a realistic mid-2026 number for an enterprise running 10 production agents), Langfuse Cloud pricing on graduated overage runs to roughly $3,300 a month, about $40k a year, before retention overage (the tier arithmetic is sketched in the Langfuse section above). Arize AX at the same volume requires a custom Enterprise quote. LangSmith Plus at the same volume crosses the per-trace overage threshold and requires Enterprise renegotiation. Helicone’s proxy cost grows with request volume, not span volume, which is structurally cheaper for a chat-completion-only workload and structurally more expensive for an agent making 5 tool calls per chat turn. Procurement teams that benchmark on $-per-month at the free tier sign deals that scale 10x in cost when the agent goes to production.
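The request-volume vs span-volume point is worth a line of arithmetic. Assuming, illustratively, that a plain chat turn is one proxied LLM request while an agent loop with 5 tool calls makes roughly one LLM call per step plus a final answer:

```python
turns = 2_000_000                # illustrative monthly chat turns
llm_calls_per_turn_chat = 1      # plain chat completion
llm_calls_per_turn_agent = 6     # agent loop: ~1 LLM call per tool step + final answer

chat_requests = turns * llm_calls_per_turn_chat    #  2,000,000 proxied requests
agent_requests = turns * llm_calls_per_turn_agent  # 12,000,000 proxied requests

# Under request-based billing, the same workload costs ~6x once the chat
# becomes an agent; under span-based billing, graduated tiers absorb some
# of that growth, and the reverse skew applies for chat-only traffic.
```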

How this maps to the EU AI Act Article 12 audit-evidence template

The EU AI Act Article 12 audit-evidence template requires automatic recording of events (“logs”) generated by high-risk AI systems throughout the system’s lifecycle. The structural question for a 2026 buyer is which of the four observability platforms produces logs that satisfy the Article 12 requirements without bespoke integration.
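As a concrete anchor for “logs that satisfy Article 12,” here is the shape of event record the automatic-logging requirement points at. The field set is this piece’s illustrative assumption, not regulatory text, and is not legal guidance.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentAuditEvent:
    """Illustrative shape of an automatically recorded lifecycle event."""
    trace_id: str                  # correlates the event to the full agent trace
    timestamp: datetime            # when the event occurred in the agent run
    event_type: str                # e.g. "llm_call", "tool_call", "orchestration_decision"
    model: str | None              # model identity for LLM events
    input_ref: str                 # reference/hash of the input actually processed
    output_ref: str                # reference/hash of the output produced
    cost_usd: float | None = None  # cost attribution for the step
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```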

LangSmith Enterprise (self-host or hybrid) maps cleanly. Customer-controlled audit substrate, OTel-compatible trace model, retention controlled by the customer, and the compliance posture (HIPAA, GDPR, SOC 2 Type 2) is the strongest of the four. Langfuse Cloud Enterprise plus self-host satisfies the requirement when the customer takes the EU residency option and pays for the audit-log feature in the /ee enterprise tier. Arize AX Enterprise satisfies the requirement on EU residency at the Enterprise tier. Helicone’s audit-trail completeness for Article 12 is engagement-specific and, given the maintenance-mode status, should not be the procurement choice for a 2026 high-risk deployment.

The structural lesson, mirrored from the eval companion piece, is that the procurement decision in this category is shaped by deployment posture, not capability rank. Three of the four platforms ship a capability set any 2024-vintage RFP will rate as comparable. The fourth is in maintenance mode and a buyer who treats it as an apples-to-apples option is underwriting a class of risk the platform’s published documentation does not name.

For a buyer running this evaluation in mid-2026, the recommendation is to map the open-source-first vs SaaS-first axis first, confirm OpenTelemetry GenAI compliance second, validate data-residency posture against EU AI Act Article 12 requirements third, and only then compare the capability matrices. Most enterprise procurement teams do this in the reverse order and end up with a capability-matrix winner that does not match their compliance posture.

The eval/observability split is the structural lesson the companion piece on DeepEval, Braintrust, LangSmith, and Patronus and this piece together try to make load-bearing. Evaluation answers “is the agent right.” Observability answers “what did the agent do.” The first incident a production agent has is going to ask both questions at the same time, and the procurement decisions that produce defensible answers are different decisions made against different platforms.

