Agent evaluation frameworks in 2026: DeepEval, Braintrust, LangSmith, and Patronus map to four deployment shapes
The four credible agent-evaluation platforms in 2026 don't compete on capability rank. They fit four distinct deployment shapes. DeepEval is the open-source pytest-native option. Braintrust is the SaaS eval primitive. LangSmith is the LangChain-stack observability and eval bundle. Patronus has pivoted from hallucination specialist to digital-world-model frontier lab. Picking by a generic feature matrix produces the wrong answer for most enterprises.
If you run AI implementation for a mid-market or enterprise organisation in 2026, you have probably been pitched some version of the following: pick a foundation model, pick an orchestration framework, deploy. The pitch lands because the foundation models are real, the frameworks ship, and the demos work. None of that is vapor.
The harder problem, and the one that keeps showing up in the survey data, is what happens when the agent gets to production and starts producing answers nobody is checking, because nobody is measuring whether they are right. LangChain’s State of AI Agents 2025 report (n=1,340, surveyed 18 Nov to 2 Dec 2025) puts quality, at 32 percent, as the largest production blocker, ahead of latency at 20 percent, with cost a category that “notably decreased from previous years as model pricing fell.” 89 percent of respondents have implemented some form of observability, but only 52.4 percent run offline evaluations and 37.3 percent run online evaluations. Among production agents, 94 percent have observability and 71.5 percent have full tracing. But observability tells you what the agent did, not whether the agent was right.
McKinsey’s November 2025 State of AI (n=1,993 across 105 nations, surveyed 25 June to 29 July 2025) cross-validates the structural shape: 62 percent of respondents say their organisations are at least experimenting with AI agents, 23 percent say they are scaling somewhere in the enterprise, but in any given business function “no more than 10 percent of respondents say their organizations are scaling AI agents.” Only 39 percent report EBIT impact at the enterprise level. Two-thirds of organisations have not yet begun scaling AI across the enterprise.
Read the two surveys together and the picture is clear. Most enterprises are not yet at the eval-as-blocker problem. They are in the “experimenting with one team” phase. But the 23 percent that are scaling, and the customers buying the agent platforms ServiceNow, Salesforce, Microsoft, and Anthropic are now selling, are running into a tooling-choice problem that did not exist in 2024 because the production-grade eval-and-observability category did not exist in 2024.
The 2026 category does exist. It has converged around four credible platforms (DeepEval, Braintrust, LangSmith, Patronus AI) and they do not compete on capability rank. Each fits a different deployment shape. Picking by feature matrix produces the wrong answer for most enterprises. This piece walks the four platforms, the four shapes, and the procurement primitives that distinguish them.
What “evaluation” means in 2026 vs what it meant in 2024
In 2024, “evaluating an LLM” meant running a benchmark. MMLU, HumanEval, GSM8K. The benchmark numbers were vendor-published, the methodology was disclosed, and the question being answered was “is GPT-4 better than Claude 2 on grade-school maths.” This is still a useful question, but it is not the question enterprises are asking in 2026.
In 2026, “evaluating an agent” means three coupled questions. Did the agent reach the right answer (correctness)? Did the agent reach it in a way that did not break (tool-correctness, plan-adherence, step-efficiency)? Will the agent still reach it next quarter when the underlying model gets a silent refresh, the system prompt picks up an edit, the retrieval index gets re-embedded, or the upstream tool changes its API surface (regression, drift)?
The first question is the model-benchmark legacy. The second and third are what the four 2026 platforms were built to address, and they address them differently because they assume different things about where the agent runs and who owns the audit trail.
DeepEval’s v3.9.9 release notes (1 Dec 2025) make this concrete. The metric library has expanded materially since the 2024 cut. Alongside the RAG-pipeline metrics (Answer Relevancy, Faithfulness, Contextual Recall, Contextual Precision, Contextual Relevancy, RAGAS) and the LLM-as-judge primitives (G-Eval, DAG), it now ships agent-focused metrics that did not exist as named primitives in 2024: Task Completion, Tool Correctness, Goal Accuracy, Step Efficiency, Plan Adherence, Plan Quality. There are also multi-turn conversation metrics (Knowledge Retention, Conversation Completeness, Turn Relevancy, Role Adherence) that distinguish “did the chatbot stay on task across 11 turns” from “did the chatbot answer the first turn well.” This is the 2026 surface area. The 2024 surface area was answer-correctness on single-turn QA.
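To make the shift tangible, here is a minimal sketch of what an agent-level check looks like in DeepEval, as opposed to a single-turn QA assertion. The class names (`LLMTestCase`, `ToolCall`, `ToolCorrectnessMetric`, `TaskCompletionMetric`) follow the patterns in DeepEval's public docs for recent releases, but treat the exact signatures as assumptions to verify against the version you install; the judge-backed metric also needs an LLM API key in the environment.

```python
# pip install deepeval   (judge-backed metrics need e.g. OPENAI_API_KEY set)
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric, TaskCompletionMetric

# One trace of a tool-using agent: what it was asked, what it answered,
# which tools it actually called, and which tools it was expected to call.
test_case = LLMTestCase(
    input="Refund order 8812 and email the customer a confirmation.",
    actual_output="Refund issued for order 8812; confirmation email sent.",
    tools_called=[ToolCall(name="issue_refund"), ToolCall(name="send_email")],
    expected_tools=[ToolCall(name="issue_refund"), ToolCall(name="send_email")],
)

# Tool Correctness compares called vs expected tools; Task Completion asks
# an LLM judge whether the stated task was actually achieved.
evaluate(
    test_cases=[test_case],
    metrics=[ToolCorrectnessMetric(), TaskCompletionMetric(threshold=0.7)],
)
```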
The procurement implication is that an enterprise picking an eval platform on the strength of its 2024 RAG-metric coverage is solving the wrong problem if the agent in question is multi-turn and tool-using. Most production agents in 2026 are.
The four 2026 platforms
DeepEval, the open-source pytest-native option
DeepEval is published by Confident AI. The repository at github.com/confident-ai/deepeval shows 15.1k stars and 1.4k forks as of the v3.9.9 release. It runs in Python (≥3.9), integrates with pytest including @pytest.mark.parametrize and parallel execution flags, and supports the providers an enterprise team is realistically integrating against in 2026: OpenAI, Anthropic, LangChain, LangGraph, CrewAI, LlamaIndex, Pydantic AI, AWS AgentCore.
The deployment posture is the most permissive of the four. The package runs locally; the cloud platform (Confident AI, accessed via deepeval login) is optional. An enterprise that does not want any eval data leaving its network can run the entire framework on its own infrastructure. There is no SaaS-tier-gated metric. Every metric ships in the open-source package.
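As an illustration of the eval-as-code posture, a hedged sketch of the pytest integration described above. The `my_agent` stub and the golden pairs are hypothetical stand-ins for a real agent and a versioned dataset; the `assert_test` / `LLMTestCase` pattern follows DeepEval's documented usage, and the judge-backed metric needs an LLM API key to run.

```python
# pip install deepeval pytest   (run with plain pytest, or via DeepEval's test runner)
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def my_agent(question: str) -> str:
    # Stand-in for the real agent entry point; replace with your own call.
    return "Refunds are accepted within 30 days of delivery."

# Hypothetical golden set; in practice this lives in a versioned dataset file in git.
GOLDENS = [
    ("What is the refund window?", "Refunds are accepted within 30 days."),
    ("Can I return an opened item?", "Opened items can be returned within 14 days."),
]

@pytest.mark.parametrize("question,expected", GOLDENS)
def test_agent_answer_relevancy(question: str, expected: str):
    case = LLMTestCase(
        input=question,
        actual_output=my_agent(question),
        expected_output=expected,
    )
    # A score below the threshold fails the pytest run, and therefore the CI merge gate.
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])
```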
Where DeepEval fits structurally: engineering-led teams that want eval as code, evaluation runs that live in CI alongside unit tests, and an audit trail that is git-backed rather than vendor-backed. The flip side is that DeepEval does not ship a UI for non-engineers. Product owners who need to inspect failing traces interactively, label edge cases, and run human-review queues are buying that out of band, typically by paying for Confident AI’s hosted offering, which is the upgrade path the project is built around.
Braintrust, the SaaS eval primitive
Braintrust (at braintrust.dev) describes itself, on its own homepage, as “the AI observability platform helping teams measure, evaluate, and improve AI in production.” The eval primitive is a first-class object (Eval, Dataset, Project, Score) and the trace model is integrated with the OpenAI and Anthropic SDKs so a team that has already wired its agent against either provider can drop Braintrust in without rewriting the inference code.
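A minimal sketch of that eval primitive, following the quickstart pattern in Braintrust's docs: `Eval` takes a project name, a data source, a task function, and a list of scorers. The project name, inline dataset, and `run_agent` stand-in are hypothetical; the `Levenshtein` scorer comes from the companion `autoevals` package, and credentials are supplied through a Braintrust API key in the environment.

```python
# pip install braintrust autoevals   (needs a Braintrust API key in the environment)
from braintrust import Eval
from autoevals import Levenshtein

def run_agent(question: str) -> str:
    # Stand-in for the real agent entry point; replace with your own call.
    return "Refunds are accepted within 30 days."

Eval(
    "refund-agent",  # hypothetical project name
    data=lambda: [
        {"input": "What is the refund window?",
         "expected": "Refunds are accepted within 30 days."},
    ],
    task=run_agent,
    scores=[Levenshtein],  # swap in model-graded scorers from autoevals as needed
)
```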
Pricing on the public pricing page is structured in three tiers. Starter at $0/month with 1 GB of processed data, 10k scores, 14 days of retention, and unlimited users, projects, datasets, and experiments. Pro at $249/month with 5 GB processed data, 50k scores, 30 days of retention, and the more advanced features (custom charts, environments, priority support). Enterprise at custom pricing, and this is the editorially material tier, with “on-prem or hosted deployment for high volume or privacy-sensitive data,” S3 data export, SAML SSO, and custom RBAC.
Where Braintrust fits structurally: organisations that want a SaaS eval-and-observability surface where the eval primitive is the first-class object the team works with, not a side feature of an orchestration framework. The Pro tier is realistic for a team scaling out one or two agents at a time. The Enterprise tier is where the on-prem and audit-trail requirements get surfaced, and the on-prem option makes Braintrust the only one of the SaaS-first platforms in this category that an EU-residency-bound or pharma-bound buyer can plausibly underwrite.
LangSmith, the LangChain-stack observability and eval bundle
LangSmith (documentation now at docs.langchain.com/langsmith after the 2026 redirect from the smith.langchain.com URL) is described in the LangChain docs as “a framework-agnostic platform for building, debugging, and deploying AI agents and LLM applications.” The framework-agnostic framing is true at the API level. LangSmith accepts traces from non-LangChain code via SDK. But the integration density with LangChain and LangGraph is the load-bearing structural fact. If the agent is built on LangGraph, LangSmith is the path of least resistance. If the agent is not, LangSmith is one option among four.
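The framework-agnostic path looks roughly like this: decorate plain Python functions with `@traceable` from the `langsmith` SDK and nested calls show up as a trace tree in LangSmith. The function names below are hypothetical stand-ins, and tracing is switched on through the LangSmith API-key and tracing environment variables described in the docs.

```python
# pip install langsmith   (set LANGSMITH_API_KEY and enable tracing per the LangSmith docs)
from langsmith import traceable

@traceable(name="lookup_order")
def lookup_order(order_id: str) -> dict:
    # Stand-in tool call; in production this hits the order system.
    return {"order_id": order_id, "status": "delivered"}

@traceable(name="support_agent")
def support_agent(question: str) -> str:
    # Plain Python, no LangChain: nested @traceable calls become child spans in the trace.
    order = lookup_order("8812")
    return f"Order {order['order_id']} is {order['status']}; returns close 30 days after delivery."

if __name__ == "__main__":
    print(support_agent("Where is my order and can I still return it?"))
```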
Pricing sits at three tiers. Developer is $0 per seat per month with up to 5k base traces per month, then pay-as-you-go, single-seat only, community support. Plus is $39 per seat per month with up to 10k base traces per month and unlimited seats; Plus also includes one free dev-sized deployment, email support, up to 500 Fleet runs monthly. Enterprise is custom-priced with hybrid deployment, custom SSO and RBAC, deployed engineering access, and an SLA. Deployment costs are itemised separately at $0.0007/min for dev deployments and $0.0036/min for production deployments on Plus or above.
The structural fact most enterprise procurement teams will care about is data residency. LangSmith’s cloud is hosted in either US or EU regions for the Developer and Plus tiers. The hybrid posture (SaaS control plane, customer-VPC data plane) and the fully self-hosted posture are Enterprise-tier-only. LangSmith’s compliance posture is HIPAA, SOC 2 Type 2, and GDPR per the documentation’s own callout, the only one of the four platforms that names HIPAA explicitly on its public docs as of the writing date.
Where LangSmith fits structurally: organisations that have committed to the LangChain stack and want one vendor relationship across orchestration (LangGraph), observability (LangSmith), and evaluation (LangSmith). The lock-in is real and is reasonably named in the procurement diligence. The flip side is that the integration depth produces capabilities the framework-agnostic alternatives cannot match: Fleet (visual agent design), Studio (visual end-to-end design), and the LangGraph-native trace model.
Patronus AI, and a vendor pivot worth tracking
The brief that scoped this piece framed Patronus AI as the hallucination-detection specialist anchored on Lynx, the model that Patronus says was “the first model that beats GPT-4 on hallucination tasks” (verbatim from the patronus.ai homepage). That framing was correct at the start of 2026 and the Lynx model is still public. The framing has materially shifted by mid-2026.
Patronus’s current homepage now describes the company as “a frontier lab developing simulation research and infrastructure” and centres the product narrative on “Digital World Models, systems that predict and simulate agent actions in digital workflows.” The four flagship capabilities listed are Deep Research (understanding and reasoning over large semantic datasets), Multi-Turn Dialogue (collaborative problem solving), Long Horizon (task planning spanning days to months), and Memory (agentic memory with context windows and tooling). The recent research callouts are BLUR (a tip-of-the-tongue evaluation dataset, arXiv 2503.19193) and GLIDER (an evaluation model for explainable reasoning, arXiv 2412.14140).
This is the kind of vendor movement a tracked-claim framework is built to surface. A buyer who selected Patronus in late 2025 on the strength of the Lynx hallucination story is, in mid-2026, holding a contract with a vendor that has repositioned itself toward agent simulation and digital-world-model research. The Lynx model is still real and still useful for the narrow hallucination-detection task. The company’s future-product roadmap is no longer organised around it. Procurement leads underwriting a multi-year commitment should ask Patronus directly which product line will be the load-bearing one in 2027 and 2028 before signing.
The four platforms produce four answers. Picking by capability rank (“which one has the most metrics” or “which one scored highest on a vendor benchmark”) produces a worse procurement outcome than picking by deployment shape.
The capability matrix
| Dimension | DeepEval | Braintrust | LangSmith | Patronus AI |
|---|---|---|---|---|
| Posture | Open source (Apache 2.0); optional SaaS | SaaS-first; Enterprise on-prem available | SaaS-first; Enterprise self-host available | Frontier-lab-positioned; product surface in flux |
| Free tier | Full open-source package | Starter $0/mo, 14d retention | Developer $0/seat, 5k traces/mo | Not publicly priced |
| Mid tier | Confident AI cloud (custom) | Pro $249/mo, 30d retention | Plus $39/seat/mo, 10k traces/mo | Not publicly priced |
| Enterprise tier | Confident AI hosted/on-prem | On-prem or hosted, S3 export, SAML SSO, RBAC | Hybrid or self-host, SSO, RBAC, SLA | Direct sales |
| Eval primitive | Python pytest test cases plus 25+ metrics | Eval, Dataset, Score, Span | Eval datasets plus traces plus experiments | Lynx plus research-grade evaluators |
| Agent metrics | Task Completion, Tool Correctness, Goal Accuracy, Step Efficiency, Plan Adherence, Plan Quality | Eval primitives generalise; agent-trace surface | LangGraph-native multi-step trace plus eval | BLUR, GLIDER (research) |
| RAG metrics | RAGAS, Answer Relevancy, Faithfulness, Contextual Recall/Precision/Relevancy | Generic evaluators plus custom scores | LangChain-stack-native | Lynx (hallucination) |
| Multi-turn | Knowledge Retention, Conversation Completeness, Turn Relevancy, Role Adherence | Score-based on traces | Conversation trace surfaces | Multi-turn flagship |
| Compliance posture | Customer-managed (whatever Python runs on) | SOC 2; on-prem available | SOC 2 Type 2, HIPAA, GDPR; EU region | Not named publicly |
| Data residency | Customer-controlled | US default; on-prem option | US or EU; hybrid; self-host | Not named publicly |
The matrix is informative but it is not the procurement decision. The procurement decision is the deployment shape.
The four deployment shapes
Shape 1: engineering-led, eval-as-code, audit-trail-in-git
The team has senior engineering leadership and an in-house view that evaluation is a software-engineering practice, not a product-management one. Eval runs go through CI. Failing tests block merges the same way unit-test failures do. The audit trail is the git history; the dashboards are whatever Grafana or Datadog the engineering org already runs. The product owners who care about edge-case behaviour read summaries from engineering, not interactive dashboards.
DeepEval is the natural fit. Open-source posture, pytest-native, no vendor lock-in, no per-seat fees, Confident AI cloud as an optional upgrade path when the team wants a UI for non-engineers.
The shape fails when the team does not have engineering capacity to run this discipline themselves. Eval-as-code is cheap when an engineering manager owns it and expensive when product owners are paying agency contractors to write test cases.
Shape 2: SaaS-first, eval-as-product, vendor-managed audit trail
The team wants the eval primitive to be the first-class object their day-to-day tooling is organised around. Engineering is happy to wire SDKs against OpenAI or Anthropic and let the vendor own the dashboards, the dataset versioning, the experiment-comparison tooling, and the human-review queues. Procurement is willing to pay $249 per project per month for the Braintrust Pro tier and budget for Enterprise when the agent goes to production.
Braintrust is the natural fit. The Eval-Dataset-Score primitives are clean. The on-prem option is available at Enterprise tier for the EU-residency or pharma-regime bound buyer. Pricing is published.
The shape fails when the agent is built on LangGraph and the team is going to spend more time wiring Braintrust into LangGraph than they would gain from Braintrust’s UI. In that case the next shape is the right one.
Shape 3: LangChain-stack-native, observability + eval bundled
The team has committed to LangChain or LangGraph as the orchestration framework. The LangSmith integration is the path of least resistance: one vendor relationship spans tracing, evaluation, and the visual agent-design surfaces (Fleet, Studio). Data residency is published (US or EU on cloud; hybrid and self-host on Enterprise). The compliance posture is the strongest of the four (SOC 2 Type 2, HIPAA, GDPR). Pricing is per-seat and predictable.
LangSmith is the natural fit, and it is the only fit if the agent is LangGraph-native and the procurement team is unwilling to underwrite a second-vendor integration risk.
The shape fails when the team is locked into LangChain on inference but locked out of LangChain on data, for example, when the agent runs in a customer’s VPC against a customer’s data and the LangSmith hybrid posture is unavailable at Plus tier. The Enterprise upgrade is the answer; the Plus-tier-procurement-with-hybrid-expectations buyer is the failure mode.
Shape 4: research-grade hallucination + simulation, vendor-direct
The team has a narrow, hard problem, usually hallucination detection in a regulated industry, or agent simulation against a workflow that does not yet exist in production. The buyer is willing to engage Patronus directly as a frontier-lab vendor and underwrite a research-grade contract rather than a product-grade SaaS subscription.
Patronus AI is the fit, with the caveat that the company’s strategic positioning has shifted in 2026. A 2025 Patronus contract anchored on Lynx for hallucination detection is still serviceable; the Lynx model is in production. A 2026 or 2027 contract should be re-scoped against Patronus’s current digital-world-models framing, not the framing the brief assumed.
Procurement primitives — where the platforms answer cleanly
For an enterprise running the 60-question agentic AI RFP, four primitives separate the four platforms more cleanly than any feature matrix.
Data residency. DeepEval is customer-controlled (anywhere Python runs). Braintrust defaults to US and offers on-prem at Enterprise. LangSmith publishes US and EU cloud regions plus hybrid and self-host at Enterprise. Patronus does not name data-residency posture publicly as of the writing date. For an EU-AI-Act-bound or HIPAA-bound buyer, the question is answerable for three of the four platforms and requires a sales conversation for the fourth.
Audit trail. DeepEval’s audit trail is the git history of the eval-as-code repository plus Confident AI when used. Braintrust’s audit trail is the SaaS platform’s eval and trace stores; on-prem deployments give the customer the audit substrate directly. LangSmith’s audit trail is similar: SaaS-managed at Plus, customer-controlled at Enterprise. Patronus’s audit trail is engagement-specific. The EU AI Act Article 12 audit-evidence template maps cleanly to LangSmith Enterprise self-host and Braintrust Enterprise on-prem; it requires a custom integration with DeepEval (engineering-led) and a sales conversation with Patronus.
Drift monitoring. This is where evaluation diverges from observability. Observability tells you the agent’s traffic patterns. Drift monitoring tells you the agent’s output distribution is no longer what it was last quarter. DeepEval supports drift evaluation as a code-level test pattern: re-run a regression suite against the new model, compare scores, alert on regression. Braintrust and LangSmith both support drift comparison as a UI-level workflow (compare experiment A to experiment B, surface score deltas). Patronus’s research-grade evaluators are not designed for production-cadence drift monitoring as of mid-2026.
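The code-level drift pattern DeepEval supports and the experiment-comparison workflow Braintrust and LangSmith expose in their UIs reduce to the same check: run the same suite twice, compare score distributions, alert on the delta. A framework-neutral sketch, with illustrative scores and an assumed 0.05 tolerance:

```python
from statistics import mean

def score_regressed(baseline: list[float], candidate: list[float],
                    max_drop: float = 0.05) -> bool:
    """Flag a regression if the mean suite score drops by more than max_drop."""
    return (mean(baseline) - mean(candidate)) > max_drop

# Re-run the identical regression suite after a silent model refresh and compare.
baseline_scores  = [0.91, 0.88, 0.93, 0.90]   # last quarter's run, pinned model
candidate_scores = [0.84, 0.80, 0.86, 0.83]   # same suite, refreshed model
if score_regressed(baseline_scores, candidate_scores):
    raise SystemExit("Eval regression detected: block the rollout and investigate")
```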
Cost model. DeepEval has zero direct platform cost (open source); the cost is engineering time and any LLM-judge calls the metrics make. Braintrust is processed-data-priced ($4 per GB overage on Starter, $3 per GB on Pro) plus scores ($2.50 per 1k on Starter, $1.50 per 1k on Pro). LangSmith is per-seat ($39 per seat) plus per-trace overage plus per-deployment-minute. Patronus is direct-sales. For a 50-engineer team running 10 agents in parallel at scale, LangSmith Plus runs to roughly $23k a year in seats before usage, Braintrust Pro at one project per agent runs to roughly $30k a year before usage, and DeepEval runs to whatever the engineering time costs. The structural cost-model question is which of these scales linearly with traffic and which scales linearly with team size; the answer is different for each platform.
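The back-of-envelope arithmetic behind those figures, using the published list prices above and assuming one Braintrust project per agent and no usage overages:

```python
SEATS, AGENTS, MONTHS = 50, 10, 12

langsmith_plus_seats = 39 * SEATS * MONTHS    # $39/seat/mo            -> $23,400/yr before trace overages
braintrust_pro       = 249 * AGENTS * MONTHS  # $249/mo x 10 projects  -> $29,880/yr before data/score overages
deepeval_platform    = 0                      # open source: cost is engineering time plus judge calls

print(langsmith_plus_seats, braintrust_pro, deepeval_platform)
```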
The procurement decision the 60-question RFP forces you into
The structural lesson from running this comparison through to the procurement primitives is that capability rank is a poor selection mechanism. Three of the four platforms ship a capability set that any 2024-vintage RFP will rate as “comparable.” The fourth (Patronus) is now in a different category from the other three and a buyer who treats it as an apples-to-apples option will misread it.
The selection mechanism that survives the 60-question RFP is deployment shape. The question that distinguishes the four shapes is not “which platform has the most metrics” but “where does the eval evidence live, who owns the audit trail, and what happens to it when the vendor changes its strategic positioning.”
A buyer that does not know the answer to the third question is the buyer most exposed to the Patronus pattern. The pattern is not Patronus-specific. Vendor pivots in this category happen on a quarterly cadence and the tracked-claims framework is built precisely to surface them. The procurement diligence question is whether the organisation buying the eval platform has a way to know in 9 months that the platform’s product positioning has changed, and what the contractual protection looks like when it does.
For a buyer running this evaluation in mid-2026, the recommendation is to map the deployment shape first (engineering-led, SaaS-first, LangChain-stack-native, research-grade), then to map the four primitives (data residency, audit trail, drift monitoring, cost model), and only then to compare the capability matrices. Most enterprise procurement teams do this in the reverse order, and most end up with a capability-matrix winner that does not match their deployment shape.
The deployment-shape lens is also the lens that makes the observability companion piece on Langfuse, Arize, Helicone, and LangSmith load-bearing rather than redundant. Evaluation answers “is the agent right.” Observability answers “what did the agent do.” Production deployments need both. The two procurement decisions are different. Conflating them is the most common 2026 procurement failure mode in this category.