Agent evaluation in production: eval-set design, drift detection, and regression budgets for the deployed agent
The four 2026 agent-evaluation platforms covered at AM-122 (DeepEval, Braintrust, LangSmith, Patronus) are the procurement decision. Whether the chosen platform produces useful signal is decided by the evaluation discipline around it: eval-set design, drift-detection cadence, and the regression-budget framework, the three operational disciplines most enterprises buy a platform for and then under-invest in. This piece walks the in-production cut that sits between the eval-tooling decision and the MTTD-for-Agents observability framework.
Holding · reviewed 5 May 2026 · next +59d. Bottom line: the procurement decision on agent evaluation platforms (DeepEval, Braintrust, LangSmith, Patronus, covered at AM-122) is the easier half of the agent evaluation question. The harder half is the operational discipline that determines whether the chosen platform produces useful signal: eval-set design across three layers, drift detection across three signal classes, and a regression-budget framework that forces binary decisions. This piece sits between the evaluation-tooling cluster and the MTTD-for-Agents observability framework, the in-production cut that most enterprises under-invest in even after buying a platform.
If you run agent deployment for a mid-market or enterprise organisation in 2026, the question on your roadmap is rarely “should we evaluate our agents.” Most teams have evaluated. The deployment shipped. The eval-set produced reasonable scores at deployment time. The platform sits in the stack. The harder question, the one that shows up as production incidents three months later, is whether the evaluation discipline keeps producing useful signal as the agent’s behavior, the model’s behavior, and the production traffic all evolve in different directions.
This piece walks the three components of the operational discipline (eval-set design, drift detection, regression budgets), the alignment with MTTD-for-Agents, the publication's signature observability framework, and the procurement-defensible posture for enterprises retrofitting evaluation discipline into deployments that shipped without it.
Why eval at procurement time is not enough
The procurement-time eval set is what the team built when they were choosing the platform. It is calibrated to the deployment-time scope, scored against deployment-time prompts, and produces deployment-time signal. It is also, in practice, the smallest version of the eval-set the deployment will ever have, because the calibration is sized for the procurement decision rather than for the operational discipline.
Three patterns make the procurement-time eval-set insufficient for production.
Production traffic diverges from calibration traffic. Real users issue prompts the calibration set did not anticipate, in patterns the calibration set did not represent. After 4-12 weeks in production, the agent needs to be evaluated against scenarios that match its actual operating envelope, not the procurement scenarios the team scoped at deployment time.
Model behavior shifts. Foundation-model providers update their models on cadences the customer does not control. A model upgrade can move agent behavior in ways the calibration set does not detect because the calibration set was scored against the prior model. Without drift detection, the agent’s behavior change is invisible until users report it.
Prompt and tool-graph changes accumulate. Engineering teams iterate on prompts, add tools, refine policies. Each change is small; in aggregate, after 6-12 months, the agent is materially different from the version that was evaluated at deployment. The eval-set has to evolve with the agent or it stops measuring what the agent actually does.
The operational discipline is what keeps the evaluation useful past deployment. Three components compose it.
Component 1: eval-set design across three layers
Layer 1: calibration set. 50-200 prompts that represent the agent’s intended use cases, with known-good responses or scoring rubrics. The calibration set is what the procurement decision was based on; it stays in the eval suite as the baseline that detects deployment regressions. Update cadence: quarterly, or on material agent-purpose changes.
Layer 2: edge-case set. 30-100 prompts that test failure modes, adversarial inputs designed to surface prompt-injection vulnerabilities, unusual phrasing that tests robustness, low-context queries that test the agent’s hedging behavior, multi-turn drift scenarios where the agent’s task is gradually redirected. The edge-case set evolves as the agent red-team (AM-126) discipline surfaces new attack patterns. Update cadence: monthly, with explicit triggers from red-team findings or OWASP Agentic Top 10 (AM-043) updates.
Layer 3: production-sampled set. 10-50 prompts per week sampled from actual production traffic, anonymised, with manual scoring against the agent’s intended-use rubric. The production-sampled set is what catches the drift the calibration set misses, because production traffic looks different from the calibration set after several months of agent operation. Update cadence: continuous, with the prior week’s sample retained for comparison and the rolling 12-week window analyzed for distribution changes.
The combination produces a signal that survives across model upgrades, prompt changes, and natural traffic evolution. An eval-set with only Layer 1 is procurement-shaped, not operations-shaped. An eval-set with all three layers is what the operational discipline requires.
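A minimal sketch of what the three-layer suite looks like as a data structure, assuming a plain Python representation rather than any specific platform's schema; the field names, layer labels, and the rolling-window helper are illustrative, and the size ranges are the defaults above.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class EvalCase:
    prompt: str
    rubric: str              # known-good response or scoring rubric
    layer: str               # "calibration" | "edge_case" | "production_sample"
    added: date = field(default_factory=date.today)
    source: str = "manual"   # e.g. "red_team", "owasp_update", "prod_sample_w19"

@dataclass
class EvalSuite:
    calibration: list[EvalCase]         # 50-200 prompts, reviewed quarterly
    edge_cases: list[EvalCase]          # 30-100 prompts, reviewed monthly
    production_samples: list[EvalCase]  # 10-50 prompts/week, rolling window

    def active_cases(self, window_weeks: int = 12) -> list[EvalCase]:
        """Full calibration and edge-case layers, plus production samples
        inside the rolling 12-week analysis window."""
        cutoff = date.today().toordinal() - window_weeks * 7
        recent = [c for c in self.production_samples
                  if c.added.toordinal() >= cutoff]
        return self.calibration + self.edge_cases + recent
```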
Component 2: drift detection across three signal classes
Output-distribution drift. Statistical changes in the length, structure, or content patterns of the agent's responses, detected against a rolling baseline window. The signals that matter are response-length shifts (which often indicate prompt changes or model upgrades), structure shifts (formatting changes, list-vs-prose changes, citation patterns), and content-distribution shifts (vocabulary changes, tone shifts, topic drift). Detection cadence: hourly to daily, depending on traffic volume.
Score-distribution drift. Changes in the eval-set score distribution over time, detected against the deployment-time baseline. The signals that matter are mean-score declines (the headline regression signal), variance increases (the agent has become less consistent), and tail-distribution changes (the worst-case responses are getting worse even if the mean holds). Detection cadence: per-release for calibration sets, weekly for production-sampled sets.
Tool-use distribution drift. Changes in the agent’s tool-call patterns, which tools the agent calls, in what order, against what input. The signals that matter are tool-frequency shifts (the agent is calling a tool more or less than baseline), tool-sequence shifts (the order of multi-tool flows is changing), and input-distribution shifts (the tool inputs are getting longer, shorter, or structurally different). Detection cadence: daily, with weekly aggregation for trend analysis.
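A minimal sketch of the per-window drift check, assuming the signal of interest has already been reduced to a scalar series per window (response lengths for output drift, per-case scores for score drift, per-tool call counts for tool-use drift); the scipy-based test and the summary statistics are one reasonable choice, not a platform requirement.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(baseline: np.ndarray, current: np.ndarray,
                 alpha: float = 0.01) -> dict:
    """Compare the current window of a scalar signal against a rolling
    baseline window. Thresholds here are illustrative, not calibrated."""
    ks_stat, p_value = ks_2samp(baseline, current)
    return {
        "distribution_shift": bool(p_value < alpha),  # the shape of the distribution moved
        "mean_decline": float(np.mean(baseline) - np.mean(current)),
        "variance_ratio": float(np.var(current) / max(np.var(baseline), 1e-9)),
        # tail check: the worst decile getting worse even while the mean holds
        "p10_decline": float(np.percentile(baseline, 10) - np.percentile(current, 10)),
        "ks_statistic": float(ks_stat),
    }
```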
Each signal class has its own alert threshold and its own response pattern. Drift is not always bad: model upgrades produce drift, customer-traffic changes produce drift, intentional prompt updates produce drift. The operational discipline is detecting drift, attributing it to a cause (model change, prompt change, traffic change, or unattributable), and deciding whether to act.
The attribution step is what most teams skip. Drift detected without attribution becomes either alert fatigue (every signal triggers investigation that finds no actionable cause) or signal blindness (real regressions are missed because the team has stopped investigating the alert stream). The procurement-defensible discipline is to require attribution before acting and to triage alerts by attribution category.
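A sketch of the attribution step, assuming the team already keeps a deployment change log with timestamps and a coarse change kind; the log shape, the kind labels, and the precedence order are assumptions for illustration, not a standard.

```python
from enum import Enum

class Attribution(Enum):
    MODEL_CHANGE = "model_change"
    PROMPT_CHANGE = "prompt_change"
    TRAFFIC_CHANGE = "traffic_change"
    UNATTRIBUTED = "unattributed"

def attribute_drift(window_start, window_end, change_log, traffic_drifted: bool) -> Attribution:
    """Cross-reference a drift window against the deployment change log.
    `change_log` is assumed to be a list of (timestamp, kind) entries with
    kinds such as "model_version", "prompt", or "tool_graph"."""
    changes = {kind for ts, kind in change_log if window_start <= ts <= window_end}
    if "model_version" in changes:
        return Attribution.MODEL_CHANGE
    if "prompt" in changes or "tool_graph" in changes:
        return Attribution.PROMPT_CHANGE
    if traffic_drifted:  # e.g. the same drift check fired on the inbound-prompt distribution
        return Attribution.TRAFFIC_CHANGE
    return Attribution.UNATTRIBUTED  # the category that escalates rather than gets filed
```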
Component 3: regression budgets
A regression budget is the pre-defined tolerance for evaluation-score decline before action is required. The framework forces the evaluation discipline to produce binary decisions (ship or hold, accept or roll back) rather than ambiguous signals.
Defensible defaults. A 5% absolute decline on the calibration set, computed per release window (typically weekly or per-deployment), triggers a release hold pending investigation. A 10% absolute decline on the edge-case set triggers the same hold, with the added implication that the investigation treats it as a potential security or robustness regression. A 3% rolling decline on the production-sampled set over 4 weeks triggers a deeper investigation into traffic-distribution change or accumulated prompt drift.
The thresholds are budgets in the explicit sense. A release that produces a 4% calibration-set decline is shipped because it is within budget. A release that produces a 6% decline is held. The framework removes the discretionary call from the engineering team’s day-to-day operation; the call was already made at the budget-setting moment.
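A minimal release-gate sketch applying the defaults above, assuming scores are expressed as fractions of 1.0 so that a 5% absolute decline is 0.05; the production-sampled budget runs on its 4-week rolling window rather than per release, so it is not part of this gate.

```python
def release_gate(baseline_scores: dict[str, float],
                 current_scores: dict[str, float]) -> dict:
    """Binary ship/hold decision per release window. Budgets are the defensible
    defaults from above and are per-deployment choices, not fixed constants."""
    budgets = {"calibration": 0.05, "edge_case": 0.10}
    breaches = {}
    for layer, budget in budgets.items():
        decline = baseline_scores[layer] - current_scores[layer]
        if decline > budget:
            breaches[layer] = decline
    return {"decision": "hold" if breaches else "ship", "breaches": breaches}

# A 4% calibration-set decline ships; a 6% decline holds.
print(release_gate({"calibration": 0.90, "edge_case": 0.85},
                   {"calibration": 0.86, "edge_case": 0.84}))  # -> ship
print(release_gate({"calibration": 0.90, "edge_case": 0.85},
                   {"calibration": 0.84, "edge_case": 0.84}))  # -> hold
```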
Setting the budget. Three inputs. (1) Customer tolerance for behavioral regression: high-stakes deployments (legal, healthcare, financial advice) tolerate less; lower-stakes deployments (internal productivity, informational queries) tolerate more. (2) The eval-set noise floor: the natural variance in the eval-set score across rerun cycles where nothing has changed. The budget must be larger than the noise floor or every release fires false alerts. (3) Release-cadence implications: a tight budget on a weekly release cadence catches regressions faster but produces more holds; a loose budget on a slower cadence catches less but ships more.
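A sketch of estimating input (2), the noise floor, assuming the harness exposes a callable that re-runs the unchanged suite against the unchanged agent and returns an aggregate score; the 2x multiplier is a rule-of-thumb cushion, not a derived constant.

```python
import statistics

def estimate_noise_floor(run_eval, n_reruns: int = 5) -> float:
    """Re-run the unchanged eval suite against the unchanged agent (same model
    pin, same prompts, same eval set) and take the spread of the aggregate
    score. A regression budget below roughly twice this spread fires on noise."""
    scores = [run_eval() for _ in range(n_reruns)]
    return 2 * statistics.pstdev(scores)
```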
Updating the budget. The budget is itself a procurement-defensible artefact that gets reviewed quarterly. Three triggers update the budget. Eval-set evolution (the calibration set has changed, the noise floor has changed). Customer-tolerance change (the deployment’s risk profile has shifted). Production-incident learning (a regression that escaped budget needs to be modeled in future budgets).
Enterprises that operate without explicit regression budgets typically end up either holding releases on noise (the team sees a 2% decline, panics, holds for a day, finds nothing) or shipping releases through real regressions (the team sees a 7% decline, investigates, attributes incorrectly, ships, regresses production). The budget framework forces the discipline at the moment when discretion is hardest to exercise.
Alignment with MTTD-for-Agents
MTTD-for-Agents is the publication’s house framework for measuring agent-incident detection latency. The framework treats evaluation cadence as one of the four runtime metrics that determine the observability stack’s detection capability.
The four MTTD-for-Agents metrics:
- Action volume, the rate at which the agent is taking actions (per minute, per hour). Detection cadence: real-time.
- Tool-use distribution, the distribution of tool calls the agent is making, against baseline. Detection cadence: hourly.
- Cost-per-action, the token or dollar cost of each agent action. Detection cadence: hourly.
- Output distribution, the statistical distribution of the agent’s outputs, against baseline. Detection cadence: hourly to daily.
Production evaluation is the discipline that produces metric 4 (output distribution drift) and complements the other three. The integration is operational, not architectural. The evaluation platform produces the signal; the observability stack (Langfuse, Arize, Helicone, LangSmith, AM-123) consumes the signal alongside the runtime metrics; the incident response runbook treats evaluation regressions and runtime anomalies as variants of the same incident class.
The MTTD-for-Agents floor for behavioral incidents is the evaluation cadence. An agent with hourly production-sampled evaluation has a 1-hour MTTD floor for behavioral regression; an agent with weekly evaluation has a 7-day floor. The procurement-defensible posture is to align the evaluation cadence with the MTTD requirements of the deployment’s risk profile, not to default to the eval platform’s recommended cadence regardless of risk fit.
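A small sketch of the cadence-to-MTTD alignment, assuming an illustrative mapping from risk profile to required detection latency; the mapping is an assumption for the example, not a compliance table.

```python
# Required MTTD for behavioral regressions by risk profile, in hours (illustrative).
MTTD_REQUIREMENT_HOURS = {"high_risk": 1, "medium_risk": 24, "low_risk": 168}

def cadence_meets_mttd(risk_profile: str, production_sample_cadence_hours: float) -> bool:
    """A behavioral regression that only surfaces in eval scores cannot be detected
    faster than the production-sampled evaluation runs; the cadence is the floor."""
    return production_sample_cadence_hours <= MTTD_REQUIREMENT_HOURS[risk_profile]

# Daily production-sampled evaluation meets a 24-hour requirement but not a 1-hour one.
assert cadence_meets_mttd("medium_risk", 24)
assert not cadence_meets_mttd("high_risk", 24)
```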
What changes in 2026 procurement
Three procurement-language additions for AI deployments where the evaluation discipline is in scope.
Eval-set ownership and portability. The customer owns the eval-set IP regardless of which platform hosts it. The MSA includes export rights for the eval-set in formats the customer’s broader audit substrate can consume. Vendor lock-in via eval-set hosting is a procurement risk that competent contracting language now addresses.
Drift-detection signal access. The evaluation platform produces the signals listed above (output distribution, score distribution, tool-use distribution) in formats the customer’s observability stack can consume. The integration is named in the contract; it is not assumed.
Regression-budget audit substrate. The evaluation discipline produces an auditable record of which releases were held, which were shipped, and what budget thresholds applied. The substrate is part of the EU AI Act Article 12 (AM-046) compliance file for high-risk deployments and is part of the customer’s procurement-defensible operational discipline regardless of regulatory regime.
What this piece does not claim
This piece does not claim that the three components must be implemented all at once. The defensible deployment-stage path is calibration set + score-distribution drift detection + a release-window regression budget. The edge-case set, production-sampled set, output-distribution drift, and tool-use distribution drift can be added in subsequent quarters as the discipline matures.
This piece does not claim that any specific eval-set size is universally correct. The 50-200, 30-100, 10-50 ranges are defensible defaults; the right size depends on the agent’s intended-use scope, the variance in production traffic, and the customer’s evaluation budget.
This piece does not claim that drift is always actionable. Many drift signals attribute to causes the customer chose (model upgrade, prompt change) and are accepted rather than corrected. The discipline is to detect, attribute, and decide, not to revert every drift event.
What changes this read
Three triggers would shift the analysis. A foundation-model provider releasing a model that materially changes the eval-set noise floor (typically a more-consistent model that allows tighter regression budgets). Industry-standards convergence on production-sampled evaluation patterns (e.g., the OpenTelemetry GenAI semantic conventions extending to evaluation events). A landmark customer incident attributable to evaluation-discipline failure that produces published learning the procurement language can absorb.
We will re-test against the DeepEval, Braintrust, LangSmith, and Patronus AI documentation, plus the Hamel Husain and Eugene Yan published work on evaluation-driven development, on or before 4 Jul 2026.
The companion reading is AM-122 (the four-platform procurement decision), AM-123 (the four-platform observability decision), AM-126 (the agent red-team discipline), and the MTTD-for-Agents framework, where evaluation cadence sits as one of the four runtime metrics.