Agent evaluation in production: eval-set design, drift detection, and regression budgets for the deployed agent
The four 2026 agent-evaluation platforms covered at AM-122 (DeepEval, Braintrust, LangSmith, Patronus) are the procurement decision. Whether the chosen platform produces useful signal is decided by the evaluation discipline around it: eval-set design, drift-detection cadence, and the regression-budget framework, the three operational disciplines most enterprises buy a platform for and then under-invest in. This piece walks the in-production cut that sits between the eval-tooling decision and the MTTD-for-Agents observability framework.
Holding · reviewed 5 May 2026 · next +59d. Bottom line: the procurement decision on agent evaluation platforms (DeepEval, Braintrust, LangSmith, Patronus, covered at AM-122) is the easier half of the agent evaluation question. The harder half is the operational discipline that determines whether the chosen platform produces useful signal: eval-set design across three layers, drift detection across three signal classes, and a regression-budget framework that forces binary decisions. This piece sits between the evaluation-tooling cluster and the MTTD-for-Agents observability framework, the in-production cut that most enterprises under-invest in even after buying a platform.
If you run agent deployment for a mid-market or enterprise organisation in 2026, the question on your roadmap is rarely “should we evaluate our agents.” Most teams have evaluated. The deployment shipped. The eval-set produced reasonable scores at deployment time. The platform sits in the stack. The harder question, the one that shows up as production incidents three months later, is whether the evaluation discipline keeps producing useful signal as the agent’s behavior, the model’s behavior, and the production traffic all evolve in different directions.
This piece walks the three components of the operational discipline (eval-set design, drift detection, regression budgets), the alignment with MTTD-for-Agents, the publication's signature observability framework, and the procurement-defensible posture for enterprises retrofitting evaluation discipline into deployments that shipped without it.
Why eval at procurement time is not enough
The procurement-time eval set is what the team built when they were choosing the platform. It is calibrated to the deployment-time scope, scored against deployment-time prompts, and produces deployment-time signal. It is also, in practice, the smallest version of the eval-set the deployment will ever have, because the calibration is sized for the procurement decision rather than for the operational discipline.
Three patterns make the procurement-time eval-set insufficient for production.
Production traffic diverges from calibration traffic. Real users issue prompts the calibration set did not anticipate, in patterns the calibration set did not represent. After 4-12 weeks in production, the agent needs to be evaluated against scenarios that match its actual operating envelope, not the procurement scenarios the team scoped at deployment time.
Model behavior shifts. Foundation-model providers update their models on cadences the customer does not control. A model upgrade can move agent behavior in ways the calibration set does not detect because the calibration set was scored against the prior model. Without drift detection, the agent’s behavior change is invisible until users report it.
Prompt and tool-graph changes accumulate. Engineering teams iterate on prompts, add tools, refine policies. Each change is small; in aggregate, after 6-12 months, the agent is materially different from the version that was evaluated at deployment. The eval-set has to evolve with the agent or it stops measuring what the agent actually does.
The operational discipline is what keeps the evaluation useful past deployment. Three components compose it.
Component 1: eval-set design across three layers
Layer 1: calibration set. 50-200 prompts that represent the agent’s intended use cases, with known-good responses or scoring rubrics. The calibration set is what the procurement decision was based on; it stays in the eval suite as the baseline that detects deployment regressions. Update cadence: quarterly, or on material agent-purpose changes.
Layer 2: edge-case set. 30-100 prompts that test failure modes, adversarial inputs designed to surface prompt-injection vulnerabilities, unusual phrasing that tests robustness, low-context queries that test the agent’s hedging behavior, multi-turn drift scenarios where the agent’s task is gradually redirected. The edge-case set evolves as the agent red-team (AM-126) discipline surfaces new attack patterns. Update cadence: monthly, with explicit triggers from red-team findings or OWASP Agentic Top 10 (AM-043) updates.
Layer 3: production-sampled set. 10-50 prompts per week sampled from actual production traffic, anonymised, with manual scoring against the agent’s intended-use rubric. The production-sampled set is what catches the drift the calibration set misses, because production traffic looks different from the calibration set after several months of agent operation. Update cadence: continuous, with the prior week’s sample retained for comparison and the rolling 12-week window analyzed for distribution changes.
The combination produces a signal that survives across model upgrades, prompt changes, and natural traffic evolution. An eval-set with only Layer 1 is procurement-shaped, not operations-shaped. An eval-set with all three layers is what the operational discipline requires.
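A minimal sketch of what the three-layer suite looks like as a data structure, assuming a plain Python representation rather than any specific platform's schema; the field names, layer labels, and the rolling-window helper are illustrative, and the size ranges are the defaults above.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class EvalCase:
    prompt: str
    rubric: str              # known-good response or scoring rubric
    layer: str               # "calibration" | "edge_case" | "production_sample"
    added: date = field(default_factory=date.today)
    source: str = "manual"   # e.g. "red_team", "owasp_update", "prod_sample_w19"

@dataclass
class EvalSuite:
    calibration: list[EvalCase]         # 50-200 prompts, reviewed quarterly
    edge_cases: list[EvalCase]          # 30-100 prompts, reviewed monthly
    production_samples: list[EvalCase]  # 10-50 prompts/week, rolling window

    def active_cases(self, window_weeks: int = 12) -> list[EvalCase]:
        """Full calibration and edge-case layers, plus production samples
        inside the rolling 12-week analysis window."""
        cutoff = date.today().toordinal() - window_weeks * 7
        recent = [c for c in self.production_samples
                  if c.added.toordinal() >= cutoff]
        return self.calibration + self.edge_cases + recent
```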
Component 2: drift detection across three signal classes
Output-distribution drift. Statistical changes in the length, structure, or content patterns of the agent's responses, detected against a rolling baseline window. The signals that matter are response-length shifts (which often indicate prompt changes or model upgrades), structure shifts (formatting changes, list-vs-prose changes, citation patterns), and content-distribution shifts (vocabulary changes, tone shifts, topic drift). Detection cadence: hourly to daily, depending on traffic volume.
Score-distribution drift. Changes in the eval-set score distribution over time, detected against the deployment-time baseline. The signals that matter are mean-score declines (the headline regression signal), variance increases (the agent has become less consistent), and tail-distribution changes (the worst-case responses are getting worse even if the mean holds). Detection cadence: per-release for calibration sets, weekly for production-sampled sets.
Tool-use distribution drift. Changes in the agent’s tool-call patterns, which tools the agent calls, in what order, against what input. The signals that matter are tool-frequency shifts (the agent is calling a tool more or less than baseline), tool-sequence shifts (the order of multi-tool flows is changing), and input-distribution shifts (the tool inputs are getting longer, shorter, or structurally different). Detection cadence: daily, with weekly aggregation for trend analysis.
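A minimal sketch of the per-window drift check, assuming the signal of interest has already been reduced to a scalar series per window (response lengths for output drift, per-case scores for score drift, per-tool call counts for tool-use drift); the scipy-based test and the summary statistics are one reasonable choice, not a platform requirement.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(baseline: np.ndarray, current: np.ndarray,
                 alpha: float = 0.01) -> dict:
    """Compare the current window of a scalar signal against a rolling
    baseline window. Thresholds here are illustrative, not calibrated."""
    ks_stat, p_value = ks_2samp(baseline, current)
    return {
        "distribution_shift": bool(p_value < alpha),  # the shape of the distribution moved
        "mean_decline": float(np.mean(baseline) - np.mean(current)),
        "variance_ratio": float(np.var(current) / max(np.var(baseline), 1e-9)),
        # tail check: the worst decile getting worse even while the mean holds
        "p10_decline": float(np.percentile(baseline, 10) - np.percentile(current, 10)),
        "ks_statistic": float(ks_stat),
    }
```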
Each signal class has its own alert threshold and its own response pattern. Drift is not always bad: model upgrades produce drift, customer-traffic changes produce drift, intentional prompt updates produce drift. The operational discipline is detecting drift, attributing it to a cause (model change, prompt change, traffic change, or unattributable), and deciding whether to act.
The attribution step is what most teams skip. Drift detected without attribution becomes either alert fatigue (every signal triggers investigation that finds no actionable cause) or signal blindness (real regressions are missed because the team has stopped investigating the alert stream). The procurement-defensible discipline is to require attribution before acting and to triage alerts by attribution category.
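A sketch of the attribution step, assuming the team already keeps a deployment change log with timestamps and a coarse change kind; the log shape, the kind labels, and the precedence order are assumptions for illustration, not a standard.

```python
from enum import Enum

class Attribution(Enum):
    MODEL_CHANGE = "model_change"
    PROMPT_CHANGE = "prompt_change"
    TRAFFIC_CHANGE = "traffic_change"
    UNATTRIBUTED = "unattributed"

def attribute_drift(window_start, window_end, change_log, traffic_drifted: bool) -> Attribution:
    """Cross-reference a drift window against the deployment change log.
    `change_log` is assumed to be a list of (timestamp, kind) entries with
    kinds such as "model_version", "prompt", or "tool_graph"."""
    changes = {kind for ts, kind in change_log if window_start <= ts <= window_end}
    if "model_version" in changes:
        return Attribution.MODEL_CHANGE
    if "prompt" in changes or "tool_graph" in changes:
        return Attribution.PROMPT_CHANGE
    if traffic_drifted:  # e.g. the same drift check fired on the inbound-prompt distribution
        return Attribution.TRAFFIC_CHANGE
    return Attribution.UNATTRIBUTED  # the category that escalates rather than gets filed
```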
Component 3: regression budgets
A regression budget is the pre-defined tolerance for evaluation-score decline before action is required. The framework forces the evaluation discipline to produce binary decisions (ship or hold, accept or roll back) rather than ambiguous signals.
Defensible defaults. A 5% absolute decline on the calibration set, computed per release window (typically weekly or per-deployment), triggers a release hold pending investigation. A 10% absolute decline on the edge-case set triggers the same hold, with the added implication that the investigation treats it as a potential security or robustness regression. A 3% rolling decline on the production-sampled set over 4 weeks triggers a deeper investigation into traffic-distribution change or accumulated prompt drift.
The thresholds are budgets in the explicit sense. A release that produces a 4% calibration-set decline is shipped because it is within budget. A release that produces a 6% decline is held. The framework removes the discretionary call from the engineering team’s day-to-day operation; the call was already made at the budget-setting moment.
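A minimal release-gate sketch applying the defaults above, assuming scores are expressed as fractions of 1.0 so that a 5% absolute decline is 0.05; the production-sampled budget runs on its 4-week rolling window rather than per release, so it is not part of this gate.

```python
def release_gate(baseline_scores: dict[str, float],
                 current_scores: dict[str, float]) -> dict:
    """Binary ship/hold decision per release window. Budgets are the defensible
    defaults from above and are per-deployment choices, not fixed constants."""
    budgets = {"calibration": 0.05, "edge_case": 0.10}
    breaches = {}
    for layer, budget in budgets.items():
        decline = baseline_scores[layer] - current_scores[layer]
        if decline > budget:
            breaches[layer] = decline
    return {"decision": "hold" if breaches else "ship", "breaches": breaches}

# A 4% calibration-set decline ships; a 6% decline holds.
print(release_gate({"calibration": 0.90, "edge_case": 0.85},
                   {"calibration": 0.86, "edge_case": 0.84}))  # -> ship
print(release_gate({"calibration": 0.90, "edge_case": 0.85},
                   {"calibration": 0.84, "edge_case": 0.84}))  # -> hold
```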
Setting the budget. Three inputs. (1) Customer tolerance for behavioral regression: high-stakes deployments (legal, healthcare, financial advice) tolerate less; lower-stakes deployments (internal productivity, informational queries) tolerate more. (2) The eval-set noise floor: the natural variance in the eval-set score across rerun cycles where nothing has changed. The budget must be larger than the noise floor or every release fires false alerts. (3) Release-cadence implications: a tight budget on a weekly release cadence catches regressions faster but produces more holds; a loose budget on a slower cadence catches less but ships more.
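A sketch of estimating input (2), the noise floor, assuming the harness exposes a callable that re-runs the unchanged suite against the unchanged agent and returns an aggregate score; the 2x multiplier is a rule-of-thumb cushion, not a derived constant.

```python
import statistics

def estimate_noise_floor(run_eval, n_reruns: int = 5) -> float:
    """Re-run the unchanged eval suite against the unchanged agent (same model
    pin, same prompts, same eval set) and take the spread of the aggregate
    score. A regression budget below roughly twice this spread fires on noise."""
    scores = [run_eval() for _ in range(n_reruns)]
    return 2 * statistics.pstdev(scores)
```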
Updating the budget. The budget is itself a procurement-defensible artefact that gets reviewed quarterly. Three triggers update the budget. Eval-set evolution (the calibration set has changed, the noise floor has changed). Customer-tolerance change (the deployment’s risk profile has shifted). Production-incident learning (a regression that escaped budget needs to be modeled in future budgets).
Enterprises that operate without explicit regression budgets typically end up either holding releases on noise (the team sees a 2% decline, panics, holds for a day, finds nothing) or shipping releases through real regressions (the team sees a 7% decline, investigates, attributes incorrectly, ships, regresses production). The budget framework forces the discipline at the moment when discretion is hardest to exercise.
Alignment with MTTD-for-Agents
MTTD-for-Agents is the publication’s house framework for measuring agent-incident detection latency. The framework treats evaluation cadence as one of the four runtime metrics that determine the observability stack’s detection capability.
The four MTTD-for-Agents metrics:
- Action volume, the rate at which the agent is taking actions (per minute, per hour). Detection cadence: real-time.
- Tool-use distribution, the distribution of tool calls the agent is making, against baseline. Detection cadence: hourly.
- Cost-per-action, the token or dollar cost of each agent action. Detection cadence: hourly.
- Output distribution, the statistical distribution of the agent’s outputs, against baseline. Detection cadence: hourly to daily.
Production evaluation is the discipline that produces metric 4 (output distribution drift) and complements the other three. The integration is operational, not architectural. The evaluation platform produces the signal; the observability stack (Langfuse, Arize, Helicone, LangSmith, AM-123) consumes the signal alongside the runtime metrics; the incident response runbook treats evaluation regressions and runtime anomalies as variants of the same incident class.
The MTTD-for-Agents floor for behavioral incidents is the evaluation cadence. An agent with hourly production-sampled evaluation has a 1-hour MTTD floor for behavioral regression; an agent with weekly evaluation has a 7-day floor. The procurement-defensible posture is to align the evaluation cadence with the MTTD requirements of the deployment’s risk profile, not to default to the eval platform’s recommended cadence regardless of risk fit.
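A small sketch of the cadence-to-MTTD alignment, assuming an illustrative mapping from risk profile to required detection latency; the mapping is an assumption for the example, not a compliance table.

```python
# Required MTTD for behavioral regressions by risk profile, in hours (illustrative).
MTTD_REQUIREMENT_HOURS = {"high_risk": 1, "medium_risk": 24, "low_risk": 168}

def cadence_meets_mttd(risk_profile: str, production_sample_cadence_hours: float) -> bool:
    """A behavioral regression that only surfaces in eval scores cannot be detected
    faster than the production-sampled evaluation runs; the cadence is the floor."""
    return production_sample_cadence_hours <= MTTD_REQUIREMENT_HOURS[risk_profile]

# Daily production-sampled evaluation meets a 24-hour requirement but not a 1-hour one.
assert cadence_meets_mttd("medium_risk", 24)
assert not cadence_meets_mttd("high_risk", 24)
```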
What changes in 2026 procurement
Three procurement-language additions for AI deployments where the evaluation discipline is in scope.
Eval-set ownership and portability. The customer owns the eval-set IP regardless of which platform hosts it. The MSA includes export rights for the eval-set in formats the customer’s broader audit substrate can consume. Vendor lock-in via eval-set hosting is a procurement risk that competent contracting language now addresses.
Drift-detection signal access. The evaluation platform produces the signals listed above (output distribution, score distribution, tool-use distribution) in formats the customer’s observability stack can consume. The integration is named in the contract; it is not assumed.
Regression-budget audit substrate. The evaluation discipline produces an auditable record of which releases were held, which were shipped, and what budget thresholds applied. The substrate is part of the EU AI Act Article 12 (AM-046) compliance file for high-risk deployments and is part of the customer’s procurement-defensible operational discipline regardless of regulatory regime.
What this piece does not claim
This piece does not claim that the three components must be implemented all at once. The defensible deployment-stage path is calibration set + score-distribution drift detection + a release-window regression budget. The edge-case set, production-sampled set, output-distribution drift, and tool-use distribution drift can be added in subsequent quarters as the discipline matures.
This piece does not claim that any specific eval-set size is universally correct. The 50-200, 30-100, 10-50 ranges are defensible defaults; the right size depends on the agent’s intended-use scope, the variance in production traffic, and the customer’s evaluation budget.
This piece does not claim that drift is always actionable. Many drift signals attribute to causes the customer chose (model upgrade, prompt change) and are accepted rather than corrected. The discipline is to detect, attribute, and decide, not to revert every drift event.
What changes this read
Three triggers would shift the analysis. A foundation-model provider releasing a model that materially changes the eval-set noise floor (typically a more-consistent model that allows tighter regression budgets). Industry-standards convergence on production-sampled evaluation patterns (e.g., the OpenTelemetry GenAI semantic conventions extending to evaluation events). A landmark customer incident attributable to evaluation-discipline failure that produces published learning the procurement language can absorb.
We will re-test against the DeepEval, Braintrust, LangSmith, and Patronus AI documentation, plus the Hamel Husain and Eugene Yan published work on evaluation-driven development, on or before 4 Jul 2026.
The companion reading is AM-122 (the four-platform procurement decision), AM-123 (the four-platform observability decision), AM-126 (the agent red-team discipline), and the MTTD-for-Agents framework, where evaluation cadence sits as one of the four runtime metrics.