AM-137
Status: Holding · Last reviewed: 5 May 2026
Agent evaluation in production rests on three operational components that determine whether the chosen evaluation platform produces useful signal: eval-set design across three layers (50-200 calibration prompts, 30-100 edge-case prompts, and 10-50 production-sampled prompts per week); drift detection across three signal classes (output distribution, score distribution, and tool-use distribution); and a regression-budget framework that forces a binary ship/hold decision per release window (a defensible default: 5% absolute decline on the calibration set, 10% on the edge-case set). The procurement decision (which platform to buy, covered at AM-122) is the easier half; the operational discipline is what most enterprises under-invest in, even after buying a platform.
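One way to make the drift-detection component concrete: monitor the score-distribution signal class with a Population Stability Index (PSI) over weekly production-sampled scores. PSI is a common choice for this job, not something the claim itself prescribes; the bin count, the ~0.2 alert threshold, and all names below are illustrative assumptions. A minimal sketch:

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """PSI between a baseline and a current sample of scores in [0, 1]."""
    def hist(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            counts[min(int(x * bins), bins - 1)] += 1
        # Smooth empty bins so the log term stays defined.
        return [max(c / len(xs), 1e-4) for c in counts]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def score_drift_alert(baseline: list[float], this_week: list[float]) -> bool:
    """Flag the weekly production sample when PSI crosses the ~0.2 convention."""
    return psi(baseline, this_week) > 0.2
```

The same shape works for the other two signal classes by swapping what gets binned: output lengths or categories for output-distribution drift, per-tool call frequencies for tool-use drift.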
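The regression-budget framework can be sketched as a small gate that compares candidate scores against the current baseline and returns only ship or hold, never a judgment call. This assumes scores are mean pass rates in [0, 1]; the budget values encode the defensible defaults named above (5% absolute on the calibration set, 10% on the edge-case set), and the function and key names are illustrative:

```python
# Defensible default budgets per release window, as named in the claim.
DEFAULT_BUDGETS = {"calibration": 0.05, "edge_case": 0.10}

def ship_or_hold(
    baseline: dict[str, float],
    candidate: dict[str, float],
    budgets: dict[str, float] = DEFAULT_BUDGETS,
) -> tuple[str, list[str]]:
    """Return ('ship', []) or ('hold', [reasons]) for one release window."""
    reasons = []
    for layer, budget in budgets.items():
        decline = baseline[layer] - candidate[layer]  # absolute decline
        if decline > budget:
            reasons.append(f"{layer} declined {decline:.3f}, budget {budget:.2f}")
    return ("hold", reasons) if reasons else ("ship", [])
```

For example, a candidate that drops the calibration pass rate from 0.82 to 0.75 breaches the 5% budget and holds, even if its edge-case scores improved.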
Cluster-gap piece between AM-122 (the eval-tooling decision) and the MTTD-for-Agents framework. Review cadence: 60 days. Trigger conditions: a foundation-model provider releases a model that materially changes the eval-set noise floor; the OpenTelemetry GenAI semantic conventions extend to evaluation events with industry-standards-grade adoption; a landmark customer incident attributable to an evaluation-discipline failure is published with its lessons. Sister claims: AM-122 (eval procurement), AM-123 (observability), AM-126 (red-team), and the MTTD-for-Agents framework.
Permalink: /holding/AM-137/