AM-137
Status: Holding · Last reviewed: 5 May 2026
Agent evaluation in production rests on three operational components that determine whether the chosen evaluation platform produces useful signal: eval-set design across three layers (50-200 calibration prompts, 30-100 edge-case prompts, and 10-50 production-sampled prompts per week); drift detection across three signal classes (output distribution, score distribution, and tool-use distribution); and a regression-budget framework that forces a binary ship/hold decision per release window (a defensible default: 5% absolute decline on the calibration set, 10% on the edge-case set). The procurement decision (which platform to buy, covered at AM-122) is the easier half; the operational discipline is what most enterprises under-invest in, even after buying a platform.
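One way to make the drift-detection component concrete: monitor the score-distribution signal class with a Population Stability Index (PSI) over weekly production-sampled scores. PSI is a common choice for this job, not something the claim itself prescribes; the bin count, the ~0.2 alert threshold, and all names below are illustrative assumptions. A minimal sketch:

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """PSI between a baseline and a current sample of scores in [0, 1]."""
    def hist(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            counts[min(int(x * bins), bins - 1)] += 1
        # Smooth empty bins so the log term stays defined.
        return [max(c / len(xs), 1e-4) for c in counts]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def score_drift_alert(baseline: list[float], this_week: list[float]) -> bool:
    """Flag the weekly production sample when PSI crosses the ~0.2 convention."""
    return psi(baseline, this_week) > 0.2
```

The same shape works for the other two signal classes by swapping what gets binned: output lengths or categories for output-distribution drift, per-tool call frequencies for tool-use drift.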
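The regression-budget framework can be sketched as a small gate that compares candidate scores against the current baseline and returns only ship or hold, never a judgment call. This assumes scores are mean pass rates in [0, 1]; the budget values encode the defensible defaults named above (5% absolute on the calibration set, 10% on the edge-case set), and the function and key names are illustrative:

```python
# Defensible default budgets per release window, as named in the claim.
DEFAULT_BUDGETS = {"calibration": 0.05, "edge_case": 0.10}

def ship_or_hold(
    baseline: dict[str, float],
    candidate: dict[str, float],
    budgets: dict[str, float] = DEFAULT_BUDGETS,
) -> tuple[str, list[str]]:
    """Return ('ship', []) or ('hold', [reasons]) for one release window."""
    reasons = []
    for layer, budget in budgets.items():
        decline = baseline[layer] - candidate[layer]  # absolute decline
        if decline > budget:
            reasons.append(f"{layer} declined {decline:.3f}, budget {budget:.2f}")
    return ("hold", reasons) if reasons else ("ship", [])
```

For example, a candidate that drops the calibration pass rate from 0.82 to 0.75 breaches the 5% budget and holds, even if its edge-case scores improved.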
Cluster-gap piece between AM-122 (the eval-tooling decision) and the MTTD-for-Agents framework. Review cadence: 60 days. Trigger conditions: a foundation-model provider releases a model that materially changes the eval-set noise floor; the OpenTelemetry GenAI semantic conventions extend to evaluation events with industry-standards-grade adoption; a landmark customer incident attributable to an evaluation-discipline failure is published with its lessons. Sister claims: AM-122 (eval procurement), AM-123 (observability), AM-126 (red-team), and the MTTD-for-Agents framework.
Permalink: /holding/AM-137/