GAUGE — the Enterprise Agentic Governance Benchmark
A scored diagnostic for enterprise agent-mode deployments. Six governance dimensions. Scored 0–100. Published annually. Methodology open. Corrections public.
What GAUGE measures
GAUGE is a scored benchmark of enterprise agent-mode deployments across six governance dimensions. Each dimension scores 0–5. Weights are fixed. Total out of 100. Published annually starting Q4 2026 as the GAUGE 2026 Enterprise Agent Mode Governance Index — the top 20 public enterprise deployments, scored under the methodology below.
The framework exists because procurement decisions about agentic AI in 2026 are being made on unverified citations. Executives need a shared, defensible rubric to compare deployments against each other. Without one, every deployment’s “success” is whatever its vendor’s case study says. GAUGE is what fills that gap.
GAUGE is not a maturity model. Maturity models answer “where are we on the adoption curve?” GAUGE answers “how likely is this specific deployment to hold up under regulatory, security, change-management, and commercial pressure over the next 18 months?” It is a diagnostic lens, not a certification, not a consulting frame, not a procurement gate on its own.
The six dimensions
Each dimension scores 0–5 against the rubric below. Dimensions and weights are fixed for v1. The methodology is itself on a 12-month review cadence — any change is dated and logged publicly.
1. Governance maturity · weight 20%
Federated model registry, approval workflows, deprecation policy. Does the enterprise know which agents it has deployed, who approved each one, and how they get retired?
- 0 — no registry, no approval workflow, ad-hoc deployments
- 3 — partial registry, central approval on high-risk agents only, deprecation handled case-by-case
- 5 — complete registry covering every agent in the enterprise, tiered approval workflow with documented criteria per tier, written deprecation policy with lifecycle dates
2. Threat model · weight 20%
Prompt injection, cross-agent delegation, data exfiltration, data poisoning. Does the deployment have a documented threat model that addresses the specific attack surface agents introduce — not just a standard application threat model copy-pasted?
- 0 — no threat model, or a traditional app-security model applied without agent-specific additions
- 3 — threat model exists, addresses prompt injection and basic delegation abuse
- 5 — threat model covers all four vectors, documented test scenarios, tabletop exercises run within the last 12 months
3. ROI evidence · weight 15%
Documented productivity lift with named baselines and measurement method. If the ROI claim were audited tomorrow, would it survive?
- 0 — vendor-reported ROI only, no internal measurement
- 3 — internal measurement exists, baseline documented, one validation round completed
- 5 — baselines and methodology documented, results reviewed quarterly, at least one audit round by someone outside the deploying team
4. Change management · weight 15%
Training completion, adoption metrics, scope-change governance. Do the people affected by the agent actually use it, and do they know when scope expands?
- 0 — deploy-and-hope, no training plan, no adoption tracking
- 3 — training exists, adoption metrics collected but not acted on
- 5 — training program with completion tracking, adoption metrics reviewed monthly, scope-change review board for any expansion beyond the original charter
5. Vendor lock-in · weight 15%
Data export, model portability, exit clauses. If the vendor fails, is acquired, or doubles the price in 18 months — what happens to the deployment?
- 0 — no data export path, no portability plan, standard vendor contract with no material exit clauses
- 3 — data export available, some model portability, exit clauses triggered on catastrophic vendor events only
- 5 — data export tested quarterly, architecture is model-portable (verified in staging), exit clauses include rate-change triggers and service-degradation triggers
6. Compliance posture · weight 15%
EU AI Act, GDPR, SOC 2 or ISO 27001, plus sector-specific frameworks (NIS2, DORA for financial services; HIPAA for healthcare; FedRAMP for public sector). Does the deployment’s compliance documentation match its actual use case?
- 0 — no specific compliance framework addressed
- 3 — primary framework addressed, documentation incomplete
- 5 — all applicable frameworks addressed, documentation complete, third-party review or formal self-assessment within the last 12 months
Scoring formula
Each dimension scores an integer 0–5. The weighted sum is scaled to a 0–100 total.
Score = (gov × 0.20 + threat × 0.20 + roi × 0.15 + change × 0.15 + lockin × 0.15 + compliance × 0.15) × 20
Worked example. A deployment scores: governance 4, threat 3, ROI 3, change management 4, vendor lock-in 2, compliance 4.
= (4 × 0.20 + 3 × 0.20 + 3 × 0.15 + 4 × 0.15 + 2 × 0.15 + 4 × 0.15) × 20
= (0.80 + 0.60 + 0.45 + 0.60 + 0.30 + 0.60) × 20
= 3.35 × 20
= 67 / 100
The example scores in the “functional governance, specific dimensions weak” band. The vendor lock-in score of 2 is the dominant risk — that’s where this enterprise would spend its next 90 days.
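The formula and the worked example above can be sketched as a short script. The dimension keys and weights come straight from the rubric; the function name and validation details are illustrative, not part of the methodology:

```python
# Fixed v1 weights from the rubric (sum to 1.0).
WEIGHTS = {
    "governance": 0.20,
    "threat": 0.20,
    "roi": 0.15,
    "change": 0.15,
    "lockin": 0.15,
    "compliance": 0.15,
}

def gauge_score(scores: dict) -> int:
    """Weighted sum of six integer 0-5 dimension scores, scaled to 0-100."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("all six dimensions must be scored, no extras")
    for dim, s in scores.items():
        if not (isinstance(s, int) and 0 <= s <= 5):
            raise ValueError(f"{dim} must be an integer 0-5, got {s!r}")
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) * 20)

# Worked example from the text: governance 4, threat 3, ROI 3,
# change management 4, vendor lock-in 2, compliance 4.
example = {"governance": 4, "threat": 3, "roi": 3,
           "change": 4, "lockin": 2, "compliance": 4}
print(gauge_score(example))  # 67
```

A perfect 5 on every dimension yields exactly 100, and all zeros yields 0, so the 0–100 scale is fully reachable.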
What scores mean
- 0–29 · Exposed. The deployment has significant gaps across multiple dimensions. Maps to the 88% of enterprise agentic AI deployments that Stanford Digital Economy Lab’s 2026 Enterprise AI Playbook identifies as operating at or below break-even. Intervention is required before any scaling decision.
- 30–49 · Partial. Governance exists in some form but compensating controls are stretched thin. Common score for pilots about to transition to production; the risk profile changes significantly at that transition if the score doesn’t move up first.
- 50–69 · Functional. Governance is working, one or two dimensions remain structurally weak. Where most enterprise deployments currently sit. The trajectory of the score matters more than the absolute number.
- 70–84 · Durable. Maps to the 12% of deployments in the same DEL dataset that clear 300% ROI. Vendor arrangements are portable. Threat model is current. Compliance is documented to audit-survivable quality.
- 85–100 · Exceptional. Rare. Worth auditing to verify the score is real rather than self-flattering. GAUGE 2026 Index inclusion criteria require independent verification for any claimed score above 85.
The bands are intentionally unequal in size. The 69/70 boundary is the threshold where the data suggests deployments move from break-even to compounding return. The 29/30 boundary is where deployments move from “actively harmful” to “needs work.” Most intervention effort should concentrate in the 50–70 range, where incremental improvement is both achievable and consequential.
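The band lookup is a simple threshold table. Band names and upper bounds come from the list above; the function name is illustrative:

```python
# (inclusive upper bound, band name) pairs, in ascending order.
BANDS = [
    (29, "Exposed"),
    (49, "Partial"),
    (69, "Functional"),
    (84, "Durable"),
    (100, "Exceptional"),
]

def gauge_band(score: int) -> str:
    """Map a 0-100 GAUGE score to its interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for upper, name in BANDS:
        if score <= upper:
            return name
    raise AssertionError("unreachable: BANDS covers 0-100")

print(gauge_band(67))  # Functional
```

The worked example's 67 lands one point below the Durable threshold, which is the point the interpretation text makes: trajectory across the 69/70 boundary matters more than the absolute number.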
Why GAUGE, not an existing framework
Several existing frameworks address parts of what GAUGE measures:
- NIST AI Risk Management Framework covers risk identification and treatment rigorously and is the right starting reference for enterprise AI governance generally. It does not score vendor lock-in or ROI defensibility, and its Generative AI Profile (NIST AI 600-1) treats agentic behaviour as a subset of GenAI rather than its own risk class.
- ISO/IEC 42001:2023 addresses AI management systems at the organisational level — the governance structures, roles, and documentation a company puts around AI. It does not score the agent-specific attack surface: cross-agent delegation, tool-use permissions, autonomous decision scope.
- EU AI Act classifies AI use cases by regulatory risk tier (unacceptable, high, limited, minimal). It does not score how well any specific deployment is governed inside those tiers: an Annex III high-risk use case carries the same EU AI Act classification whether it is deployed with weak controls or strong ones.
- Gartner AI Maturity Model and comparable consultancy maturity models describe “where is the organisation on the adoption curve.” They do not answer “is this specific deployment durable.”
GAUGE combines the governance rigour of NIST and ISO with agent-specific threat modelling, plus three commercial dimensions (ROI evidence, change management, vendor lock-in) that determine whether a deployment survives an 18-month review cycle.
The framework is intentionally scoped to agentic AI specifically — not AI in general. The agent-specific dimensions (autonomy level, tool-use permissions, cross-agent communication) surface risks that a general AI-governance score misses. An enterprise with a strong general AI governance posture can still score 40 on GAUGE if its agentic deployments are ungoverned.
How to use GAUGE
- Self-score your deployment using the free diagnostic. Takes a governance working group 30–45 minutes. The diagnostic includes the rubric above in working-doc form, an example scored deployment for reference, and a comparison view for organisations running multiple agents.
- Track the score over time. Score quarterly. A single score is interesting; a four-quarter trajectory is actionable. Scores rising 5+ points per quarter indicate compounding governance discipline. Flat scores on repeatedly pledged improvements indicate the improvement plan isn’t resourced.
- Use the score cross-deployment. If the enterprise runs multiple agents, GAUGE surfaces which deployments are durable and which are fragile. The fragile ones are where the next incident will come from.
- Use GAUGE in procurement. A companion RFP template (60 questions mapped to the six dimensions, with evidence prompts per question) is in preparation for May 2026. Early-access copies go to newsletter subscribers via the monthly issue. Vendors who refuse to engage with a GAUGE-structured RFP tell the procurement committee something about the vendor.
The annual GAUGE Index
Starting Q4 2026, this publication publishes the GAUGE 2026 Enterprise Agent Mode Governance Index — the top 20 public enterprise deployments scored under GAUGE. The index is published annually, free to read, with full methodology and the scoring evidence for each included deployment.
Enterprises nominated for inclusion have 30 days to contest the score with documented evidence before publication. All scoring decisions are logged with change history. The index is designed to be citeable by analyst firms, journalists, and procurement committees without re-validation — if the score changes after publication, the change is dated and visible in the record for that enterprise.
If the Index scores enterprise X at 62 and enterprise X disagrees, the disagreement is public, the evidence is public, and the score updates publicly. This is the same accountability pattern as the Claim Archive: nothing is silently removed, nothing is silently updated, every revision is dated.
Download: GAUGE self-scoring diagnostic
The Excel diagnostic is a working-document version of the rubric above. It includes:
- The 0–5 scale with anchor descriptions for each dimension
- Three example-scored hypothetical deployments (low, mid, high) for calibration
- Weighted-sum formula pre-filled so edits to scores update the total automatically
- A comparison view for enterprises running multiple agents
- A 90-day intervention template keyed to the lowest-scoring dimension
The diagnostic is free. Signing up for the monthly newsletter is the delivery mechanism — the newsletter sends the diagnostic within minutes and then sends one email a month with newly archived claims, verdict changes on existing claims, and annual Index updates.
Download the GAUGE self-scoring diagnostic →
Corrections
GAUGE is on a 12-month review cadence. Changes — a dimension added or removed, a weight adjusted, a scoring anchor clarified — are dated and logged in the public record. The first methodology review lands Q2 2027.
If a dimension is mis-scoped, a weight feels wrong, or an anchor description needs tightening, submit feedback via the corrections form. Corrections that land change the public methodology; corrections that don’t land get a public response explaining why. Either way, the exchange is visible, following the same Claim Archive methodology the rest of this publication runs on.
The intent of the public-corrections model is narrow: the GAUGE methodology should be defensible under questioning, not protected from it.