Holding · last review 10 May 2026

GPT-5.5 (released 23 Apr 2026) and Claude Opus 4.7 (released 16 Apr 2026) are not substitutable models for an enterprise running both agentic-coding and knowledge-work workloads in 2026. GPT-5.5 leads the public evaluation evidence on agentic-coding and computer-use surfaces (Terminal-Bench 2.0 82.7% vs 69.4%; GDPval 84.9% vs 80.3%; FrontierMath Tiers 1-3 51.7% vs 43.8%) and runs roughly 72% fewer output tokens than Opus 4.7 on identical coding tasks, per Artificial Analysis. Opus 4.7 leads the public evaluation evidence on contamination-resistant coding, finance, and vision-reasoning surfaces (SWE-Bench Pro 64.3% vs GPT-5.4's 57.7%; Finance Agent v1.1 64.4%; CharXiv reasoning 78.3%; GPQA Diamond 94.2%) and reports a 36% AA-Omniscience hallucination rate against GPT-5.5's 86% on the same independent evaluation, a 50-percentage-point spread that is the load-bearing data point of any 2026 single-model standardisation decision. The procurement-architecture answer for an enterprise running both workload types is therefore three-tier routing (GPT-5.5 with Codex for agentic coding; Opus 4.7 plus retrieval augmentation for knowledge work; Mythos-via-Glasswing or Opus 4.7 with a verification layer for frontier and high-stakes-verification work), not single-model standardisation.
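The three-tier routing policy above can be sketched as a simple workload-to-model lookup. This is an illustrative sketch only: the model identifiers and workload labels below are placeholders invented for the example, not real API model names.

```python
# Illustrative three-tier routing policy, per the claim's procurement argument.
# Tier strings are placeholders, not real vendor API identifiers.

ROUTING = {
    "agentic_coding": "gpt-5.5-codex",               # tier 1: agentic coding
    "knowledge_work": "opus-4.7+retrieval",          # tier 2: knowledge work + RAG
    "frontier_verification": "mythos-or-opus-4.7+verifier",  # tier 3: high-stakes
}

def route_model(workload: str) -> str:
    """Map a workload class to the model tier the claim argues for."""
    try:
        return ROUTING[workload]
    except KeyError:
        raise ValueError(f"unknown workload class: {workload!r}")
```

A router like this makes the standardisation question explicit in code review: adding a fourth workload class forces a routing decision rather than defaulting to one vendor.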

Claim created at publish; review on a 60-day cadence (the frontier minor-cycle release tempo is six weeks, so 60 days covers roughly one minor-cycle window with margin). Anchor sources cited inline in the article:

  • Artificial Analysis AA-Omniscience evaluation (the load-bearing hallucination spread)
  • Anthropic Opus 4.7 announcement (16 Apr 2026)
  • OpenAI GPT-5.5 announcement (23 Apr 2026)
  • OpenAI GPT-5.5 system card (the under-supported '60% reduction in hallucinations' press-cycle figure)
  • Vellum and llm-stats benchmark consolidations
  • Vals.ai SWE-Bench Verified leaderboard (the independent decontaminated run that closes the apparent vendor-card gap from ~1.1 points to ~0.6 points)
  • CodeRabbit GPT-5.5 and Opus 4.7 third-party PR-review evaluations
  • The Artificial Analysis Opus 4.7 explainer covering the long-context retrieval recalibration (78.3% on Opus 4.6 to 32.2% on Opus 4.7, attributed to the model now reporting errors when information is missing rather than fabricating answers)
  • The Decoder coverage of the GPT-5.5 token-efficiency framing (~40% fewer output tokens than GPT-5.4, supporting OpenAI's '~20% effective net cost increase' claim)
  • The aibreakingwire reporting on OpenAI dropping SWE-Bench Verified from system-card disclosures over contamination concerns
  • Anthropic's published filter-rescore analysis showing Opus 4.7's margin over Opus 4.6 holds on the SWE-bench memorisation-flagged subset
Sister claims:

  • AM-147: Firefox 150 / Claude Mythos disclosure as the canonical agentic-verification reference cited in the third-tier routing section
  • AM-146: vendor 'ready-to-run' accuracy claims need a named task, baseline, and methodology; the AA-Omniscience spread is exactly that disclosure on hallucination
  • AM-145: vendor switching is bound by contract, not technical migration cost; relevant to the multi-vendor routing argument
  • AM-140: procurement-committee six pre-pilot questions; this claim adds the model-routing question on top
  • AM-130: procurement reader's four evidence classes; the AA-Omniscience benchmark sits in the 'independent third-party evaluation' class

Trigger conditions to revisit before next cadence:

  • (a) a re-run of AA-Omniscience showing the GPT-5.5 / Opus 4.7 hallucination spread compressed to under 25 percentage points; at that gap the single-model-standardisation case becomes defensible again and the routing read needs reframing
  • (b) a new model release in the GPT-5.6 / Opus 4.8 / Gemini 3.2 slot that materially reorders either the agentic-coding or the knowledge-work leaderboard
  • (c) Claude Mythos Preview moving out of Glasswing-gated access into general API availability, which collapses the third-tier routing question into the second-tier one for most enterprises
  • (d) an independent decontaminated benchmark run (Vals.ai or third-party academic) that overturns the directional reading on a load-bearing category, particularly Finance Agent v1.1 or AA-Omniscience
  • (e) a vendor disclosure from either Anthropic or OpenAI of an additional contamination signal on the SWE-Bench leaderboard that changes the procurement-defensibility reading on either side

The 60% hallucination-reduction press-cycle figure attached to GPT-5.5 is tracked as under-supported throughout: it does not appear in OpenAI's system card, which reports a 23% improvement in per-claim factual accuracy and a 3% reduction in per-response error rate.
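Trigger condition (a) is a straightforward percentage-point comparison, sketched below with the figures cited in the claim (86% vs 36%, 25-point revisit threshold). The function names are invented for this illustration.

```python
# Sketch of trigger (a): does a fresh AA-Omniscience run compress the
# hallucination spread below the 25-point revisit threshold?

def spread_pp(rate_a: float, rate_b: float) -> float:
    """Percentage-point spread between two hallucination rates."""
    return abs(rate_a - rate_b)

def standardisation_trigger(gpt_rate: float, opus_rate: float,
                            threshold: float = 25.0) -> bool:
    """True when the spread falls under the threshold, i.e. the
    single-model-standardisation case needs re-examining."""
    return spread_pp(gpt_rate, opus_rate) < threshold

# Published run: 86% vs 36% is a 50-point spread; trigger (a) is not met.
print(spread_pp(86.0, 36.0))                  # 50.0
print(standardisation_trigger(86.0, 36.0))    # False
```

A hypothetical re-run at, say, 50% vs 30% would yield a 20-point spread and fire the trigger.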

Published: 10 May 2026
Last reviewed: 10 May 2026
Next review: +57d · 09 Jul 2026
Embed this claim: iframe + oEmbed
HTML iframe
Paste-the-URL (Substack, Medium, Notion, WordPress)

The card auto-updates when the claim's status, last-reviewed date, or correction log changes. Embedders never need to refresh — the card is rendered live from the canonical record.

About this register

The Reporting register tracks claims published from articles addressed to senior enterprise IT leaders — CIOs, IT directors, heads of platform. Claims are reviewed on a 30–90 day cadence; each review either reaffirms the claim, marks one substantive part as Partial, or marks it Not holding once the underlying evidence has been overtaken.

Recent corrections in Reporting

  • AM-002 · Not holding · 06 May 2026

    URL state changed. The /the-agentic-ai-revolution-real-world-success-stories-and-strategic-insights-from-2024-2025/ slug now serves a deliberately rewritten retrospective (claimId AM-130, "Agentic AI 2024-2025 retrospective", published 04 May 2026) built against audited primary sources. The 28 Apr 2026 redirect to /retractions/ has been lifted to allow that. The AM-002 claim itself remains Not holding: the original $3.50/dollar + 70% failure-rate framing was withdrawn and is not restored. AM-130 is a separate claim with its own evidence chain. Readers arriving at /holding/AM-002 see the withdrawal here; the article link surfaces the new piece at the URL the original lived at, with this entry as the audit trail.

  • AM-121 · Holding · 2 May 2026

    Klarna walk-back primary-source upgrade — added Siemiatkowski verbatim quotes via Bloomberg-cited-by-Fortune (9 May 2025) and the Uber-style freelance hiring detail via Entrepreneur. Closes the highest-priority evidence gap from the source dossier.

  • AM-115 · Holding · 29 Apr 2026

    Initial publication 29 Apr 2026 — the first Quarterly Claim Review Bulletin. The claim itself is recursive: it asserts that the bulletin will ship quarterly, and the next review (30 Jul 2026) tests whether the Q3 bulletin actually appeared. Status starts as Holding because the claim is currently true (the Q2 bulletin shipped). The verdict at end of July 2026 will move to Holding, Partial (bulletin shipped but on a delayed cadence), or Not holding (no bulletin shipped).

Reviews coming up in Reporting

  • AM-003 · Holding · next +6d (19 May 2026)

    GPT-5 Pro's tiered-subscription model forces enterprises to classify problems by computational difficulty — $200/month…

  • AM-136 · Holding · next +22d (4 Jun 2026)

    Across the 24-month window May 2024 to April 2026, every major foundation-model provider (Anthropic, OpenAI, Google, AW…

  • AM-020 · Holding · next +36d (18 Jun 2026)

    The 40-60% TCO underestimate on enterprise agentic-AI deployments is not a cost-visibility failure — it is a cross-depa…