How the Claim Archive works
The methodology behind every archived claim on Agent Mode AI.
The Claim Archive exists because of a gap that bothered us until we filled it.
Every week, enterprise AI vendors, analyst firms, academics, and major publications make specific, testable, measurable claims about what agentic AI can and can’t do. Most of those claims are never tested again. They enter the discourse, become received wisdom, and stay cited in vendor pitches and strategy decks long after the evidence around them has shifted — sometimes in their favour, often not.
No publication tracks this systematically. Analyst firms don’t publish their own track record. Vendor announcements disappear into press archives with no mechanism for verification. The consensus about “what we collectively believed in Q2 2026” gets reconstructed after the fact through selective memory. That gap is what this archive fills.
The promise is narrow and literal: every significant claim we log gets reviewed on a schedule, evaluated against current evidence, and given a public verdict that moves as the evidence moves. Nothing is quietly removed. Nothing is silently updated. Every change is dated and visible.
What gets logged
The archive is deliberately bounded. We log claims from six source categories:
Vendor announcements from the 15 enterprise AI vendors we track: Anthropic, OpenAI, Google, Microsoft, Meta, Salesforce, ServiceNow, Oracle, SAP, IBM, Palantir, Databricks, Snowflake, Hugging Face, and UiPath. Press releases, earnings calls, keynote announcements, product pages.
Analyst reports from five firms: Gartner, Forrester, IDC, McKinsey Global Institute, and BCG.
Peer-reviewed research from recognised venues: NeurIPS, ICML, ICLR, ACL, EMNLP, FAccT, AIES, Carnegie Mellon benchmark releases, and the Stanford AI Index.
Tier-1 publication reporting from The Information, WSJ tech, FT tech, Reuters, Bloomberg, HBR, and MIT Sloan Management Review.
Regulatory and governmental documents including EU AI Act guidance, NIST publications, national AI strategies, and competition-authority rulings.
Consensus snapshots — the only category that requires a special procedure. A consensus snapshot only enters the archive when at least seven dated pieces of evidence from four distinct tracked sources converge on the same position. This procedure exists because “what everyone was saying in March” is precisely the kind of claim most prone to projection. The high bar makes it rare and expensive, which is the point.
For a claim from any of these sources to be logged, it also has to be testable and significant. Aspirational statements and marketing rhetoric don’t qualify — the claim must be something evidence can speak to. And not every testable claim makes the cut; we prioritise claims that are visibly influencing enterprise AI decisions somewhere.
Everything else is explicitly out of scope: private statements, social media below the tier-1 threshold, forum discussions, unattributed blog posts, marketing from vendors outside the tracked list, and consumer AI claims without enterprise implications. The scope rule itself is versioned; changes require a fourteen-day public-visibility window before taking effect.
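For readers who prefer the rules in executable form, here is a minimal sketch of the logging gate. The structure and function names are our own illustration, not a published schema; the thresholds are the ones stated above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class EvidenceItem:
    source: str       # name of a tracked source
    published: date   # only dated evidence counts


def consensus_snapshot_eligible(items: list[EvidenceItem]) -> bool:
    """A consensus snapshot enters the archive only when at least seven
    dated pieces of evidence from at least four distinct tracked sources
    converge on the same position."""
    return len(items) >= 7 and len({i.source for i in items}) >= 4


def claim_loggable(testable: bool, significant: bool) -> bool:
    """Every logged claim must be testable (evidence can speak to it) and
    significant (visibly influencing enterprise AI decisions somewhere)."""
    return testable and significant
```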
What a verdict means
Every archived claim carries one of three active verdicts, reviewed on a 30-, 60-, 90-, or 180-day cadence depending on how fast the underlying evidence can realistically shift:
Holding. Current evidence continues to support the claim substantially. No significant counterevidence has surfaced.
Strengthened. Evidence published since the original claim confirms or extends its scope. The claim holds with higher confidence than at log-time.
Weakened. Partial counterevidence has surfaced. The claim remains true in some form but with narrower scope, caveats, or reduced magnitude.
A fourth status, Retracted, exists only for archive entries withdrawn due to sourcing or methodological errors on our part. These are rare, and the original record stays visible with a dated explanation.
Claims can also carry a Context-shifted label independent of verdict — when the underlying benchmark, market category, or measurement framework has changed structurally. A claim can be simultaneously Holding and Context-shifted. This captures nuance without becoming an escape-hatch verdict.
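The verdict vocabulary is small enough to write down as a data model. The sketch below is illustrative only (the type and field names are ours, not an actual schema), but the values and the review cadence are exactly those described above.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    HOLDING = "holding"            # evidence continues to support the claim
    STRENGTHENED = "strengthened"  # later evidence confirms or extends its scope
    WEAKENED = "weakened"          # partial counterevidence; narrower scope or caveats


class Status(Enum):
    ACTIVE = "active"
    RETRACTED = "retracted"        # withdrawn for our own sourcing or method errors


REVIEW_CADENCE_DAYS = (30, 60, 90, 180)  # chosen per claim by how fast evidence can shift


@dataclass
class ClaimVerdictState:
    verdict: Verdict
    status: Status = Status.ACTIVE
    context_shifted: bool = False   # label carried independently of the verdict
    review_cadence_days: int = 90   # must be one of REVIEW_CADENCE_DAYS
```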
Why we stop at Weakened
You won’t see a verdict on this archive harsher than Weakened, even when the evidence might support stronger language. This is a deliberate editorial choice, not a methodological limitation.
We are a publication operating without legal retainers or external review layers. Methodology is our sole defence against the natural consequences of saying “this claim from this named vendor no longer holds.” Weakened is the verdict whose evidentiary standards we can fully underwrite every time — strongly supported by multiple primary sources, its reasoning publicly visible, its counter-evidence documented. It is the strongest defensible line we can hold consistently.
Stronger verdicts would require infrastructure we have chosen not to build. Weaker verdicts would understate what the evidence shows. Weakened is where our standards and our capacity meet.
When the evidence against a claim is decisive — when the source itself has retracted the claim, or peer-reviewed research weighs conclusively against it — this is reflected in the strength of the review memo, not in an escalated verdict label. Read the memo. The reasoning is there.
The counter-evidence discipline
Every Weakened verdict in this archive carries a mandatory field: counter-evidence considered.
This field lists the strongest arguments against the verdict — the evidence that a sharp critic of our position would raise — and explains why we don’t find them persuasive. It is public on every claim page.
This exists because a verdict that only shows the evidence supporting it is not a verdict; it’s an argument. A verdict that shows both the evidence for and the evidence against, and still stands, is something harder to argue with. Two primary sources minimum for every Weakened claim. Counter-arguments visible. Reasoning documented.
If we can’t populate counter-evidence considered honestly, we don’t publish the verdict. The field is the hinge on which our methodology turns.
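A minimal sketch of how that gate could be enforced mechanically (the field names are illustrative; the two-source minimum and the mandatory counter-evidence field are the rules stated above):

```python
def weakened_verdict_publishable(primary_sources: list[str],
                                 counter_evidence_considered: str) -> bool:
    """A Weakened verdict may be published only with at least two primary
    sources and an honestly populated counter-evidence field."""
    return len(primary_sources) >= 2 and bool(counter_evidence_considered.strip())
```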
How reviews are made
Every review follows the same two-stage workflow.
Claude conducts the evidence-gathering: searches for subsequent coverage of the claim, statements from the original source, published counterevidence, related metrics, academic follow-up. Claude weighs the evidence, drafts a verdict, and writes the review memo with citations.
Then Claude presents the findings conversationally. The standard opening question from the human side is adversarial: “What counter-evidence would a sharp critic raise against this verdict, and how have you weighed it?” Claude responds with the explicit counter-evidence analysis. Only after this adversarial check does the review get signed and published.
Every published review carries one of two public oversight labels:
Peter-reviewed signoff — the standard workflow above. Claude-led research, conversational presentation, adversarial check, signoff. This is most reviews.
Peter-led deep review — used for signature reviews, significant industry claims, and topics requiring primary-source reading. Peter reads the sources directly, typically one to two hours per review. Roughly one review in five.
Which label applies to each review is visible at the top of the review memo. Readers can weight verdicts accordingly. Some readers will give more weight to Peter-led deep reviews; others will treat them equally. That is the reader’s judgement to make, and they can only make it if the distinction is visible.
We chose this labelling over pretending all reviews have identical depth of human oversight. The production reality is that we publish roughly five reviews per week with one human doing the signoff work. The alternative to transparent labelling is misleading uniformity. We chose transparency.
What this makes the archive for
Three uses drive the design.
Citation infrastructure. Every claim has a permanent URL that never changes. A CIO writing an internal memo can link to a specific claim with its current verdict — not as a blog reference, but as a dated, sourced, verified-or-not artefact. Internal memos are where AI buying decisions actually get justified; we want to be citable at that level.
An audit trail of discourse. Over months and years, this becomes a primary historical source for anyone studying how enterprise AI discourse evolved. Which vendor claims aged well. Which analyst predictions held up. Which consensus positions quietly dissolved. This is the archive’s longest-lived value and the reason we treat permanence as sacred.
Evergreen content. Every review published here is a content event — a LinkedIn post, a newsletter item, sometimes a full article when the pattern across multiple reviews becomes worth explaining. The archive doesn’t compete with the publication; it feeds it.
Source preservation
Every source URL in the archive is paired with an archive.org Wayback Machine snapshot generated at logging time. When vendor blog posts get moved, when analyst reports fall behind paywalls, when news sites migrate to new platforms — the original claim, at the date we logged it, remains verifiable via the snapshot.
This preservation layer is what makes the archive’s integrity claim meaningful over long horizons. Without it, the archive becomes a list of claims linked to dead pages within five years.
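A minimal sketch of the pairing step, assuming the Internet Archive's public Save Page Now and availability endpoints (a production version would add retries, rate-limit handling, and authenticated capture):

```python
import json
import urllib.parse
import urllib.request


def wayback_snapshot(source_url: str) -> str | None:
    """Trigger a Wayback Machine capture of source_url at logging time and
    return the closest archived snapshot URL, or None if nothing is archived."""
    # Best-effort capture request; ignore failures and fall back to lookup.
    try:
        urllib.request.urlopen("https://web.archive.org/save/" + source_url, timeout=60)
    except OSError:
        pass
    # Look up the closest existing snapshot via the availability API.
    query = urllib.parse.urlencode({"url": source_url})
    with urllib.request.urlopen(
        "https://archive.org/wayback/available?" + query, timeout=30
    ) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None
```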
Change history on every record
Every record in the archive carries a public change log. Any correction — from a typo in a claim's text to a domain-tag re-categorisation to a source-date correction — creates a dated entry showing the old value, the new value, and the reason. Publicly visible on the claim page.
This exists because the entire premise of this archive is accountability over time. A publication that claims to track what others said must also track what it has said, and be willing to show every adjustment. No silent edits.
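In practice a change-log entry needs only a handful of fields. The record below is an illustration of the shape, not our actual storage schema:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # entries are append-only; nothing is edited in place
class ChangeLogEntry:
    changed_on: date
    field: str        # e.g. "claim_text", "domain_tag", "source_date"
    old_value: str
    new_value: str
    reason: str       # why the correction was made
```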
How to use the archive
Four interfaces into the data:
The archive index at /archive/ is a filterable table of all claims, sortable by source, status, date, and domain. Use this for browsing or looking up specific claims.
Individual claim pages at /claims/[id] show a single claim with its full review history, counter-evidence, change log, and source snapshot. These are the primary citation targets. The URL at the top of any claim page is permanent and safe to link.
The insights page at /archive/insights shows patterns: verdict distributions over time, source breakdowns, activity heatmaps, status-change timelines. Useful for understanding the archive’s state in aggregate.
Feeds and bundles: an RSS feed at /archive/feed/new-claims for newly logged claims, another at /archive/feed/new-reviews for published reviews, and a third at /archive/feed/status-changes for verdict updates. Quarterly CSV and JSON bundles at /archive/data/ contain the full dataset for researchers and analysts who want to work with the data directly.
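For anyone scripting against the archive, the feeds and bundles are plain HTTP. A minimal sketch, with the base URL and the bundle filename as assumptions (check /archive/data/ for the actual bundle names):

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://agentmodeai.com"  # assumed base URL

# Newly logged claims via the RSS feed.
with urllib.request.urlopen(BASE + "/archive/feed/new-claims") as resp:
    feed = ET.parse(resp)
titles = [item.findtext("title") for item in feed.iter("item")]

# Full dataset via a quarterly JSON bundle (filename here is hypothetical).
with urllib.request.urlopen(BASE + "/archive/data/claims-2026-q2.json") as resp:
    claims = json.load(resp)
```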
Corrections
If you see an error — factual, procedural, or methodological — email corrections@agentmodeai.com. Corrections get dated, appended to the relevant record’s change log, and the affected verdict is re-reviewed if the error is material.
Amendments to this methodology
This document is the current operating standard. It will evolve. Every amendment is drafted publicly, posted with a fourteen-day visibility window before taking effect, and listed with its date on the amendments page. No methodology change retroactively alters previously logged claims or published reviews. The archive's integrity depends on historical stability.
If we apply this discipline to our own method, we earn the standing to apply it to anyone else’s claims.
Last methodology revision: 19 April 2026. See all amendments →