Lenz study finds 67% disagreement among top AI models on real-world claims
Analysis of 1,000 organic user submissions shows that majority consensus among AI systems does not equate to ground truth, particularly for nuanced or recent queries.
A study published on 28 May 2026 by the fact-checking platform Lenz found that five leading large language models disagreed on the verdict for 67% of 1,000 real-world fact-check claims. The research analysed organic user submissions collected between February and May 2026 and compared the verdicts of GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, and Sonar Pro. The findings indicate that while the models showed non-trivial agreement, they frequently diverged, particularly on nuanced or middle-ground verdicts.
The analysis required the models to classify each claim into one of four categories: True, Mostly True, Misleading, or False. An option to abstain was excluded so that every model returned a definitive classification and the cross-model comparison stayed symmetric. The study calculated an ordinal Krippendorff’s alpha of 0.639, indicating limited agreement among the models; the value sits just below the 0.667 that Krippendorff’s guidance treats as the floor for drawing even tentative conclusions. This suggests that the models' verdicts are structured rather than random, but not consistent enough to treat the panel as a single interchangeable judge.
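For readers who want to see how an ordinal Krippendorff's alpha is obtained from a panel of verdicts, the following is a minimal sketch, not the study's own code. It assumes complete data (every model rates every claim), assumes the four categories form an ordered scale from False up to True, and uses invented toy verdicts; the names `LABELS`, `RANK` and `krippendorff_alpha_ordinal` are hypothetical.

```python
from itertools import permutations

# Assumed ordinal scale, lowest to highest. The ordering is an assumption.
LABELS = ["False", "Misleading", "Mostly True", "True"]
RANK = {label: i for i, label in enumerate(LABELS)}


def krippendorff_alpha_ordinal(units, num_levels=len(LABELS)):
    """Ordinal Krippendorff's alpha.

    units: one list of integer ranks per claim, one rank per model.
    """
    # Coincidence matrix: each ordered pair of ratings within a claim
    # contributes 1 / (m - 1), where m is the number of ratings for the claim.
    coincidence = [[0.0] * num_levels for _ in range(num_levels)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating carries no agreement information
        for i, j in permutations(range(m), 2):
            coincidence[ratings[i]][ratings[j]] += 1.0 / (m - 1)

    totals = [sum(row) for row in coincidence]  # how often each level occurs
    n = sum(totals)

    def delta_sq(c, k):
        # Ordinal distance: squared count of ratings lying between the two
        # levels, with each endpoint level counted at half weight.
        lo, hi = min(c, k), max(c, k)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2.0) ** 2

    levels = range(num_levels)
    observed = sum(coincidence[c][k] * delta_sq(c, k)
                   for c in levels for k in levels)
    expected = sum(totals[c] * totals[k] * delta_sq(c, k)
                   for c in levels for k in levels) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when every rating is identical")
    return 1.0 - observed / expected


# Invented toy data: three claims, five verdicts each.
toy_claims = [
    ["True", "True", "Mostly True", "True", "True"],
    ["False", "Misleading", "False", "False", "Misleading"],
    ["Mostly True", "Misleading", "True", "Mostly True", "False"],
]
toy_units = [[RANK[v] for v in claim] for claim in toy_claims]
print(round(krippendorff_alpha_ordinal(toy_units), 3))
```

An alpha of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. The ordinal distance penalises a True versus False split more heavily than a True versus Mostly True split.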
On 34% of claims, at least two models chose verdicts two or more buckets apart, for example True versus Misleading or, at the extreme, True versus False. Gaps of that size indicate substantive disagreement rather than simple calibration shifts. The study also stresses that majority agreement among AI models does not equate to ground truth: a majority verdict is sometimes wrong and a lone dissenting model is sometimes right, so the majority serves only as a structural reference point for measuring disagreement, not as a stand-in for correctness.
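The bucket-distance and majority figures can be illustrated with a short sketch. It is not the study's method: it assumes the four categories are ordered from False up to True, and it assumes that "disagreement" on a claim means the models did not all return the same verdict. The function names are hypothetical.

```python
from collections import Counter

# Assumed ordinal scale, lowest to highest. The ordering is an assumption.
LABELS = ["False", "Misleading", "Mostly True", "True"]
RANK = {label: i for i, label in enumerate(LABELS)}


def verdict_spread(verdicts):
    """Distance in buckets between the most extreme verdicts on one claim."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def disagreement_summary(claims):
    """Share of claims with any disagreement, and with a spread of 2+ buckets."""
    total = len(claims)
    any_disagreement = sum(1 for c in claims if len(set(c)) > 1)
    wide_spread = sum(1 for c in claims if verdict_spread(c) >= 2)
    return {
        "any_disagreement": any_disagreement / total,
        "spread_2_plus": wide_spread / total,
    }


def majority_verdict(verdicts):
    """Label chosen by more than half of the models, or None if there is none."""
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count > len(verdicts) / 2 else None
```

Note that `majority_verdict` only reports where most of the panel landed. As the study points out, that is a reference point for measuring disagreement and says nothing about whether the majority is correct.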
Gemini 3 Pro and Gemini 3 Pro + Search showed the highest pairwise agreement at 75%, a result attributed to their shared base model. In contrast, Claude Opus 4.7 and Gemini 3 Pro showed the lowest pairwise agreement at 53%. The study excluded Lenz’s own verdicts to focus solely on frontier-model disagreement, noting that a meaningful accuracy comparison requires human-labelled ground truth, which is not yet available for this corpus.
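Pairwise agreement of this kind can be sketched as the share of claims on which two models return exactly the same verdict. The exact-match definition is an assumption, since the article does not spell out how the study computed it, and `pairwise_agreement` and the model keys used below are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(verdicts_by_model):
    """Exact-match rate for every pair of models.

    verdicts_by_model: dict mapping a model name to its list of verdicts,
    with all lists aligned so that index i refers to the same claim.
    """
    rates = {}
    for a, b in combinations(sorted(verdicts_by_model), 2):
        va, vb = verdicts_by_model[a], verdicts_by_model[b]
        if len(va) != len(vb) or not va:
            raise ValueError("verdict lists must be non-empty and equal length")
        matches = sum(x == y for x, y in zip(va, vb))
        rates[(a, b)] = matches / len(va)
    return rates


# Invented toy data: three placeholder models, four claims.
toy = {
    "model_a": ["True", "False", "True", "True"],
    "model_b": ["True", "False", "Misleading", "True"],
    "model_c": ["False", "False", "Misleading", "True"],
}
for pair, rate in pairwise_agreement(toy).items():
    print(pair, rate)
```

Exact-match agreement treats a True versus Mostly True split the same as a True versus False split, which is why it is usually read alongside an ordinal measure rather than on its own.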
Overall, the study shows that significant disagreement persists even among top-tier frontier systems when they face recent, uncurated real-world queries. These claims were unlikely to appear in any training corpus with a gold label attached, which rules out simple pattern-matching against memorised benchmark answers. A companion study is planned to human-label the corpus and compare both the frontier models and Lenz’s verdicts against human ground truth.


