Evals · Jun 23, 2026

Apple study finds LLM-as-a-judge panels provide roughly two independent votes’ worth of information despite nine judges

Researchers propose a framework to measure the informational value of multi-model evaluation panels and find correlated errors limit reliability.

1 source · cross-referenced

TL;DR
  • A nine-judge LLM-as-a-judge panel provides only about two independent votes’ worth of information due to correlated errors.
  • The panel’s actual accuracy falls 8–22 percentage points short of what independent voting would achieve.
  • Adding more judges or smarter aggregation algorithms closes at most 11% of the gap, and the best single judge matches or outperforms the full panel.
  • The bottleneck is correlated judges, not aggregation methods, implying scaling panels cannot substitute for independent evaluation.

Apple Machine Learning Research introduced a framework to quantify the true informational value of LLM-as-a-judge evaluation panels, testing a panel of nine frontier large language models from seven model families. The study found that the nine judges effectively provide only about two independent votes’ worth of information, with roughly three-quarters of the panel’s nominal independence lost due to correlated errors on the same items.

The research tested the panel on three natural language inference datasets, each with 100 human annotations per item, and found the panel’s actual accuracy fell 8–22 percentage points short of what independent voting would achieve. The best single judge matched or outperformed the full panel across all conditions, indicating that aggregation of correlated votes does not reliably improve evaluation outcomes.
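The "independent voting" benchmark can be illustrated with the classic Condorcet calculation. The sketch below is a simplification, not the paper's code: it assumes a binary task, an odd panel size, and a single shared per-judge accuracy `p`. The function names and the example accuracy are hypothetical.

```python
from math import comb


def independent_majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent judges is correct.

    Assumes a binary task and that every judge is correct with the same
    probability p, independently of the others (the Condorcet setting).
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number to avoid ties")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


def shortfall_points(observed_accuracy: float, p: float, n: int) -> float:
    """Gap, in percentage points, between the independent-voting ideal
    and an observed panel accuracy."""
    return 100.0 * (independent_majority_accuracy(p, n) - observed_accuracy)


# Illustrative only: 0.70 is a made-up per-judge accuracy, not a figure
# reported in the study.
ideal = independent_majority_accuracy(0.70, 9)
print(f"independent nine-judge ideal at p=0.70: {ideal:.3f}")
```

Under these assumptions, the ideal rises with panel size whenever `p` exceeds 0.5. A panel whose judges err on the same items falls short of it.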

Established methods for improving panel performance, such as adding more judges or using smarter aggregation algorithms, closed at most 11% of the gap between the panel’s performance and the independent-voting ideal, even when the correct answers were available to them. The authors attribute the shortfall to correlation among the judges rather than to limitations in aggregation techniques.

The study’s conclusions held across variations in prompt design, temperature settings, and chain-of-thought reasoning, and also on a pairwise preference task (RewardBench). The authors quantified the findings with the Kish effective sample size (n_eff) and a Condorcet null model, and concluded that the deficit is structural rather than algorithmic.
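As a rough illustration of how nine votes can shrink to about two, the sketch below uses the common equal-correlation form of the effective sample size, n_eff = n / (1 + (n - 1) * rho). This is an assumption for illustration: the article does not say how the paper computes n_eff, and the function names are hypothetical.

```python
def effective_votes(n: int, rho: float) -> float:
    """Effective number of independent votes for n judges whose errors
    share one average pairwise correlation rho (equicorrelation assumption)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this sketch only handles rho in [0, 1]")
    return n / (1.0 + (n - 1) * rho)


def implied_correlation(n: int, n_eff: float) -> float:
    """Invert the formula above: the average correlation that would
    reduce n judges to n_eff effective votes."""
    if n < 2 or not 1.0 <= n_eff <= n:
        raise ValueError("need n >= 2 and 1 <= n_eff <= n")
    return (n / n_eff - 1.0) / (n - 1)


# Nine judges reduced to two effective votes, as in the headline figure.
rho = implied_correlation(9, 2.0)
print(rho)                      # 0.4375
print(effective_votes(9, rho))  # 2.0
```

Under this simplified model, an average error correlation of about 0.44 is enough to reduce nine judges to two effective votes. That value comes from the formula, not from the study.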

Sources
  1. Apple Machine Learning Research, "Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels"

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.
