Apple study finds nine-judge LLM-as-a-judge panels carry only about two independent votes’ worth of information
Researchers propose a framework to measure the informational value of multi-model evaluation panels and find correlated errors limit reliability.
- A nine-judge LLM-as-a-judge panel provides only about two independent votes’ worth of information due to correlated errors.
- The panel’s actual accuracy falls 8–22 percentage points short of what independent voting would achieve.
- Adding more judges or using smarter aggregation algorithms closes at most 11% of the gap, and the best single judge often outperforms the full panel.
- The bottleneck is correlation among the judges, not the aggregation method, which implies that scaling up panels cannot substitute for independent evaluation.
Apple Machine Learning Research introduced a framework to quantify the true informational value of LLM-as-a-judge evaluation panels, testing a panel of nine frontier large language models from seven model families. The study found that the nine judges effectively provide only about two independent votes’ worth of information, with roughly three-quarters of the panel’s nominal independence lost due to correlated errors on the same items.
The research tested the panel on three natural language inference datasets, each with 100 human annotations per item, and found the panel’s actual accuracy fell 8–22 percentage points short of what independent voting would achieve. The best single judge matched or outperformed the full panel across all conditions, indicating that aggregation of correlated votes does not reliably improve evaluation outcomes.
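The "independent voting" benchmark in the paragraph above can be illustrated with a small calculation. The sketch below is not the authors' code: it assumes a binary task, an odd number of judges, and one shared per-judge accuracy `p_correct`, none of which is stated in the source (NLI tasks typically have three labels). The function names are hypothetical.

```python
from math import comb


def independent_majority_accuracy(n_judges: int, p_correct: float) -> float:
    """Probability that a strict majority of independent judges is correct.

    Assumptions (not from the source): binary labels, odd n_judges,
    and every judge is correct with the same probability p_correct.
    """
    if n_judges % 2 == 0:
        raise ValueError("use an odd number of judges so that no ties occur")
    majority = n_judges // 2 + 1
    # Sum the binomial probabilities of k correct votes for every k that wins.
    return sum(
        comb(n_judges, k) * p_correct**k * (1 - p_correct) ** (n_judges - k)
        for k in range(majority, n_judges + 1)
    )


def shortfall_points(observed_accuracy: float, n_judges: int, p_correct: float) -> float:
    """Gap, in percentage points, between the independent ideal and an observed panel."""
    ideal = independent_majority_accuracy(n_judges, p_correct)
    return 100 * (ideal - observed_accuracy)
```

Calling `shortfall_points(observed, 9, p)` with a panel's measured accuracy and a typical per-judge accuracy gives the kind of gap the study reports. Any numbers passed in are illustrative inputs, not figures from the paper.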
Established methods for improving panel performance, such as adding more judges or using smarter aggregation algorithms, closed at most 11% of the gap between the panel’s performance and the independent-voting ideal, even when the methods were given access to the correct answers. The authors attribute this shortfall to correlation among the judges rather than to limitations in aggregation techniques.
The study’s conclusions held across variations in prompt design, temperature settings, and chain-of-thought reasoning, and carried over to a pairwise preference task (RewardBench). The authors used the Kish effective sample size (n_eff) and a Condorcet null model to quantify these findings, which supports the view that the deficit is structural rather than algorithmic.
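To make "nine judges, about two independent votes" concrete, here is a minimal sketch of an effective-sample-size calculation. The source does not say how the authors compute n_eff. The code assumes the common design-effect form, in which every pair of judges shares the same error correlation `rho`. That equal-correlation model and both function names are assumptions for illustration.

```python
def effective_votes(n_judges: int, rho: float) -> float:
    """Effective number of independent votes under equal pairwise correlation.

    Uses n_eff = n / (1 + (n - 1) * rho), the design-effect form of the
    effective sample size. rho = 0 gives n votes; rho = 1 gives one vote.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho is assumed to lie in [0, 1]")
    return n_judges / (1 + (n_judges - 1) * rho)


def implied_correlation(n_judges: int, n_eff: float) -> float:
    """Invert the formula above: the rho that yields a given n_eff."""
    if n_judges < 2:
        raise ValueError("need at least two judges to define a correlation")
    return (n_judges / n_eff - 1) / (n_judges - 1)


# Nine judges worth two effective votes, under the equal-correlation assumption.
rho = implied_correlation(9, 2.0)
print(f"implied pairwise error correlation: {rho:.4f}")
print(f"effective votes at that correlation: {effective_votes(9, rho):.1f}")
```

Under this simplified model, a nine-judge panel with n_eff of 2 corresponds to a pairwise error correlation of roughly 0.44. The correlation structure among the actual judges is likely uneven, so this is a rough reading rather than a figure from the study.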