Apple study finds self-organizing LLM teams underperform single experts by up to 41.1%
Researchers report that unconstrained multi-agent LLM teams fail to match their strongest member, with losses up to 41.1% on ML benchmarks, and trace the issue to expert leveraging failures.
- Self-organizing multi-agent LLM teams underperform their strongest individual member by up to 41.1% on ML benchmarks.
- The primary bottleneck is expert leveraging, not expert identification, even when teams are explicitly told who the expert is.
- Teams exhibit integrative compromise, averaging expert and non-expert views rather than deferring to the expert; this tendency grows with team size and correlates negatively with performance.
- Consensus-seeking behavior improves robustness to adversarial agents, revealing a trade-off between alignment and expertise utilization.
Apple’s Machine Learning Research group reports that self-organizing multi-agent LLM teams consistently underperform their strongest individual member across human-inspired and frontier ML benchmarks. In experiments, teams incurred performance losses of up to 41.1% compared to the best single agent, even when the team was explicitly instructed about which agent was the expert.
The study identifies expert leveraging (the ability to appropriately weight and use the expert’s contributions) as the primary bottleneck, rather than the ability to identify who the expert is. Conversational analysis shows that teams tend toward integrative compromise, averaging expert and non-expert views instead of deferring to the expert. This behavior increases with team size and correlates negatively with performance.
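A toy numeric sketch can make the cost of compromise concrete. It is not the study's method: the study analyzes conversations, while this example assumes each agent gives a single numeric answer, that the non-experts share the same biased answer, and that "compromise" means an unweighted average. All names and values here are hypothetical.

```python
def compromise(answers: list[float]) -> float:
    """Integrative compromise, modeled here as an unweighted average."""
    return sum(answers) / len(answers)


def defer(answers: list[float], expert_index: int) -> float:
    """Deference: the team adopts the expert's answer unchanged."""
    return answers[expert_index]


def errors(team_size: int, truth: float = 100.0,
           expert: float = 98.0, non_expert: float = 70.0) -> tuple[float, float]:
    """Return (compromise_error, deference_error) for one expert plus
    team_size - 1 identical non-experts. All values are made up."""
    answers = [expert] + [non_expert] * (team_size - 1)
    return (abs(compromise(answers) - truth),
            abs(defer(answers, 0) - truth))


for size in (2, 4, 8):
    avg_err, defer_err = errors(size)
    print(f"team of {size}: compromise error {avg_err:.1f}, "
          f"deference error {defer_err:.1f}")
```

Under these assumptions the expert's share of an equal-weight average shrinks as the team grows, so the compromise error rises while the deference error stays fixed.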
The authors note a trade-off: while consensus-seeking behavior improves robustness to adversarial agents, it undermines effective expertise utilization. This suggests that alignment-oriented behaviors may come at the cost of performance in cooperative multi-agent settings where expertise should be prioritized.
The work draws on organizational psychology to define team synergy as team performance that matches or exceeds that of the best individual member, a standard the LLM teams failed to meet under unconstrained coordination.
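The synergy criterion can be sketched as a simple comparison. The relative-gap formula below, (best member score minus team score) divided by best member score, is an assumption about how such a loss might be computed, not the paper's stated definition, and the scores are hypothetical rather than taken from the study.

```python
def synergy_gap(team_score: float, member_scores: list[float]) -> float:
    """Relative shortfall of the team versus its strongest member.

    Positive: the team underperforms its best member.
    Zero or negative: the team meets the synergy standard.
    """
    best = max(member_scores)
    if best <= 0:
        raise ValueError("best member score must be positive")
    return (best - team_score) / best


# Hypothetical accuracy scores, not results from the study.
members = [0.80, 0.55, 0.50]
gap = synergy_gap(team_score=0.60, member_scores=members)
has_synergy = gap <= 0
print(f"gap={gap:.1%}, synergy={has_synergy}")
```

Comparing against the maximum rather than the mean of the member scores is what makes this a strict test: a team can beat its average member and still fall short of synergy.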