Researchers propose broader evaluation dimensions for AI agents after benchmark accuracy saturates
A new arXiv preprint argues accuracy-centric benchmarks are insufficient once performance plateaus, and introduces an updated benchmark suite and human-collaboration experiments to assess efficiency, reliability, and other dimensions.
- Benchmarks are often retired or replaced once accuracy saturates, but doing so overlooks other evaluation dimensions such as construct validity, efficiency, and reliability.
- The authors introduce CORE-Bench v1.1 and an out-of-distribution task suite (CORE-Bench OOD) to evaluate agents beyond accuracy.
- A small-scale randomized experiment finds that human-agent collaboration speeds up computational reproducibility tasks by roughly a factor of two, a statistically significant result.
- The work critiques the dominant accuracy-centric evaluation paradigm and proposes a more rigorous alternative.
A new arXiv preprint argues that retiring or replacing benchmarks once accuracy saturates overlooks critical dimensions of agent performance. The authors propose expanding evaluation beyond accuracy to include construct validity, out-of-distribution generalizability, efficiency, reliability, the relative importance of the model versus scaffold, and uplift from human-agent collaboration.
The paper uses CORE-Bench Hard, a benchmark for computational reproducibility of scientific code, as a case study to demonstrate the value of multidimensional evaluation. Even after accuracy saturation, the authors find that CORE-Bench v1.1 and a new out-of-distribution task suite (CORE-Bench OOD) remain useful for measuring efficiency, reliability, model performance, and scaffold performance.
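To illustrate why accuracy alone can hide differences between agents, here is a minimal sketch with invented run records. The metric definitions (reliability as the share of tasks solved on every repeated run, efficiency as cost per successful run), the `Run` and `summarize` names, and all numbers are assumptions for illustration, not the paper's definitions or results.

```python
from dataclasses import dataclass


@dataclass
class Run:
    task: str        # hypothetical task identifier
    success: bool    # whether this run reproduced the result
    cost_usd: float  # hypothetical cost of the run


def summarize(runs):
    """Compute three illustrative metrics (assumes at least one success)."""
    by_task = {}
    for r in runs:
        by_task.setdefault(r.task, []).append(r.success)
    successes = sum(r.success for r in runs)
    return {
        # Share of all runs that succeeded.
        "accuracy": successes / len(runs),
        # Share of tasks solved on every repeated run (assumed definition).
        "reliability": sum(all(v) for v in by_task.values()) / len(by_task),
        # Total spend divided by number of successful runs (assumed definition).
        "cost_per_success": sum(r.cost_usd for r in runs) / successes,
    }


def make_runs(outcomes, cost):
    """Expand {task: [outcomes]} into Run records with a flat per-run cost."""
    return [Run(task, ok, cost) for task, oks in outcomes.items() for ok in oks]


# Invented data: both agents succeed on 6 of 9 runs.
agent_a = make_runs(
    {"t1": [True, True, True], "t2": [True, True, True], "t3": [False, False, False]},
    cost=0.50,
)
agent_b = make_runs(
    {"t1": [True, True, False], "t2": [True, False, True], "t3": [False, True, True]},
    cost=1.00,
)

for name, runs in [("A", agent_a), ("B", agent_b)]:
    print(name, summarize(runs))
```

In this toy data the two agents have identical accuracy, yet agent A solves two of three tasks consistently at half the cost per success, while agent B never solves any task on all three attempts. An accuracy-only leaderboard would rank them as equal.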
The authors also conduct a small-scale randomized experiment to quantify the uplift from human-agent collaboration on real-world computational reproducibility tasks. They report a statistically significant speedup of about a factor of two and note that this is likely an underestimate: one-fifth of human-only reproductions did not finish within the time limit, so the recorded human-only times understate how long those tasks would actually have taken.
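The underestimate argument can be made concrete with a small sketch. All numbers below are invented, the time limit is hypothetical, and the ratio-of-means estimator is an assumption; the preprint's actual data and statistical method are not described here.

```python
from statistics import mean

TIME_LIMIT_MIN = 120  # hypothetical time limit, not taken from the paper

# Invented completion times in minutes; None marks a run that hit the limit.
human_only = [95, 110, None, 80, 100]  # one of five (one-fifth) unfinished
with_agent = [45, 50, 55, 40, 48]


def fill_unfinished(times, value):
    """Replace unfinished runs (None) with the given duration."""
    return [value if t is None else t for t in times]


def speedup(baseline, treated):
    """Ratio of mean completion times (an assumed estimator)."""
    return mean(baseline) / mean(treated)


# What the experiment can record: unfinished runs count as the time limit.
observed = speedup(fill_unfinished(human_only, TIME_LIMIT_MIN), with_agent)

# Suppose the unfinished run would really have taken 200 minutes.
counterfactual = speedup(fill_unfinished(human_only, 200), with_agent)

print(f"observed speedup:       {observed:.2f}x")
print(f"counterfactual speedup: {counterfactual:.2f}x")
```

Because an unfinished human-only run can only be recorded at the cap, any longer true duration raises the baseline and therefore the speedup, so the observed ratio is a lower bound under these assumptions.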
The authors position CORE-Bench v1.1 and CORE-Bench OOD as a more rigorous alternative to the dominant accuracy-centric evaluation paradigm, one intended to address construct validity issues and better reflect real-world performance.