Evals · Jun 26, 2026

Researchers propose broader evaluation dimensions for AI agents after benchmark accuracy saturates

A new arXiv preprint argues accuracy-centric benchmarks are insufficient once performance plateaus, and introduces an updated benchmark suite and human-collaboration experiments to assess efficiency, reliability, and other dimensions.

Trust: 79

Hype: Low

1 source · cross-referenced


TL;DR

Benchmarks are often retired or replaced once accuracy saturates, but doing so overlooks other evaluation dimensions such as construct validity, efficiency, and reliability.
The authors introduce CORE-Bench v1.1 and an out-of-distribution task suite (CORE-Bench OOD) to evaluate agents beyond accuracy.
A small-scale randomized experiment finds that human-agent collaboration yields a statistically significant speedup of roughly a factor of two on computational reproducibility tasks.
The work critiques the dominant accuracy-centric evaluation paradigm and proposes a more rigorous alternative.

A new arXiv preprint argues that retiring or replacing benchmarks once accuracy saturates overlooks critical dimensions of agent performance. The authors propose expanding evaluation beyond accuracy to include construct validity, out-of-distribution generalizability, efficiency, reliability, the relative importance of the model versus scaffold, and uplift from human-agent collaboration.

The paper uses CORE-Bench Hard, a benchmark for computational reproducibility of scientific code, as a case study to demonstrate the value of multidimensional evaluation. Even after accuracy saturation, the authors find that CORE-Bench v1.1 and a new out-of-distribution task suite (CORE-Bench OOD) remain useful for measuring efficiency, reliability, model performance, and scaffold performance.

The authors also conduct a small-scale randomized experiment to quantify the uplift from human-agent collaboration on real-world computational reproducibility tasks. They report a statistically significant speedup of about a factor of two. They note that this is likely an underestimate: one-fifth of human-only reproductions failed to complete within the time limit, so the time those runs would actually have needed is not fully counted.
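The underestimate argument can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the completion times and the 120-minute limit are invented, not taken from the preprint, and the ratio of mean completion times is a stand-in for whatever estimator the authors actually use. It shows only that capping timed-out human-only runs at the limit shrinks the measured speedup.

```python
from statistics import mean

TIME_LIMIT = 120  # minutes; hypothetical limit, not from the preprint


def censor(times, limit):
    """Record any run longer than the limit as the limit itself (a timeout)."""
    return [min(t, limit) for t in times]


def speedup(human_only, human_agent):
    """Ratio of mean completion times: human-only over human-agent."""
    return mean(human_only) / mean(human_agent)


# Invented completion times in minutes. One of the five human-only runs
# (one-fifth) would need longer than the limit to finish.
true_human_only = [60, 80, 100, 110, 200]
human_agent = [30, 40, 50, 55, 60]

# What an experimenter observes: the 200-minute run is cut off at 120.
observed_human_only = censor(true_human_only, TIME_LIMIT)

true_speedup = speedup(true_human_only, human_agent)
observed_speedup = speedup(observed_human_only, human_agent)

print(f"observed speedup: {observed_speedup:.2f}")  # observed speedup: 2.00
print(f"true speedup:     {true_speedup:.2f}")      # true speedup:     2.34
```

With these made-up numbers the observed speedup is exactly 2.0, while the uncensored speedup is about 2.34, so the capped measurement acts as a lower bound on the real effect.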

The work introduces CORE-Bench v1.1 and CORE-Bench OOD as improved tools for evaluating agents beyond accuracy. These contributions are positioned as a more rigorous alternative to the dominant accuracy-centric evaluation paradigm, aiming to address construct validity issues and better reflect real-world performance.

Sources

01 · arXiv cs.AI, "Life After Benchmark Saturation: A Case Study of CORE-Bench"
