Evals · Apr 29, 2026

Evaluation costs, not model training, now dominate AI development budgets

As AI systems grow more complex, the cost of evaluating them—especially agent-based systems—has surpassed training in many domains, fundamentally reshaping research workflows and access barriers.

TL;DR
  • The Holistic Agent Leaderboard spent $40,000 to evaluate 21,730 agent rollouts across 9 models and 9 benchmarks, with a single frontier-model benchmark run costing up to $2,829.
  • Agent benchmarks show cost spreads of four orders of magnitude across tasks and three orders within benchmarks, driven primarily by model choice, scaffolding decisions, and token budgets.
  • In scientific ML, evaluation can cost roughly two orders of magnitude more than training: on The Well benchmark, a full baseline sweep takes 3,840 H100-hours (about $9,600) and evaluating a single new architecture takes 960 H100-hours, versus roughly 12 H100-hours to train it.
  • Static benchmark compression techniques that achieved 100–200× cost reductions no longer apply effectively to agentic evaluations due to their inherent noisiness, scaffold sensitivity, and multi-turn variance.
  • Pareto-efficient agent configurations achieve comparable real-world accuracy at 4.4 to 10.8× lower cost than accuracy-optimal setups, indicating substantial inefficiency in current evaluation practices.

AI evaluation infrastructure has matured into its own computational burden. The Holistic Agent Leaderboard (HAL), a standardized framework for testing agent systems across coding, web navigation, and customer service tasks, spent approximately $40,000 to conduct 21,730 evaluations across 9 models and 9 benchmarks. Individual frontier-model assessments on GAIA benchmarks alone reached $2,829 per run before any cost optimization.
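
As a rough check on these figures, the sketch below divides the reported spend by the rollout count; the averaging is purely illustrative and not part of HAL's own methodology.

```python
# Back-of-the-envelope arithmetic using only the numbers reported above.
total_spend_usd = 40_000
total_rollouts = 21_730

avg_cost_per_rollout = total_spend_usd / total_rollouts   # ~$1.84 per rollout
print(f"Average cost per rollout: ${avg_cost_per_rollout:.2f}")

# The average hides the tail: a single frontier-model benchmark run was
# reported at $2,829, i.e. roughly 7% of the entire $40,000 budget.
frontier_run_usd = 2_829
print(f"Single run as share of total budget: {frontier_run_usd / total_spend_usd:.1%}")
```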

Cost variability within agent evaluation is extreme. Pricing spread across HAL tasks spans four orders of magnitude, and within single benchmarks, three orders. A Browser-Use agent with Claude Sonnet 4 achieved 40% accuracy on Online Mind2Web at $1,577, while a SeeAct configuration reached 42% accuracy for $171: a roughly 9× cost difference in which the cheaper setup also scored 2 points higher. The comparison reflects how heavily evaluation expense depends on architecture decisions, specifically model selection, agent scaffolding (the prompting and tool-use harness around the model), and token budget, rather than on fundamental capability differences.
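
One way to make such comparisons systematic is a simple Pareto filter over (cost, accuracy) pairs. The sketch below is a minimal illustration using only the two configurations quoted above; the function name and data structure are mine, not HAL's.

```python
# Minimal Pareto filter over agent configurations: drop any configuration that
# another one beats on accuracy while also costing less. The two entries reuse
# the Online Mind2Web figures quoted above.
configs = [
    {"name": "Browser-Use + Claude Sonnet 4", "accuracy": 0.40, "cost_usd": 1577},
    {"name": "SeeAct configuration",          "accuracy": 0.42, "cost_usd": 171},
]

def pareto_efficient(configs):
    """Return configurations not dominated on both accuracy and cost."""
    survivors = []
    for c in configs:
        dominated = any(
            other["accuracy"] >= c["accuracy"] and other["cost_usd"] < c["cost_usd"]
            for other in configs
        )
        if not dominated:
            survivors.append(c)
    return survivors

print(pareto_efficient(configs))  # only the $171, 42%-accuracy configuration remains
```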

The static-era compression toolkit no longer translates to agent systems. For earlier LLM benchmarks like HELM and MMLU, researchers achieved 100–200× cost reduction while preserving ranking fidelity through subsampling techniques. Agent benchmarks, characterized by multi-turn interactions and inherent stochasticity, resist such aggressive optimization. Mid-difficulty filtering—selecting tasks within 30–70% historical pass rates—yields only 2–3.5× cost reductions, far from the gains available for fixed-answer benchmarks.
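
The sketch below illustrates the selection rule described above on synthetic task data: keep only tasks whose historical pass rate falls between 30% and 70%, then compare the cost of running that subset against the full benchmark. The task list and per-task costs are invented; only the filtering rule comes from the article.

```python
# Mid-difficulty filtering on synthetic task data: keep tasks with historical
# pass rates between 30% and 70% and estimate the resulting cost reduction.
import random

random.seed(0)
tasks = [
    {"id": i, "pass_rate": random.random(), "cost_usd": random.uniform(1, 50)}
    for i in range(500)
]

subset = [t for t in tasks if 0.30 <= t["pass_rate"] <= 0.70]

full_cost = sum(t["cost_usd"] for t in tasks)
subset_cost = sum(t["cost_usd"] for t in subset)
print(f"Kept {len(subset)}/{len(tasks)} tasks; "
      f"cost reduction ~{full_cost / subset_cost:.1f}x")
```

With pass rates spread roughly uniformly, the filter keeps around 40% of tasks and yields a reduction in the low single digits, in line with the 2–3.5× figure above.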

Scientific machine learning exhibits the most extreme evaluation-compute dominance. The Well, a scientific machine learning benchmark spanning 16 datasets from fluid dynamics to magnetohydrodynamics, requires 3,840 H100-hours to run a full baseline sweep (approximately $9,600). Evaluating a single new architecture costs 960 H100-hours. Training a neural operator on these same tasks takes a single 12-hour H100 run, making evaluation roughly 80 times more expensive than training—a reversal of traditional deep learning economics where pretraining dominated.
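
The 80× figure follows directly from the numbers above, as the arithmetic below shows; the $2.50 per H100-hour rate is implied by the article's own figures (3,840 H100-hours ≈ $9,600), not a quoted cloud price.

```python
# Worked arithmetic for the evaluation-vs-training ratio on The Well.
usd_per_h100_hour = 9_600 / 3_840             # ≈ $2.50, implied by the article

eval_hours = 960      # H100-hours to evaluate one new architecture
train_hours = 12      # a single 12-hour H100 training run

eval_cost = eval_hours * usd_per_h100_hour    # ≈ $2,400
train_cost = train_hours * usd_per_h100_hour  # ≈ $30

print(f"Evaluation is {eval_cost / train_cost:.0f}x the cost of training")  # 80x
```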

Inefficiency persists even when controlling for task difficulty. Research from CLEAR evaluated six state-of-the-art agents on 300 enterprise tasks and found that accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives while delivering comparable real-world performance. This suggests substantial waste in current evaluation practice and points toward framework-level optimization as a near-term necessity.
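
Operationally, "comparable performance at lower cost" can be read as picking the cheapest configuration within a small accuracy tolerance of the best one. The sketch below illustrates that selection rule on invented numbers; it is not CLEAR's methodology or data.

```python
# Hypothetical illustration: choose the cheapest configuration within a small
# accuracy tolerance of the best, then report the cost multiple. All values
# are invented for illustration.
configs = [
    {"name": "accuracy-optimal", "accuracy": 0.81, "cost_usd": 5400},
    {"name": "pareto-efficient", "accuracy": 0.79, "cost_usd": 620},
    {"name": "cheap-but-weak",   "accuracy": 0.55, "cost_usd": 90},
]

best = max(configs, key=lambda c: c["accuracy"])
tolerance = 0.02  # accept configurations within 2 points of the best accuracy

comparable = [c for c in configs if c["accuracy"] >= best["accuracy"] - tolerance]
cheapest = min(comparable, key=lambda c: c["cost_usd"])

print(f"{best['name']} costs {best['cost_usd'] / cheapest['cost_usd']:.1f}x more "
      f"than {cheapest['name']} for comparable accuracy")
```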

Sources
  1. Hugging Face: AI evals are becoming the new compute bottleneck

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.

Built by a person who cares about honest AI news.