Evals · May 18, 2026

Open Agent Leaderboard measures full systems, not just models, across diverse real-world tasks

A new evaluation framework tests agent generality—how well the same system performs across coding, customer service, research, and other unfamiliar settings—while tracking both quality and cost of deployment.

TL;DR

Hugging Face and IBM Research launched the Open Agent Leaderboard, an evaluation framework that benchmarks full agent systems (model plus architecture), rather than isolated models, across six real-world task categories.
The leaderboard brings six existing benchmarks under a common protocol, among them SWE-Bench Verified, BrowseComp+, AppWorld, and two tau2-Bench tasks, to test agent generality across diverse, unfamiliar settings.
Results show that general-purpose agents increasingly match specialized systems, and that agent architecture choices (especially tool shortlisting) are beginning to influence outcomes alongside model selection.
The framework reports both success rates and per-task costs, revealing that failed runs cost 20–54% more than successful ones and that deployment efficiency varies significantly across configurations.
All methodology, the Exgentic evaluation framework, results, and supporting paper are released openly today.

Hugging Face and IBM Research today released the Open Agent Leaderboard, an evaluation framework designed to measure how well full agent systems generalize across unfamiliar tasks. Unlike traditional benchmarks that report a single model's score on isolated tasks, this leaderboard treats the entire agent—including its tools, planning logic, memory management, and error recovery—as the unit being measured.

The framework unifies six established benchmarks under a single protocol. Among them are SWE-Bench Verified (code repair in real repositories), BrowseComp+ (open-ended web research), AppWorld (cross-app personal task automation), and two tau2-Bench tasks for customer service and technical support. Each benchmark was designed to test different capabilities; together they approximate the diversity of real-world deployments without requiring custom tuning for each task.

The results reveal that agent architecture substantially influences performance independent of model selection. The top three configurations in early results all use the same underlying model but achieve different success rates and costs, demonstrating that implementation choices matter. Notably, tool shortlisting—enabling agents to focus on relevant tools rather than searching all available ones—improved performance across every tested model and converted several failing configurations into viable ones.
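To make the idea of tool shortlisting concrete, here is a minimal Python sketch. It is not the method used by the leaderboard or by any evaluated agent; the `Tool` type, the `shortlist_tools` function, the stopword list, and the word-overlap scoring are all assumptions chosen for illustration. A real system would more likely rank tools with embeddings or a model call.

```python
from dataclasses import dataclass

# Hypothetical stopword list; small on purpose, just enough for the example.
STOPWORDS = {"a", "an", "the", "for", "to", "from", "of"}


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def _tokens(text: str) -> set[str]:
    """Lowercased content words; a crude stand-in for a real retriever."""
    words = (w.strip(".,:;()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def shortlist_tools(task: str, tools: list[Tool], k: int = 3) -> list[Tool]:
    """Return up to k tools whose descriptions share the most words with the task."""
    task_words = _tokens(task)
    scored = [(len(task_words & _tokens(t.description)), t) for t in tools]
    # Drop tools with no overlap, then sort by overlap, highest first.
    # list.sort is stable, so ties keep their original order.
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in relevant[:k]]


# Invented tool catalogue, for illustration only.
tools = [
    Tool("search_flights", "find a flight between two airports"),
    Tool("refund_order", "issue a refund for a retail order"),
    Tool("reset_router", "restart a customer's router"),
    Tool("book_flight", "book a flight for a customer"),
]
shortlist = shortlist_tools("Book a flight from Boston to Denver", tools, k=2)
print([t.name for t in shortlist])  # ['book_flight', 'search_flights']
```

The agent would then be shown only the shortlisted tools, not the full catalogue, which shrinks both the prompt and the space of wrong tool calls.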

An unexpected finding emerged: general-purpose agents without task-specific optimization already match or exceed specialized systems built for individual benchmarks. This suggests that a single agent can increasingly handle multiple classes of work without task-specific retraining. However, failure patterns differ dramatically across systems—some fail quickly and cheaply, while others exhaust tokens before abandoning tasks, with failed runs costing 20–54% more than successful ones.
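A figure such as "failed runs cost 20–54% more" can be read as a ratio of average costs. The Python sketch below shows one way to compute it, assuming the overhead is defined as mean failed-run cost relative to mean successful-run cost; the article does not state the exact definition. The `Run` type, the `failure_cost_overhead` function, and the dollar amounts are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    success: bool
    cost_usd: float


def failure_cost_overhead(runs: list[Run]) -> float:
    """Percent by which the mean failed-run cost exceeds the mean successful-run cost."""
    ok = [r.cost_usd for r in runs if r.success]
    failed = [r.cost_usd for r in runs if not r.success]
    if not ok or not failed:
        raise ValueError("need at least one successful and one failed run")
    return (mean(failed) / mean(ok) - 1.0) * 100.0


# Made-up costs: successes average $0.50, failures average $0.75.
runs = [Run(True, 0.40), Run(True, 0.60), Run(False, 0.70), Run(False, 0.80)]
print(f"{failure_cost_overhead(runs):.0f}%")  # 50%
```

A positive result means the system spends more on tasks it fails than on tasks it solves, which is the "exhaust tokens before abandoning" pattern; a negative result would indicate failing fast and cheaply.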

The leaderboard reports both quality (success rate per benchmark) and cost (per-task expense), allowing operators to assess whether a high-performing system is economically viable at scale. All components are released openly: the leaderboard interface, the Exgentic framework for reproducing evaluations, and a full methodology paper.
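As a rough illustration of how quality and cost can be read together, the Python sketch below aggregates per-task records into a success rate, a cost per task, and a cost per successful task for each benchmark. The `Result` record, the `summarize` function, the "cost per success" metric, and the sample numbers are assumptions for this example, not the Exgentic framework's actual schema or output.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    benchmark: str
    success: bool
    cost_usd: float


def summarize(results: list[Result]) -> dict[str, dict[str, float]]:
    """Per-benchmark success rate, mean cost per task, and cost per successful task."""
    grouped: dict[str, list[Result]] = defaultdict(list)
    for r in results:
        grouped[r.benchmark].append(r)

    summary: dict[str, dict[str, float]] = {}
    for name, rows in grouped.items():
        wins = sum(r.success for r in rows)
        total_cost = sum(r.cost_usd for r in rows)
        summary[name] = {
            "success_rate": wins / len(rows),
            "cost_per_task": total_cost / len(rows),
            # Total spend divided by solved tasks; infinite if nothing was solved.
            "cost_per_success": total_cost / wins if wins else float("inf"),
        }
    return summary


# Invented records with placeholder benchmark names.
results = [
    Result("bench_a", True, 1.0),
    Result("bench_a", False, 2.0),
    Result("bench_a", True, 1.0),
    Result("bench_a", False, 2.0),
    Result("bench_b", False, 0.5),
]
for name, stats in summarize(results).items():
    print(name, stats)
```

Cost per success folds the price of failed attempts into each solved task, so a configuration with a high success rate can still compare poorly if its failures are expensive.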

Sources

01 Hugging Face: The Open Agent Leaderboard
