Evals · May 18, 2026

Open Agent Leaderboard measures full systems, not just models, across diverse real-world tasks

A new evaluation framework tests agent generality, meaning how well the same system performs across coding, customer service, research, and other unfamiliar settings, while tracking both the quality and the cost of deployment.

1 source · cross-referenced

TL;DR
  • Hugging Face and IBM Research launched the Open Agent Leaderboard, an evaluation framework that benchmarks full agent systems (model plus architecture) rather than isolated models across a range of real-world task categories.
  • The leaderboard unifies six existing benchmarks, including SWE-Bench Verified, BrowseComp+, AppWorld, and two tau2-Bench tasks, under a common protocol to test agent generality across diverse, unfamiliar settings.
  • Results show that general-purpose agents increasingly match specialized systems, and that agent architecture choices (especially tool shortlisting) influence outcomes alongside model selection.
  • The framework reports both success rates and per-task costs, revealing that failed runs cost 20–54% more than successful ones and that deployment efficiency varies significantly across configurations.
  • All methodology, the Exgentic evaluation framework, results, and supporting paper are released openly today.

Hugging Face and IBM Research today released the Open Agent Leaderboard, an evaluation framework designed to measure how well full agent systems generalize across unfamiliar tasks. Unlike traditional benchmarks that report a single model's score on isolated tasks, this leaderboard treats the entire agent, including its tools, planning logic, memory management, and error recovery, as the unit being measured.

The framework unifies six established benchmarks under a single protocol, including SWE-Bench Verified (code repair in real repositories), BrowseComp+ (open-ended web research), AppWorld (cross-app personal task automation), and two tau2-Bench tasks for customer service and technical support. Each benchmark was designed to test different capabilities; together they approximate the diversity of real-world deployments without requiring custom tuning for each task.
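To make the idea of a single protocol concrete, here is a minimal sketch of how one agent interface could be scored against tasks from several benchmarks. It is not the Exgentic API: the names `Task`, `RunResult`, `Agent`, `evaluate`, and `EchoAgent` are hypothetical, and the exact-match scoring is a stand-in for each benchmark's real grader.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Task:
    benchmark: str  # benchmark the task comes from, e.g. "SWE-Bench Verified"
    prompt: str     # task description handed to the agent
    expected: str   # reference answer used by this toy exact-match scorer


@dataclass
class RunResult:
    benchmark: str
    success: bool
    cost_usd: float


class Agent(Protocol):
    """Hypothetical interface: any agent system exposes the same entry point."""

    def solve(self, prompt: str) -> tuple[str, float]:
        """Return (answer, cost in USD) for one task."""
        ...


def evaluate(agent: Agent, tasks: list[Task]) -> list[RunResult]:
    """Run one agent over tasks from any benchmark, recording outcome and cost."""
    results = []
    for task in tasks:
        answer, cost = agent.solve(task.prompt)
        results.append(RunResult(task.benchmark, answer == task.expected, cost))
    return results


class EchoAgent:
    """Toy agent for illustration: upper-cases the prompt at a fixed cost."""

    def solve(self, prompt: str) -> tuple[str, float]:
        return prompt.upper(), 0.01
```

Because the agent only sees a prompt and returns an answer plus a cost, the same system can be run unchanged on every benchmark, which is the property the leaderboard is meant to test.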

The results reveal that agent architecture substantially influences performance independently of model selection. The top three configurations in early results all use the same underlying model but achieve different success rates and costs, demonstrating that implementation choices matter. Notably, tool shortlisting, which lets an agent focus on the relevant tools rather than search through all available ones, improved performance across every tested model and converted several failing configurations into viable ones.
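The source does not describe how shortlisting is implemented in the tested agents. As an illustration only, the sketch below ranks tools by word overlap between the task text and each tool's name and description, then keeps the top `k`. The function names, the overlap heuristic, and the example tools are all assumptions.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(task: str, tools: dict[str, str], k: int = 3) -> list[str]:
    """Return up to k tool names whose name or description shares words with the task.

    `tools` maps a tool name to a one-line description (hypothetical format).
    """
    task_words = tokenize(task)
    scored = []
    for name, description in tools.items():
        overlap = len(task_words & tokenize(name + " " + description))
        scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Tools with no overlap at all are dropped even if fewer than k remain.
    return [name for overlap, name in scored[:k] if overlap > 0]


# Hypothetical tool catalogue, for illustration only.
example_tools = {
    "send_email": "Send an email message to a contact",
    "play_song": "Play a song in the music app",
    "create_note": "Create a text note",
}
print(shortlist_tools("Email the meeting summary to my contact Sam", example_tools, k=1))
```

A production system would more likely use embeddings or a model call to rank tools, but the effect is the same: the agent's prompt carries a handful of candidate tools instead of the full catalogue.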

One finding was unexpected: general-purpose agents without task-specific optimization already match or exceed specialized systems built for individual benchmarks. This suggests that a single agent can increasingly handle multiple classes of work without task-specific retraining. However, failure patterns differ dramatically across systems: some fail quickly and cheaply, while others exhaust their token budget before abandoning a task. Failed runs cost 20–54% more than successful ones.

The leaderboard reports both quality (success rate per benchmark) and cost (per-task expense), allowing operators to assess whether a high-performing system is economically viable at scale. All components are released openly: the leaderboard interface, the Exgentic framework for reproducing evaluations, and a full methodology paper.
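As a rough illustration of how quality and cost can be reported side by side, the sketch below computes a success rate, a mean per-task cost, and the relative cost premium of failed runs from a list of per-run records. The `summarize` function, its record format, and the sample numbers are hypothetical and are not taken from the leaderboard's data or code.

```python
from statistics import mean


def summarize(runs: list[tuple[bool, float]]) -> dict[str, float]:
    """Summarize runs given as (succeeded, cost in USD) pairs."""
    if not runs:
        raise ValueError("at least one run is required")
    success_costs = [cost for ok, cost in runs if ok]
    failure_costs = [cost for ok, cost in runs if not ok]
    summary = {
        "success_rate": len(success_costs) / len(runs),
        "mean_cost": mean(cost for _, cost in runs),
    }
    # The premium is only defined when both outcomes occur at least once.
    if success_costs and failure_costs:
        # 0.5 means failed runs cost 50% more on average than successful ones.
        summary["failure_cost_premium"] = mean(failure_costs) / mean(success_costs) - 1.0
    return summary


# Made-up sample data: two successes at $1.00, two failures at $1.50.
sample_runs = [(True, 1.0), (True, 1.0), (False, 1.5), (False, 1.5)]
print(summarize(sample_runs))
```

Reporting the premium separately from the mean cost makes it visible when a configuration's spending is dominated by runs that never finish the task.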

Sources
  1. Hugging Face, "The Open Agent Leaderboard"

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.