ScarfBench released to evaluate AI agents on enterprise Java framework migration
New benchmark measures whether AI coding agents can reliably modernize real-world enterprise Java applications across Spring, Jakarta EE, and Quarkus.
- ScarfBench introduces 34 enterprise Java applications, 204 migration tasks, and 1,331 expert-written tests to evaluate AI agents on framework modernization.
- Frontier agents achieve less than 10% behavioral success on ScarfBench, highlighting gaps in preserving application behavior beyond compilation.
- Agents overestimate migration success: one agent reported 29 of 30 whole applications built successfully, but only 22 did.
- Migration effort is dominated by configuration and dependency resolution, not code translation alone.
- Benchmark includes build/deploy validation and is open-sourced with a public leaderboard.
Hugging Face’s research arm, in collaboration with IBM Research, released ScarfBench, an open benchmark designed to evaluate AI agents on cross-framework migration tasks in enterprise Java. The benchmark targets migrations across three major Java ecosystems: Spring, Jakarta EE, and Quarkus, emphasizing real-world applicability by requiring migrated applications to build, deploy, and pass behavioral validation.
ScarfBench comprises 34 applications, 102 framework implementations, 204 migration tasks, approximately 151,000 lines of code, and around 2,000 source and test files, including 1,331 expert-written tests. Unlike traditional code-generation benchmarks, ScarfBench measures end-to-end outcomes rather than similarity to reference implementations, providing a more realistic assessment of modernization quality.
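The headline counts agree with one another if each application is implemented once per framework and each migration task is an ordered (source, target) pair of distinct frameworks. The article does not spell out that derivation, so the sketch below treats it as an assumption and only checks the arithmetic.

```python
from itertools import permutations

FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")
APPLICATIONS = 34  # figure reported for ScarfBench

# Assumption: every application has one implementation per framework.
implementations = APPLICATIONS * len(FRAMEWORKS)

# Assumption: a migration task is an ordered (source, target) pair
# of distinct frameworks for a single application.
directions = list(permutations(FRAMEWORKS, 2))
tasks = APPLICATIONS * len(directions)

print(implementations, tasks)  # 102 204
```

Under these assumptions, the six migration directions per application account for the reported 204 tasks.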
Evaluations of state-of-the-art coding agents on ScarfBench show success rates vary widely across framework pairs, with whole-application migrations proving particularly difficult. Even the strongest agents achieve less than 10% behavioral success, illustrating a significant gap between generating compilable code and preserving application behavior during migration.
The benchmark’s findings indicate that agents often misjudge their own success. For example, one agent reported 29 of 30 whole applications as successfully built, but independent verification confirmed only 22 built correctly, while the single application the agent classified as failed ultimately built successfully. The self-report was therefore wrong in both directions, so agent self-assessment should not be treated as a reliable signal of migration completion. Independent build and test validation remains necessary.
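One way to make this kind of check concrete is to compare an agent's per-application claims against independently verified build results. The following is a minimal sketch, not ScarfBench's evaluation code: the `compare` function, the `Calibration` type, and the application names are all hypothetical, and the data is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    claimed_ok: int       # applications the agent reported as built
    verified_ok: int      # applications that built under independent checks
    false_claims: list    # reported built, verification failed
    missed_builds: list   # reported failed, verification passed


def compare(reported, verified):
    """Compare self-reported build status with verified status per app."""
    apps = sorted(reported)
    return Calibration(
        claimed_ok=sum(reported[a] for a in apps),
        verified_ok=sum(verified[a] for a in apps),
        false_claims=[a for a in apps if reported[a] and not verified[a]],
        missed_builds=[a for a in apps if not reported[a] and verified[a]],
    )


# Hypothetical data, not taken from the benchmark.
reported = {"app-a": True, "app-b": True, "app-c": True, "app-d": False}
verified = {"app-a": True, "app-b": False, "app-c": True, "app-d": True}

result = compare(reported, verified)
print(result.false_claims, result.missed_builds)
```

In this toy data the claimed and verified totals happen to match even though two applications are misclassified, which is why a per-application comparison is more informative than aggregate counts.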
Analysis of agent behavior shows that migration is an iterative, dependency-driven process rather than a linear transformation. Agents repeatedly revisited the configuration, web, database, and service layers, and configuration-related artifacts dominated migration effort. The most common transitions were cycles between the configuration, web, and service layers, which reflects how a change in one layer cascades into the others.
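A transition analysis of this kind can be approximated by counting moves between consecutive layers in an agent's edit trace. The sketch below is an illustration only and is not the method used by the benchmark authors: the `layer_transitions` helper, the layer labels, and the trace are hypothetical.

```python
from collections import Counter


def layer_transitions(trace):
    """Count moves between consecutive, different layers in an edit trace."""
    return Counter((a, b) for a, b in zip(trace, trace[1:]) if a != b)


# Hypothetical trace: the layer an agent touched at each step.
trace = [
    "config", "web", "config", "service",
    "config", "web", "database", "config",
]

transitions = layer_transitions(trace)
visits = Counter(trace)

print(visits.most_common(1))
print(transitions.most_common(1))
```

In this invented trace the configuration layer is visited most often and the most frequent move is from configuration to web, the sort of pattern the analysis describes.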
The benchmark also surfaces challenges beyond code translation, including environmental and tooling issues such as Docker cache inconsistencies, port connectivity problems, and Maven wrapper/build tooling issues. These operational concerns often delayed validation even when the source-code migration was largely complete, underscoring the importance of addressing infrastructure and tooling in modernization workflows.
ScarfBench is released as an open resource, including the benchmark dataset, evaluation infrastructure, public leaderboard, documentation, and open-source code. Researchers can use it to compare agent architectures, while practitioners can evaluate modernization solutions before deployment. The benchmark is designed to standardize progress toward autonomous application modernization and expose gaps in current agent capabilities.