Evals · Apr 19, 2026

Researchers introduce open-world evaluations to test AI capabilities beyond benchmark saturation

A new collaborative project called CRUX aims to measure frontier AI abilities through complex, real-world tasks rather than standardized benchmarks, with early results showing AI agents can now publish functional applications to app stores.

Trust: 65 · Hype: Low

1 source · cross-referenced

TL;DR
  • A group of 17 researchers from academia, government, industry, and civil society launched CRUX, a project to conduct 'open-world evaluations' that test AI on complex, unstructured real-world tasks instead of standardized benchmarks.
  • In an initial experiment, an AI agent successfully developed and published an iOS app to Apple's App Store with only two errors, one requiring manual correction, raising concerns about potential AI-driven app store spam.
  • Open-world evaluations involve small sample sizes and human judgment rather than automated scoring, addressing limitations of benchmarks that can be artificially optimized or fail on incidental obstacles like CAPTCHAs.
  • The methodology aims to surface emerging AI capabilities and provide early warnings about risks across domains including R&D automation and AI governance.
  • Researchers identified best practices for open-world evaluations including explicit documentation of allowed human intervention, release of agent activity logs, and detailed analysis of decision-making processes.

Researchers at the intersection of AI policy and evaluation have formalized a new methodology for testing frontier AI capabilities on messy, real-world tasks. Rather than relying on standardized benchmarks that measure narrow competencies, open-world evaluations place AI agents in authentic environments where they must navigate unpredictable obstacles, human processes, and multiple decision points. The approach emerged from observations that state-of-the-art models have begun saturating major benchmarks, creating ambiguity about whether near-ceiling scores reflect genuine capability or merely optimization to known test structures.

CRUX, a collaborative framework led by Kapoor and Narayanan, assembled researchers from academia, government agencies, nonprofits, and industry to systematize this evaluation approach. The project's first experiment tasked an AI agent with the full pipeline of mobile app development and distribution: writing code, configuring credentials, generating privacy policies, completing regulatory forms, and navigating Apple's review process. The agent completed the task with only two errors across the entire workflow: it forgot credentials it had stored, which required human correction, and it improvised a phone number for the submission.
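
To make that workflow concrete, the sketch below models the publication pipeline as a list of ordered stages with the two reported errors annotated. It is an illustration in Python, not the agent's actual code or CRUX tooling, and the assignment of each error to a specific stage is an assumption made for readability; only the errors themselves come from the report.

```python
# Hypothetical sketch: the app-publication workflow as ordered stages.
# The two error annotations paraphrase the article; which stage each
# error belongs to is assumed here, not reported by CRUX.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stage:
    name: str
    error: Optional[str] = None   # what went wrong at this stage, if anything
    human_fixed: bool = False     # whether a person had to step in


pipeline = [
    Stage("write app code"),
    Stage("configure developer credentials",
          error="agent forgot the credentials it had stored", human_fixed=True),
    Stage("generate privacy policy"),
    Stage("complete regulatory forms",
          error="agent improvised a phone number for the submission"),
    Stage("navigate App Store review"),
]

errors = [s for s in pipeline if s.error]
print(f"{len(errors)} errors, {sum(s.human_fixed for s in errors)} required human correction")
```

Counting the annotations reproduces the reported tally: two errors, one needing a human.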

The experiment revealed both capability and vulnerability. The successful publication of a functional app to a major platform's marketplace demonstrates sophisticated multi-step reasoning and integration with external systems. At the same time, the ease of the submission process raised security concerns: the researchers disclosed their findings to Apple a month before publication, warning that autonomous agents could soon generate high-volume spam submissions designed to evade manual review. The technical cost was modest, roughly $25 for development, though monitoring the process consumed $975 in compute tokens.

Open-world evaluations differ structurally from benchmarks in ways that affect how conclusions should be drawn. Benchmarks typically consist of dozens or hundreds of automatically scored tasks, enabling statistical analysis and trend lines. Open-world evaluations instead involve single high-complexity tasks evaluated through qualitative analysis of agent logs, without standardized success criteria. This methodological difference makes open-world evaluations poorly suited for comparative leaderboards but better suited for detecting emerging capabilities that benchmarks might miss because of their fixed specifications. An AI agent might score low on a web-browsing benchmark because it fails to solve a CAPTCHA, yet possess the underlying capability to complete the real task if human assistance cleared that particular hurdle.
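
To see why the two kinds of evidence are read so differently, the sketch below (illustrative Python with hypothetical names, not CRUX code) contrasts a benchmark result, which collapses many auto-scored tasks into an accuracy with a confidence interval, with an open-world run, which is a single annotated record that has to be read rather than averaged.

```python
# Illustrative contrast between benchmark-style and open-world-style results.
# All names are hypothetical; this is not code from the CRUX project.
from dataclasses import dataclass, field
from math import sqrt


def benchmark_summary(outcomes: list) -> tuple:
    """Many auto-scored tasks collapse into an accuracy and a rough 95% interval."""
    n = len(outcomes)
    acc = sum(outcomes) / n
    margin = 1.96 * sqrt(acc * (1 - acc) / n)  # normal-approximation interval
    return acc, margin


@dataclass
class OpenWorldRun:
    """A single high-complexity task: no score, just an annotated trace."""
    task: str
    steps: list = field(default_factory=list)          # agent actions, in order
    interventions: list = field(default_factory=list)  # every human assist, logged
    outcome_narrative: str = ""                         # reasoning behind the verdict


acc, ci = benchmark_summary([True] * 87 + [False] * 13)
print(f"benchmark: {acc:.0%} ± {ci:.1%}")  # a number you can put on a leaderboard

run = OpenWorldRun(
    task="publish an iOS app end to end",
    steps=["write code", "configure credentials", "submit for review"],
    interventions=["human re-entered forgotten credentials"],
    outcome_narrative="App approved; one error needed manual correction.",
)
print(run)  # evidence you have to read, not average
```

Nothing in the second record averages away; any comparison across runs has to be argued from the trace itself.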

The CRUX team identified several best practices for reducing noise in open-world evaluations. Evaluators should pre-specify how much human intervention is permissible and document each instance. Released logs should record agent actions in sufficient detail to allow independent analysis and reproduction attempts. Evaluation reports should explain the specific reasoning behind reported outcomes rather than offering only pass-fail judgments. These standards aim to improve the scientific weight of evaluations that inherently lack the sample sizes and automation possible with benchmarks.
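
As a rough illustration of how such practices might be operationalized, the sketch below (hypothetical names and structure, not a CRUX artifact) pre-declares a human-intervention budget in the log itself, records every agent action and human assist with a timestamp, and refuses to record assists beyond the declared allowance.

```python
# Minimal sketch of an evaluation log that enforces a pre-declared
# human-intervention budget; names and structure are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EvalLog:
    task: str
    max_interventions: int                     # declared before the run starts
    events: list = field(default_factory=list)

    def _stamp(self, actor: str, description: str) -> None:
        self.events.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "description": description,
        })

    def record_action(self, description: str) -> None:
        self._stamp("agent", description)

    def record_intervention(self, description: str) -> None:
        used = sum(1 for e in self.events if e["actor"] == "human")
        if used >= self.max_interventions:
            raise RuntimeError("intervention budget exceeded; amend the protocol, not the log")
        self._stamp("human", description)


log = EvalLog(task="publish an iOS app", max_interventions=1)
log.record_action("generated project code and privacy policy")
log.record_intervention("re-entered stored credentials the agent had lost")
print(len(log.events), "events logged")
```

Rejecting over-budget assists at logging time is one way to force the allowance to be stated up front rather than reconstructed after the fact.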

Looking forward, the collaboration plans to extend open-world evaluations to R&D automation, regulatory compliance scenarios, and other domains where benchmark tasks cannot capture real-world complexity. The project represents a shift in how the field thinks about measuring progress: not as a single curve of capability gain, but as a portfolio of evidence across multiple real-world contexts.

Sources
  1. AI Snake Oil — Narayanan & Kapoor, "Open-world evaluations for measuring frontier AI capabilities"

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.

Built by a person who cares about honest AI news.