New benchmark GPTNT reveals real-time collaboration gaps in multimodal agents
Benchmark built on 'Keep Talking and Nobody Explodes' shows state-of-the-art models fail to defuse bombs in real time despite human players succeeding.
1 source · cross-referenced
- GPTNT is a new benchmark for evaluating real-time collaboration between multimodal agents using the cooperative video game 'Keep Talking and Nobody Explodes'.
- The benchmark requires two agents to coordinate asynchronously and communicate in real time to defuse procedurally generated bomb puzzles under time pressure.
- None of the closed- or open-source models tested defused a single bomb in real time, a bar that human players clear.
- Critical weaknesses identified include state tracking, efficient action under time pressure, ambiguity handling, and error recovery.
- GPTNT is designed to isolate collaboration from reliance on memorized solutions and evolves with the game's modding community.
Researchers introduced GPTNT, a benchmark designed to evaluate real-time collaboration between multimodal agents using the cooperative video game 'Keep Talking and Nobody Explodes'. The benchmark requires two agents to coordinate asynchronously: one agent sees and manipulates the bomb but lacks defusal instructions, while the other has the instructions but cannot see or manipulate the bomb. Success demands effective, real-time communication and coordination under time pressure, as neither agent can succeed independently.
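The information asymmetry described above can be pictured as two restricted views of one shared state. The following is a minimal sketch in Python, not GPTNT's actual interface: the names `BombState`, `defuser_view`, and `expert_view`, and the fields they carry, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BombState:
    # Hypothetical fields; the real benchmark's state is not described in the source.
    modules: tuple[str, ...]   # puzzle modules visible on the bomb
    manual: dict[str, str]     # defusal instructions keyed by module name
    seconds_left: float        # remaining time on the countdown


def defuser_view(state: BombState) -> dict:
    """View for the agent that sees and manipulates the bomb: no instructions."""
    return {"modules": state.modules, "seconds_left": state.seconds_left}


def expert_view(state: BombState) -> dict:
    """View for the agent that holds the instructions: no sight of the bomb."""
    return {"manual": state.manual}


state = BombState(
    modules=("wires", "button"),
    manual={"wires": "placeholder instructions", "button": "placeholder instructions"},
    seconds_left=300.0,
)

# The two views share no keys, so neither agent holds enough to act alone.
shared_keys = set(defuser_view(state)) & set(expert_view(state))
```

Because the views are disjoint, anything one agent needs from the other has to travel through the communication channel, which is the behavior the benchmark is described as measuring.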
Unlike turn-based evaluations, GPTNT requires agents to act and communicate in real time, reflecting conditions such as time pressure, information asymmetry, and imperfect communication that are typically studied in isolation. The benchmark is built on the real game, enabling procedural generation of puzzles and access to a living modding community, which allows the benchmark to evolve alongside model improvements rather than becoming obsolete once solved.
In controlled experiments, none of the closed- or open-source models tested defused a single bomb in real time, a bar that human players routinely clear. The study traces the failures to weaknesses in state tracking, efficient action under time pressure, ambiguity handling, and error recovery.
GPTNT is explicitly designed to separate collaboration from reliance on memorized solutions. The instruction manual, the partner, or both can be withheld to isolate what a model derives in the moment from what it already knows. This design aims to measure genuine collaborative capability rather than rote recall or static performance.
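Withholding the manual, the partner, or both yields a two-by-two grid of evaluation conditions. The grid can be sketched as follows in Python; the names `Condition`, `ablation_conditions`, and `describe` are hypothetical and are not taken from the GPTNT release.

```python
from itertools import product
from typing import NamedTuple


class Condition(NamedTuple):
    manual_available: bool    # False means the instruction manual is withheld
    partner_available: bool   # False means the partner agent is withheld


def ablation_conditions() -> list[Condition]:
    """Enumerate every combination of manual and partner availability."""
    return [Condition(m, p) for m, p in product((True, False), repeat=2)]


def describe(condition: Condition) -> str:
    """Return a short label naming whatever is withheld in this condition."""
    withheld = [
        name
        for name, available in (
            ("manual", condition.manual_available),
            ("partner", condition.partner_available),
        )
        if not available
    ]
    return "full setup" if not withheld else "withheld: " + ", ".join(withheld)


labels = [describe(c) for c in ablation_conditions()]
```

Comparing a model's results across these four conditions is one way to read the design: a model that does no worse with the manual withheld is presumably drawing on what it already knows rather than on its partner.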
The researchers release GPTNT as a benchmark for collaborative performance that current evaluations leave unmeasured. Because it is built on the real game with procedurally generated puzzles, the benchmark can expand as models improve, which is intended to keep it relevant as a measure of progress in multimodal agent collaboration.