Evals · Jun 30, 2026

New benchmark GPTNT reveals real-time collaboration gaps in multimodal agents

Benchmark built on 'Keep Talking and Nobody Explodes' shows state-of-the-art models fail to defuse bombs in real time despite human players succeeding.

TL;DR
  • GPTNT is a new benchmark for evaluating real-time collaboration between multimodal agents using the cooperative video game 'Keep Talking and Nobody Explodes'.
  • The benchmark requires two agents to coordinate asynchronously and communicate in real time to defuse procedurally generated bomb puzzles under time pressure.
  • None of the closed- or open-source models tested defused a single bomb in real time, a bar that human players clear.
  • Critical weaknesses identified include state tracking, efficient action under time pressure, ambiguity handling, and error recovery.
  • GPTNT is designed to isolate collaboration from reliance on memorized solutions and evolves with the game's modding community.

Researchers have introduced GPTNT, a benchmark designed to evaluate real-time collaboration between multimodal agents using the cooperative video game 'Keep Talking and Nobody Explodes'. The benchmark requires two agents to coordinate asynchronously: one agent sees and manipulates the bomb but lacks the defusal instructions, while the other has the instructions but cannot see or manipulate the bomb. Because neither agent can succeed alone, success depends on effective real-time communication and coordination under time pressure.
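
The information asymmetry described above can be sketched in a few lines of Python. This is an illustration only, not the benchmark's actual interface: the role names, the `Observation` type, and the `observe` function are all hypothetical, chosen to show which agent receives which information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """What one agent is given (hypothetical structure, not GPTNT's API)."""
    bomb_view: Optional[str]  # description of the bomb, if visible to this agent
    manual: Optional[str]     # defusal instructions, if available to this agent
    can_act: bool             # whether this agent can manipulate the bomb


def observe(role: str, bomb_view: str, manual: str) -> Observation:
    """Split the full game state into the partial view each role receives."""
    if role == "defuser":
        # Sees and manipulates the bomb, but has no instructions.
        return Observation(bomb_view=bomb_view, manual=None, can_act=True)
    if role == "expert":
        # Holds the instructions, but cannot see or touch the bomb.
        return Observation(bomb_view=None, manual=manual, can_act=False)
    raise ValueError(f"unknown role: {role!r}")
```

Neither observation contains both the bomb and the manual, so any successful defusal has to pass through the communication channel between the two agents.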

Unlike turn-based evaluations, GPTNT requires agents to act and communicate in real time, combining time pressure, information asymmetry, and imperfect communication, conditions that are typically studied in isolation. Because the benchmark is built on the real game, puzzles are procedurally generated and new content can come from the game's living modding community, which allows the benchmark to evolve alongside model improvements rather than becoming obsolete once solved.

In controlled experiments, the researchers found that none of the closed- or open-source models tested defused a single bomb in real time, a bar that human players routinely clear. The study identifies state tracking, efficient action under time pressure, ambiguity handling, and error recovery as critical weaknesses of current systems.

GPTNT is explicitly designed to separate collaboration from reliance on memorized solutions. The instruction manual, the partner, or both can be withheld to isolate what a model derives in the moment from what it already knows. This design aims to measure genuine collaborative capability rather than rote recall or static performance.
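
The withholding design amounts to a small grid of conditions. A minimal sketch, assuming each of the two resources (manual and partner) is simply present or absent; the `Condition` type, the function names, and the labels are hypothetical and not taken from the paper:

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Condition:
    """One evaluation condition (hypothetical naming, not GPTNT's own)."""
    has_manual: bool
    has_partner: bool


def ablation_conditions() -> list[Condition]:
    """Enumerate every combination of manual and partner availability."""
    return [Condition(m, p) for m, p in product([True, False], repeat=2)]


def describe(c: Condition) -> str:
    """Give each condition a readable label (labels are illustrative)."""
    if c.has_manual and c.has_partner:
        return "full collaboration"
    if c.has_partner:
        return "manual withheld"
    if c.has_manual:
        return "partner withheld"
    return "manual and partner withheld"


for cond in ablation_conditions():
    print(describe(cond))
```

Comparing a model's results across these conditions is one way to separate what it works out with its partner in the moment from what it already knows about the game.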

The researchers are releasing GPTNT to measure collaborative performance, which they say current evaluations leave unmeasured. Because it builds on the real game and its procedural generation, the benchmark can adapt and expand as models improve, which is intended to keep it relevant for tracking progress in multimodal agent collaboration.

Sources
  1. arXiv cs.AI: GPTNT: Benchmarking Real-Time Collaboration Between Multimodal Agents on Keep Talking And Nobody Explodes
