Research · May 7, 2026

CreativityBench Benchmark Reveals Major Gaps in LLMs' Tool Repurposing Abilities

A new evaluation framework shows that state-of-the-art language models can identify plausible objects for repurposing but struggle to reason about affordances and physical mechanisms needed for creative problem-solving.

TL;DR

Researchers introduced CreativityBench, a benchmark with 14K tasks grounded in a 4K-entity affordance knowledge base containing 150K+ annotations to evaluate creative tool use in LLMs.
Testing across 10 advanced models revealed that while LLMs can often select appropriate objects, they consistently fail to identify correct parts, their affordances, and underlying physical mechanisms.
Model scaling shows rapidly diminishing returns for creative reasoning; strong general reasoning performance does not reliably translate to affordance discovery.
Common inference-time techniques like Chain-of-Thought provide only marginal improvements in creative tool use tasks, indicating this capability represents a distinct cognitive challenge.

Researchers at multiple institutions have created CreativityBench, an evaluation framework that measures how well large language models solve problems through unconventional tool use. Rather than testing whether models know the standard applications of objects, the benchmark assesses whether they can reason about the affordances of available items, meaning their inherent properties and potential uses, to solve novel problems under constraints.

The benchmark's foundation is a large-scale affordance knowledge base comprising 4,000 entities with over 150,000 affordance annotations. These annotations explicitly connect objects, their constituent parts, physical attributes, and actionable uses. Using this structured resource, the team generated 14,000 grounded tasks that require models to identify non-obvious but physically plausible solutions. Importantly, these tasks demand reasoning about specific parts and mechanisms, not just general object selection.
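To make the object, part, attribute, and use linkage concrete, here is a minimal Python sketch of how such a knowledge base could be represented and queried. The class names, field names, the `find_candidates` helper, and the two sample entries are all hypothetical illustrations; the article does not describe the benchmark's actual schema or data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Affordance:
    action: str     # actionable use, e.g. "pry" (illustrative value)
    mechanism: str  # physical mechanism that makes the use work


@dataclass
class Part:
    name: str
    attributes: List[str] = field(default_factory=list)
    affordances: List[Affordance] = field(default_factory=list)


@dataclass
class Entity:
    name: str
    parts: List[Part] = field(default_factory=list)


def find_candidates(kb: List[Entity], action: str) -> List[Tuple[str, str]]:
    """Return (entity, part) name pairs whose part affords the given action."""
    return [
        (entity.name, part.name)
        for entity in kb
        for part in entity.parts
        if any(a.action == action for a in part.affordances)
    ]


# Invented sample entries, not taken from the benchmark.
kb = [
    Entity("spoon", [
        Part("handle", ["rigid", "flat"], [Affordance("pry", "lever")]),
        Part("bowl", ["concave"], [Affordance("scoop", "containment")]),
    ]),
    Entity("coin", [
        Part("edge", ["thin", "rigid"], [Affordance("turn screw", "torque transfer")]),
    ]),
]

print(find_candidates(kb, "pry"))
```

Attaching affordances to parts rather than to whole objects mirrors the article's point that tasks require reasoning about specific parts and mechanisms, not just object selection.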

Evaluation across 10 state-of-the-art models—both proprietary and open-source—exposed consistent weaknesses in creative reasoning. Models frequently succeeded at the first step: selecting a plausible object for a given constraint-based problem. However, performance dropped substantially when tasks required identifying which specific parts of an object were relevant, reasoning about their affordances, or explaining the physical mechanism underlying a creative solution. This pattern held across different model sizes and architectures.
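The gap between object selection and the finer-grained steps can be pictured as per-level accuracy. The sketch below is an assumption-laden illustration, not the benchmark's scoring method: the level names, the dictionary layout, and the use of case-insensitive exact match are all hypothetical, and the sample answers are invented.

```python
from typing import Dict, List

# Hypothetical level names, following the stages the article lists.
LEVELS = ("object", "part", "affordance", "mechanism")


def level_accuracy(predictions: List[Dict[str, str]],
                   gold: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of tasks answered correctly at each level (exact match, assumed)."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    correct = {level: 0 for level in LEVELS}
    for pred, ref in zip(predictions, gold):
        for level in LEVELS:
            # Missing predictions count as wrong; comparison ignores case and padding.
            if pred.get(level, "").strip().lower() == ref[level].strip().lower():
                correct[level] += 1
    return {level: correct[level] / len(gold) for level in LEVELS}


# Invented examples: the object is right both times, finer levels less often.
gold = [
    {"object": "spoon", "part": "handle", "affordance": "pry", "mechanism": "lever"},
    {"object": "coin", "part": "edge", "affordance": "turn screw", "mechanism": "torque transfer"},
]
predictions = [
    {"object": "Spoon", "part": "handle", "affordance": "pry", "mechanism": "friction"},
    {"object": "coin", "part": "face", "affordance": "scrape", "mechanism": "abrasion"},
]

scores = level_accuracy(predictions, gold)
print(scores)
```

Reporting each level separately makes the reported pattern visible: a model can score well on `object` while its `part`, `affordance`, and `mechanism` scores fall away.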

The researchers found that gains from scaling model size diminish quickly on creative affordance discovery tasks. More significantly, the strong general reasoning these models demonstrate on standardized benchmarks does not reliably transfer to creative tool repurposing. Even inference-time prompting strategies such as Chain-of-Thought yielded only marginal gains, suggesting that creative affordance reasoning is a capability distinct from what conventional benchmarks test.

Sources

1. arXiv cs.AI, "CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing"
