CreativityBench Benchmark Reveals Major Gaps in LLMs' Tool Repurposing Abilities
A new evaluation framework shows that state-of-the-art language models can identify plausible objects for repurposing but struggle to reason about the affordances and physical mechanisms that creative problem-solving requires.
- Researchers introduced CreativityBench, a benchmark with 14K tasks grounded in a 4K-entity affordance knowledge base containing 150K+ annotations to evaluate creative tool use in LLMs.
- Testing across 10 advanced models revealed that while LLMs can often select appropriate objects, they consistently fail to identify correct parts, their affordances, and underlying physical mechanisms.
- Model scaling shows rapidly diminishing returns for creative reasoning; strong general reasoning performance does not reliably translate to affordance discovery.
- Common inference-time techniques like Chain-of-Thought provide only marginal improvements on creative tool use tasks, suggesting that this capability is a distinct challenge not captured by conventional benchmarks.
Researchers at multiple institutions have created CreativityBench, an evaluation framework designed to measure how well contemporary large language models can solve problems creatively through unconventional tool use. Rather than testing whether models know the standard applications of objects, the benchmark assesses their capacity to reason about the affordances of available items (their inherent properties and potential uses) to solve novel problems under constraints.
The benchmark's foundation is a large-scale affordance knowledge base comprising 4,000 entities with over 150,000 affordance annotations. These annotations explicitly connect objects, their constituent parts, physical attributes, and actionable uses. Using this structured resource, the team generated 14,000 grounded tasks that require models to identify non-obvious but physically plausible solutions. Importantly, these tasks demand reasoning about specific parts and mechanisms, not just general object selection.
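To make the object, part, attribute, and use linkage concrete, here is a minimal Python sketch of how one such annotation might be represented and queried. The class names, field names, and the spoon example are illustrative assumptions, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Affordance:
    # Hypothetical fields: an actionable use and the physical mechanism behind it.
    action: str
    mechanism: str


@dataclass
class Part:
    name: str
    attributes: list[str] = field(default_factory=list)
    affordances: list[Affordance] = field(default_factory=list)


@dataclass
class Entity:
    name: str
    parts: list[Part] = field(default_factory=list)


def find_parts(entities, action):
    """Return (entity, part, mechanism) triples for every part that affords `action`."""
    return [
        (entity.name, part.name, aff.mechanism)
        for entity in entities
        for part in entity.parts
        for aff in part.affordances
        if aff.action == action
    ]


# Illustrative entry only; not taken from the CreativityBench knowledge base.
KB = [
    Entity(
        name="spoon",
        parts=[
            Part("handle", ["rigid", "thin"], [Affordance("pry", "lever")]),
            Part("bowl", ["concave"], [Affordance("scoop", "containment")]),
        ],
    )
]

print(find_parts(KB, "pry"))
```

Keying affordances to parts rather than to whole objects mirrors the article's point that tasks demand reasoning about specific parts and mechanisms, not just object selection.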
Evaluation across 10 state-of-the-art models—both proprietary and open-source—exposed consistent weaknesses in creative reasoning. Models frequently succeeded at the first step: selecting a plausible object for a given constraint-based problem. However, performance dropped substantially when tasks required identifying which specific parts of an object were relevant, reasoning about their affordances, or explaining the physical mechanism underlying a creative solution. This pattern held across different model sizes and architectures.
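The gap between object selection and the finer-grained steps can be pictured as per-level accuracy. The sketch below is an assumption for illustration: the source does not describe the benchmark's actual metrics, and the level names, the exact-match comparison, and the sample data are all hypothetical.

```python
# Hypothetical levels, ordered from coarse (object) to fine (mechanism).
LEVELS = ("object", "part", "affordance", "mechanism")


def level_accuracy(predictions, gold):
    """Fraction of tasks answered correctly at each level (case-insensitive exact match)."""
    if not gold:
        return {level: 0.0 for level in LEVELS}
    correct = {level: 0 for level in LEVELS}
    for pred, ref in zip(predictions, gold):
        for level in LEVELS:
            if pred.get(level, "").strip().lower() == ref[level].strip().lower():
                correct[level] += 1
    return {level: correct[level] / len(gold) for level in LEVELS}


# Made-up data: the second prediction picks the right object but nothing else.
gold = [
    {"object": "spoon", "part": "handle", "affordance": "pry", "mechanism": "lever"},
    {"object": "belt", "part": "buckle", "affordance": "hook", "mechanism": "latching"},
]
predictions = [
    {"object": "Spoon", "part": "handle", "affordance": "pry", "mechanism": "lever"},
    {"object": "belt", "part": "strap", "affordance": "tie", "mechanism": "friction"},
]

print(level_accuracy(predictions, gold))
```

Under this toy scoring, a model that reliably names a plausible object but misses the relevant part would show high object accuracy and much lower scores at the remaining levels, which is the shape of the result the article describes.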
The researchers found that performance improvements from scaling model size reach diminishing returns quickly on creative affordance discovery tasks. More significantly, strong general reasoning capability—which these models demonstrate on standardized benchmarks—does not reliably transfer to creative tool repurposing. Even inference-time prompting strategies like Chain-of-Thought reasoning yielded only marginal gains, suggesting that creative affordance reasoning engages cognitive processes distinct from those tested by conventional benchmarks.