CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing
Researchers introduce CreativityBench, a benchmark with 4K entities and 150K+ affordance annotations to evaluate how well large language models can creatively repurpose tools by reasoning about their properties rather than their canonical uses. Evaluations across 10 state-of-the-art LLMs reveal significant limitations: models struggle to identify the correct parts, affordances, and physical mechanisms needed for non-obvious solutions, and gains from model scaling or reasoning strategies such as Chain-of-Thought are limited.
CreativityBench addresses a critical blind spot in large language model evaluation. While LLMs demonstrate strong performance on reasoning benchmarks and environment-interaction tasks, their capacity for creative problem-solving—specifically the ability to repurpose objects in novel ways—remains largely unexamined. This gap matters because real-world intelligence requires more than pattern matching; it demands understanding how objects function beyond their intended purposes and applying that knowledge under constraints.
The research builds on growing recognition that current evaluation frameworks miss important dimensions of reasoning. Traditional benchmarks focus on established problem-solving paradigms, but creative tool use requires models to map object properties to novel applications. The affordance-based approach mirrors how humans naturally think about objects: understanding their parts, material properties, and potential uses enables lateral thinking.
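To make the affordance framing concrete, here is a minimal sketch of how an affordance-annotated object might be represented. The class names, fields, and example values are illustrative assumptions for exposition, not CreativityBench's actual schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical representation of an affordance-annotated object.
# Field names and values are assumptions, not the benchmark's released format.
@dataclass
class PartAffordance:
    part: str                   # a functional part of the object
    properties: List[str]       # physical/material properties of that part
    affordances: List[str]      # actions the part makes possible

@dataclass
class ObjectEntry:
    name: str
    canonical_use: str
    parts: List[PartAffordance]

hammer = ObjectEntry(
    name="claw hammer",
    canonical_use="driving nails",
    parts=[
        PartAffordance(
            part="flat metal head",
            properties=["rigid", "heavy", "flat surface"],
            affordances=["strike", "press", "weigh down"],
        ),
        PartAffordance(
            part="curved claw",
            properties=["rigid", "wedge-shaped"],
            affordances=["pry", "pull", "hook"],
        ),
    ],
)

# A creative-use task would then ask which part and affordance solve a
# non-canonical problem, e.g. "open a paint can" -> curved claw / pry.
```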
The findings reveal a sobering reality for AI development. Across ten leading models, performance plateaus despite increased model scale, suggesting that size alone does not unlock creative reasoning. This directly challenges the scaling hypothesis that larger models simply learn everything better. The disconnect between general reasoning ability and creative affordance discovery indicates these may require fundamentally different training approaches or architectural innovations.
For the AI research community, CreativityBench provides a new evaluation axis for assessing model capabilities. The benchmark's structured approach—grounding tasks in physical plausibility rather than arbitrary constraints—offers a principled way to measure progress. Future work addressing these limitations could drive meaningful advances in agent design, particularly for robotics and planning systems that must navigate physical environments creatively.
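One plausible way such an evaluation could separate where models succeed from where they fail is to score each response along distinct reasoning dimensions (object, part, affordance, mechanism). The sketch below is an assumption about how this kind of scoring could work, not CreativityBench's released evaluation code; the function names and dimension labels are illustrative.

```python
from typing import Dict, List

# Illustrative scoring over separate reasoning dimensions. This is a sketch
# under assumed conventions, not the benchmark's actual implementation.
def score_response(predicted: Dict[str, str], gold: Dict[str, str]) -> Dict[str, float]:
    dimensions = ["object", "part", "affordance", "mechanism"]
    return {dim: float(predicted.get(dim) == gold.get(dim)) for dim in dimensions}

def aggregate(scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-dimension accuracy over all tasks."""
    dims = scores[0].keys()
    return {d: sum(s[d] for s in scores) / len(scores) for d in dims}

# Example: a model names a plausible object but misses the part and mechanism,
# mirroring the failure pattern the benchmark reports.
gold = {"object": "claw hammer", "part": "curved claw",
        "affordance": "pry", "mechanism": "lever action"}
pred = {"object": "claw hammer", "part": "flat metal head",
        "affordance": "strike", "mechanism": "impact force"}

print(score_response(pred, gold))
# {'object': 1.0, 'part': 0.0, 'affordance': 0.0, 'mechanism': 0.0}
```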
- State-of-the-art LLMs can identify plausible objects for creative tasks but fail to determine the correct parts, affordances, and underlying mechanisms.
- Model scaling does not reliably improve creative tool use, suggesting size alone cannot address this reasoning gap.
- Standard inference techniques like Chain-of-Thought provide limited benefits for affordance-based creative problem-solving.
- CreativityBench's 14K grounded tasks and 150K+ affordance annotations establish a new evaluation framework for measuring creative reasoning.
- Creative tool use represents an underexplored but critical dimension of intelligence for developing robust planning and reasoning agents.