CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing
Researchers introduce CreativityBench, a benchmark with 4K entities and 150K+ affordance annotations to evaluate how well large language models can creatively repurpose tools by reasoning about their properties rather than their canonical uses. Evaluations across 10 state-of-the-art LLMs reveal significant limitations: models struggle to identify the correct parts, affordances, and physical mechanisms needed for non-obvious solutions, and gains from model scaling or reasoning strategies such as Chain-of-Thought are limited.
CreativityBench addresses a critical blind spot in large language model evaluation. While LLMs demonstrate strong performance on reasoning benchmarks and environment-interaction tasks, their capacity for creative problem-solving—specifically the ability to repurpose objects in novel ways—remains largely unexamined. This gap matters because real-world intelligence requires more than pattern matching; it demands understanding how objects function beyond their intended purposes and applying that knowledge under constraints.
The research builds on growing recognition that current evaluation frameworks miss important dimensions of reasoning. Traditional benchmarks focus on established problem-solving paradigms, but creative tool use requires models to map object properties to novel applications. The affordance-based approach mirrors how humans naturally think about objects: understanding their parts, material properties, and potential uses enables lateral thinking.
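To make the affordance framing concrete, here is a minimal sketch of how an affordance-annotated object might be represented. The class names, fields, and example values are illustrative assumptions for exposition, not CreativityBench's actual schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical representation of an affordance-annotated object.
# Field names and values are assumptions, not the benchmark's released format.
@dataclass
class PartAffordance:
    part: str                   # a functional part of the object
    properties: List[str]       # physical/material properties of that part
    affordances: List[str]      # actions the part makes possible

@dataclass
class ObjectEntry:
    name: str
    canonical_use: str
    parts: List[PartAffordance]

hammer = ObjectEntry(
    name="claw hammer",
    canonical_use="driving nails",
    parts=[
        PartAffordance(
            part="flat metal head",
            properties=["rigid", "heavy", "flat surface"],
            affordances=["strike", "press", "weigh down"],
        ),
        PartAffordance(
            part="curved claw",
            properties=["rigid", "wedge-shaped"],
            affordances=["pry", "pull", "hook"],
        ),
    ],
)

# A creative-use task would then ask which part and affordance solve a
# non-canonical problem, e.g. "open a paint can" -> curved claw / pry.
```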
The findings reveal a sobering reality for AI development. Across ten leading models, performance plateaus despite increased model scale, suggesting that size alone does not unlock creative reasoning. This directly challenges the scaling hypothesis that larger models simply learn everything better. The disconnect between general reasoning ability and creative affordance discovery indicates these may require fundamentally different training approaches or architectural innovations.
For the AI research community, CreativityBench provides a new evaluation axis for assessing model capabilities. The benchmark's structured approach—grounding tasks in physical plausibility rather than arbitrary constraints—offers a principled way to measure progress. Future work addressing these limitations could drive meaningful advances in agent design, particularly for robotics and planning systems that must navigate physical environments creatively.
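One plausible way such an evaluation could separate where models succeed from where they fail is to score each response along distinct reasoning dimensions (object, part, affordance, mechanism). The sketch below is an assumption about how this kind of scoring could work, not CreativityBench's released evaluation code; the function names and dimension labels are illustrative.

```python
from typing import Dict, List

# Illustrative scoring over separate reasoning dimensions. This is a sketch
# under assumed conventions, not the benchmark's actual implementation.
def score_response(predicted: Dict[str, str], gold: Dict[str, str]) -> Dict[str, float]:
    dimensions = ["object", "part", "affordance", "mechanism"]
    return {dim: float(predicted.get(dim) == gold.get(dim)) for dim in dimensions}

def aggregate(scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-dimension accuracy over all tasks."""
    dims = scores[0].keys()
    return {d: sum(s[d] for s in scores) / len(scores) for d in dims}

# Example: a model names a plausible object but misses the part and mechanism,
# mirroring the failure pattern the benchmark reports.
gold = {"object": "claw hammer", "part": "curved claw",
        "affordance": "pry", "mechanism": "lever action"}
pred = {"object": "claw hammer", "part": "flat metal head",
        "affordance": "strike", "mechanism": "impact force"}

print(score_response(pred, gold))
# {'object': 1.0, 'part': 0.0, 'affordance': 0.0, 'mechanism': 0.0}
```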
- State-of-the-art LLMs can identify plausible objects for creative tasks but fail to determine the correct parts, affordances, and underlying mechanisms.
- Model scaling does not reliably improve creative tool use, suggesting size alone cannot address this reasoning gap.
- Standard inference techniques like Chain-of-Thought provide limited benefits for affordance-based creative problem-solving.
- CreativityBench's 14K grounded tasks and 150K+ affordance annotations establish a new evaluation framework for measuring creative reasoning.
- Creative tool use represents an underexplored but critical dimension of intelligence for developing robust planning and reasoning agents.