Advancing Creative Physical Intelligence in Large Multimodal Models
Researchers introduce MM-CreativityBench, a benchmark testing whether large multimodal models can solve creative physical problems by identifying non-obvious tool uses in constrained environments. Current LMMs struggle not from lack of generation capability but from poor visual grounding, hallucinating attributes and overlooking relevant entities; the team proposes affordance-grounded alignment using preference learning to improve performance.
This research addresses a gap in how large multimodal models are evaluated: most existing benchmarks test pattern recognition and question answering. MM-CreativityBench instead requires models to solve problems creatively in physically constrained scenarios, measuring whether a system can identify how objects might be repurposed in non-obvious ways, a distinctly human form of intelligence. The distinction matters because it exposes limitations in current LMMs that conventional benchmarks do not surface.
The research identifies a specific failure mode: models generate plausible-sounding solutions without grounding them in visual evidence, leading to hallucinated attributes and overlooked relevant entities. This represents a quality-of-reasoning problem rather than a capability ceiling. By introducing affordance-grounded alignment through Direct Preference Optimization, the team demonstrates that models can be trained to maintain visual grounding while exploring solutions iteratively, incorporating structured affordance knowledge to guide multi-turn planning.
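To make the preference-learning step concrete, here is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It is not the team's implementation: the function names, the example log-probabilities, and the default `beta` are illustrative assumptions. In this setting, the "chosen" response would presumably be a visually grounded solution and the "rejected" one a solution with hallucinated attributes or missed entities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state one
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy shifts probability toward the chosen response.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Hypothetical log-probabilities, for illustration only.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)  # policy identical to reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)  # policy now prefers the chosen response
```

When the policy matches the reference, the loss equals log 2. It drops below that value as the policy assigns relatively more probability to the grounded response.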
For AI development, this work signals that next-generation multimodal systems need stronger mechanisms for visual consistency and grounded reasoning. The benchmark itself becomes a tool for tracking progress on creative intelligence. The findings suggest that improving reliability and reducing hallucination requires explicit preference-based training rather than scale alone. Developers working on embodied AI, robotics, or autonomous systems may benefit from integrating similar grounding mechanisms. The research contributes to a foundation for more trustworthy, interpretable multimodal systems that solve real-world creative problems rather than pattern-match against their training data.
- Current LMMs fail at creative physical problem-solving primarily due to poor visual grounding and hallucination, not lack of generative capability.
- MM-CreativityBench introduces the first benchmark specifically measuring creative tool-use in visually rich, physically constrained environments.
- Affordance-grounded alignment using preference learning reduces hallucination errors and improves entity and part selection in LMMs.
- The research demonstrates that structured affordance knowledge and multi-turn planning significantly enhance model performance on creative reasoning tasks.
- This work indicates future LMM improvements require explicit grounding mechanisms rather than purely scaling model size.
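
The two grounding errors described above, hallucinated entities and overlooked relevant ones, can be illustrated with a simple set comparison. This is a hypothetical sketch, not the benchmark's scoring method: the `GroundingReport` type, the `check_grounding` function, and the example scene are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundingReport:
    hallucinated: frozenset  # entities mentioned but absent from the scene
    overlooked: frozenset    # relevant entities the response never used


def _normalize(names) -> frozenset:
    """Lowercase and trim entity names so comparisons ignore case and spacing."""
    return frozenset(name.strip().lower() for name in names)


def check_grounding(scene_entities, relevant_entities, mentioned_entities) -> GroundingReport:
    """Compare entities a response mentions against what is visible in the scene.

    All three inputs are assumed to be iterables of entity names; how they
    would be extracted from an image or a model response is out of scope here.
    """
    scene = _normalize(scene_entities)
    relevant = _normalize(relevant_entities)
    mentioned = _normalize(mentioned_entities)
    return GroundingReport(
        hallucinated=mentioned - scene,
        overlooked=relevant - mentioned,
    )


# Hypothetical scene: the response invents a magnet and misses the rubber band.
report = check_grounding(
    scene_entities=["mug", "ruler", "rubber band", "book"],
    relevant_entities=["ruler", "rubber band"],
    mentioned_entities=["Ruler", "magnet"],
)
```

A check of this kind could in principle be used to rank candidate responses when constructing preference pairs, with the response showing fewer grounding errors treated as the preferred one.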