AINeutralarXiv – CS AI · 6h ago6/10
🧠
BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs
Researchers introduced BilliardPhys-Bench, a benchmark that tests multimodal AI models' ability to predict physical interactions in billiards simulations. The evaluation reveals that leading LLMs from OpenAI, Anthropic, Google, and Alibaba struggle with dynamic physics reasoning, exhibiting systematic failures including a 'stasis bias' where models default to predicting no interaction when physical outcomes become difficult to infer.
🧠 Claude🧠 Gemini