Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use
Researchers introduced PhysTool-Bench, a benchmark testing how well multimodal large language models (MLLMs) can recognize and use physical tools in real-world scenarios. Testing 13 leading models revealed significant limitations: even the best performer (Gemini-3.1-Pro) identified only 58.7% of tools in scenes and completed just 21% of end-to-end tasks, exposing critical gaps in perception and functional reasoning for embodied AI applications.
The research addresses a fundamental gap in MLLM evaluation by examining physical tool use—a capability essential for robots and embodied AI systems operating in real environments. While MLLMs have demonstrated proficiency with digital APIs and visual understanding tasks, their ability to recognize and plan with physical tools remained largely unmeasured until now. PhysTool-Bench's comprehensive dataset of 2,510 queries across 2,678 tools spanning manufacturing, electrical work, agriculture, and healthcare provides the first systematic assessment of this capability.
The benchmark reveals a two-tier performance problem. First, models struggle with basic perception: identifying all tools present in realistic, cluttered scenes. Second, and more concerning, they falter at the planning stage. The best model's end-to-end completion rate of 21% sits well below its 58.7% tool recognition rate, suggesting that models lack the functional commonsense needed to map perceived tools to task requirements. This distinction matters because it pinpoints where development effort should go: improving scene understanding alone won't solve embodied AI challenges without corresponding advances in semantic reasoning about tool functionality.
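To make the two tiers concrete, here is a minimal Python sketch of how a recognition score and an end-to-end score could be computed separately. It is illustrative only: the `Episode` fields, both function names, and the sample data are assumptions for this example, not PhysTool-Bench's actual schema or metric definitions, which the summary above does not specify.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One hypothetical benchmark query; all field names are illustrative."""
    tools_in_scene: set[str]   # ground-truth tools visible in the image
    tools_predicted: set[str]  # tools the model reports seeing
    required_tools: set[str]   # tools the task actually needs
    plan_correct: bool         # whether the model's tool-use plan was judged correct


def recognition_rate(episodes: list[Episode]) -> float:
    """Share of ground-truth tools the model identified, pooled over episodes."""
    found = sum(len(e.tools_in_scene & e.tools_predicted) for e in episodes)
    total = sum(len(e.tools_in_scene) for e in episodes)
    return found / total if total else 0.0


def end_to_end_rate(episodes: list[Episode]) -> float:
    """Share of episodes where every required tool was seen AND the plan was correct."""
    if not episodes:
        return 0.0
    solved = sum(
        1 for e in episodes
        if e.required_tools <= e.tools_predicted and e.plan_correct
    )
    return solved / len(episodes)


# Made-up sample data: one success, one perception failure, one planning failure.
episodes = [
    Episode({"hammer", "wrench", "pliers"}, {"hammer", "wrench"}, {"hammer"}, True),
    Episode({"drill", "saw"}, {"drill"}, {"saw"}, True),
    Episode({"scalpel", "forceps"}, {"scalpel", "forceps"}, {"scalpel"}, False),
]

print(f"recognition: {recognition_rate(episodes):.3f}")
print(f"end-to-end:  {end_to_end_rate(episodes):.3f}")
```

Under these assumed definitions, an episode can fail end to end either because a required tool was never perceived or because the plan was wrong despite correct perception, which is why the end-to-end score can fall well below the recognition score.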
These findings carry substantial implications for the embodied AI ecosystem. Companies building robotic systems and autonomous agents around MLLM "brains" face a capability ceiling with current models. A best-case tool recognition rate of 58.7% and an end-to-end completion rate of 21% indicate that these systems cannot yet operate reliably in unstructured real-world environments. For investors and developers, the benchmark establishes clear performance targets and shows that next-generation models must pair stronger functional reasoning with better visual perception. The research suggests that practical embodied AI at scale may require architectural innovations beyond current MLLM approaches.
- Even the best-performing model misses roughly 41% of physical tools in realistic scenes (58.7% identified), indicating fundamental perception limitations.
- The best model's end-to-end task completion (21%) falls far below its tool recognition rate (58.7%), revealing a critical gap in functional commonsense reasoning.
- PhysTool-Bench provides the first comprehensive benchmark for evaluating MLLM physical tool use across diverse domains.
- Current MLLM capabilities are insufficient for reliable autonomous robot operation in unstructured real-world environments.
- Future MLLM development must address visual perception and semantic reasoning about tool functionality together.