Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents
Researchers discover that visual reasoning agents exhibit a 'tool-use collapse' phenomenon where models progressively abandon external visual tools while maintaining or improving task accuracy. By introducing entropy regularization to encourage diverse exploration rather than optimizing tool frequency, the team achieves superior performance on complex tasks like 3D spatial reasoning and medical visual question answering, suggesting diversity matters more than tool usage frequency.
This research challenges conventional assumptions about how AI agents should leverage external tools for visual reasoning tasks. The core finding, that models naturally reduce tool reliance without losing performance, contradicts the intuition that more tool use should correlate with better outcomes. The researchers demonstrate this tool-use collapse phenomenon across multiple domains, from 3D spatial reasoning to medical imaging analysis, indicating a systematic pattern rather than an isolated anomaly.
The asymmetry the team identifies is particularly revealing: completely removing tools degrades performance substantially, yet pushing models to use tools more frequently yields minimal gains while reducing solution diversity. This suggests that rigid incentives for tool use may actually constrain the reasoning space. The breakthrough comes through entropy regularization, which shifts the focus from tool frequency to exploration diversity and lets models discover varied reasoning pathways even as explicit tool invocation decreases.
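To make the idea concrete, here is a minimal sketch of an entropy bonus added to a policy-gradient-style loss. It is an illustration under stated assumptions, not the researchers' implementation: the function names, the coefficient `beta`, and the use of a single scalar `advantage` per rollout are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(step_probs, chosen, advantage, beta=0.01):
    """Policy-gradient-style loss with an entropy bonus (illustrative only).

    step_probs: one action distribution per step of the rollout
    chosen:     index of the action actually taken at each step
    advantage:  scalar advantage assigned to the whole rollout (assumption)
    beta:       hypothetical weight of the entropy bonus
    """
    # Log-probability of the sampled trajectory under the policy.
    log_prob = sum(math.log(p[a]) for p, a in zip(step_probs, chosen))
    # Average per-step entropy: higher means broader exploration.
    mean_entropy = sum(entropy(p) for p in step_probs) / len(step_probs)
    # Subtracting the bonus lowers the loss for more exploratory policies.
    return -advantage * log_prob - beta * mean_entropy
```

Note that the bonus rewards spread over all actions, whether or not they are tool calls. Nothing in the loss counts tool invocations, which is the contrast with frequency-based incentives that the article describes.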
For the AI development community, these findings reframe how external tools should be incorporated into agent training pipelines. Instead of treating tools as mandatory components whose use should be maximized, the researchers frame them as scaffolding: exploratory resources that agents should access strategically, not habitually. This has direct implications for building more efficient and capable visual reasoning systems.
The research suggests future work should prioritize rollout diversity metrics over tool-usage metrics during training. As multimodal AI systems become increasingly complex, understanding when and how agents should delegate to specialized tools versus relying on internal reasoning will be crucial for developing scalable, robust systems capable of genuine reasoning rather than surface-level tool invocation.
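The difference between the two kinds of metric can be sketched as follows. This is a hypothetical illustration: the summary does not say how the researchers measure rollout diversity, so the entropy-over-distinct-trajectories definition, the `tool:` step prefix, and both function names are assumptions.

```python
import math
from collections import Counter


def tool_usage_rate(rollouts):
    """Fraction of rollouts that invoke at least one tool.

    A step counts as a tool call if it starts with "tool:" (assumed format).
    """
    used = sum(1 for r in rollouts if any(s.startswith("tool:") for s in r))
    return used / len(rollouts)


def rollout_diversity(rollouts):
    """Entropy (nats) of the empirical distribution over distinct trajectories."""
    counts = Counter(tuple(r) for r in rollouts)
    n = len(rollouts)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Four identical tool-heavy rollouts: maximal tool use, no diversity.
uniform_group = [["tool:crop", "answer:A"]] * 4

# Four distinct rollouts: half use tools, yet every trajectory differs.
varied_group = [
    ["tool:crop", "answer:A"],
    ["think", "answer:A"],
    ["think", "answer:B"],
    ["tool:depth", "answer:B"],
]

for name, group in [("uniform", uniform_group), ("varied", varied_group)]:
    print(name, tool_usage_rate(group), rollout_diversity(group))
```

In this toy setup the first group scores highest on tool usage and lowest on diversity, while the second group does the opposite, so the two metrics can pull in different directions.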
- Models naturally reduce external tool use while maintaining task performance, suggesting tool reliance is not proportional to reasoning quality.
- Entropy regularization encouraging diverse exploration outperforms both vanilla training and explicit tool-use incentivization strategies.
- Tool-use collapse appears consistent across multiple domains including 3D spatial reasoning and medical visual question answering.
- Diversity in rollout exploration matters more than tool frequency for achieving superior reasoning performance in visual agents.
- Treating tools as scaffolding for broader exploration rather than mandatory components improves agent generalization and reasoning capability.