AI · Neutral · arXiv – CS AI · 3h ago · 6/10
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
Researchers introduce ROVER, a lightweight plugin that improves how multimodal large language models reason across multiple images by routing visual evidence to specific objects. The approach achieves significant performance improvements on grounded reasoning benchmarks while reducing computational overhead compared to existing methods.