
Training Multi-Image Vision Agents via End2End Reinforcement Learning

arXiv – CS AI | Chengqi Dong, Chuhuai Yue, Hang He, Rongge Mao, Fenghe Tang, S Kevin Zhou, Zekun Xu, Xiaohan Wang, Jiajun Chai, Guojun Yin | April 6, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce IMAgent, an open-source visual AI agent trained end-to-end with reinforcement learning to handle multi-image reasoning tasks. The system addresses a limitation of current VLM-based agents, which process only single images, by adding specialized tools for visual reflection and verification that keep the model's attention on image content throughout inference.

Key Takeaways

→ IMAgent is the first open-source visual agent trained end-to-end with reinforcement learning for multi-image reasoning tasks.
→ The system introduces visual reflection and verification tools that keep VLMs from gradually neglecting visual inputs during inference.
→ For the first time, the research shows from an attention perspective how tool usage improves agent performance.
→ IMAgent achieves state-of-the-art performance on both single-image and multi-image benchmarks without requiring costly supervised fine-tuning data.
→ A new, challenging multi-image QA dataset was created using a multi-agent system to fill existing data gaps.
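
To make the second takeaway concrete, here is a toy Python sketch of why a reflection-style tool might help. It is not IMAgent's implementation: `Context`, `reflect_tool`, and `run_episode` are hypothetical names, the fixed-interval tool call stands in for a learned policy, and "steps since the last image entry" is only a crude proxy for how far visual input has drifted from the model's recent context.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Running context of a toy agent: an ordered list of (kind, payload) entries."""
    entries: list = field(default_factory=list)

    def add(self, kind, payload):
        self.entries.append((kind, payload))

    def steps_since_last_image(self):
        """Number of entries appended after the most recent image entry."""
        for offset, (kind, _) in enumerate(reversed(self.entries)):
            if kind == "image":
                return offset
        return len(self.entries)


def reflect_tool(ctx, images, index):
    """Hypothetical 'visual reflection' tool: re-insert image `index` into the context."""
    ctx.add("image", images[index])


def run_episode(images, n_steps, reflect_every):
    """Toy rollout: emit one text step at a time, optionally calling the tool.

    reflect_every=0 disables the tool (assumption: a real agent would decide
    when to call it via its learned policy, not a fixed schedule).
    Returns the largest gap observed between a text step and the last image.
    """
    ctx = Context()
    for img in images:
        ctx.add("image", img)
    max_gap = 0
    for step in range(1, n_steps + 1):
        ctx.add("text", f"thought {step}")
        max_gap = max(max_gap, ctx.steps_since_last_image())
        if reflect_every and step % reflect_every == 0:
            reflect_tool(ctx, images, index=step % len(images))
    return max_gap
```

Without the tool, the gap between the latest reasoning step and the last image grows with every step; with periodic tool calls it stays bounded. This only illustrates the intuition behind the summary's claim, under the assumptions stated above.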

Mentioned in AI

Companies

OpenAI

Models

o1 (OpenAI)

o3 (OpenAI)