AIBullish · arXiv – CS AI · 9h ago · 7/10
🧠
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
Researchers introduce GazeVLM, a vision-language model with an active attention-control mechanism modeled on human visual reasoning. The 4B-parameter model autonomously generates gaze tokens that dynamically focus attention on task-relevant visual details, improving performance by 4-5% over comparable VLMs without enlarging the context window.
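The core idea — a gaze signal that re-focuses the model on relevant patches while keeping context size fixed — can be sketched in a toy form. This is a hypothetical illustration, not the paper's implementation: `gaze_select`, `refocus`, and the scoring scheme are assumptions for exposition.

```python
# Toy sketch of gaze-style active attention (hypothetical; not the
# paper's actual mechanism). A "gaze token" scores image patches, and
# the model then attends only to the top-k most relevant patches,
# keeping total context length fixed rather than appending tokens.

def gaze_select(patch_scores, k):
    """Return indices of the k highest-scoring patches, in order."""
    ranked = sorted(range(len(patch_scores)),
                    key=lambda i: patch_scores[i], reverse=True)
    return sorted(ranked[:k])

def refocus(patch_tokens, patch_scores, k):
    """Replace the patch set with its k most task-relevant members,
    so the context window never grows."""
    keep = gaze_select(patch_scores, k)
    return [patch_tokens[i] for i in keep]

# Example: 6 coarse patches; gaze scores favor patches 1 and 4.
tokens = ["p0", "p1", "p2", "p3", "p4", "p5"]
scores = [0.1, 0.9, 0.2, 0.1, 0.8, 0.3]
print(refocus(tokens, scores, 2))  # ['p1', 'p4']
```

The key property mirrored here is that refocusing selects within the existing token budget instead of adding new visual tokens, which matches the paper's claim of gains without a larger context window.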