AI · Bullish · arXiv - CS AI · 5h ago · 7/10
CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
Researchers introduce CropVLM, a reinforcement learning-based method that enables Vision-Language Models (VLMs) to dynamically focus on relevant image regions, improving performance on fine-grained understanding tasks. The approach works with existing VLMs without modifying them and demonstrates significant performance gains on text recognition and document analysis, without requiring human-labeled training data.