AI · Bullish · arXiv – CS AI · 10h ago · 6/10
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
Researchers introduce VisionFoundry, a synthetic data generation pipeline that uses LLMs and text-to-image models to create targeted training data for vision-language models (VLMs). The approach targets VLMs' weakness in visual perception tasks and reports 7-10% improvements on benchmarks, without requiring human annotation or reference images.