AI · Bullish · arXiv – CS AI · 14h ago · 7/10
Planning with the Views via Scene Self-Exploration
Researchers introduce ViewSuite, a benchmark showing that vision-language models (VLMs) struggle to plan multi-step camera movements in 3D environments even though they understand individual view transformations. A self-exploration framework with view graph distillation substantially improves planning, raising Qwen2.5-VL-7B accuracy from 2.5% to 47.8%.
🧠 GPT-5 · 🧠 Gemini