AI · Bullish · arXiv – CS AI · 9h ago · 6/10
Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics
Researchers demonstrate that vision-language models (VLMs) can predict future image states by first learning inverse dynamics, that is, identifying the action that connects a pair of frames, and then using that capability to bootstrap forward prediction through synthetic data annotation and inference-time verification. The approach achieves results competitive with specialized image-editing models on the Aurora-Bench benchmark.
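The bootstrapping idea can be illustrated with a toy sketch. This is not the paper's implementation: frames are replaced by integers, the actions "inc" and "dec" are invented for illustration, and `inverse_dynamics`, `annotate_pairs`, `predict_with_verification`, and `noisy_proposer` are hypothetical names. It only shows how an inverse dynamics model could label unlabeled frame pairs and then check candidate predictions at inference time.

```python
from typing import Callable, List, Optional, Tuple

State = int   # toy stand-in for an image frame (assumption)
Action = str  # toy action label (assumption)


def inverse_dynamics(before: State, after: State) -> Optional[Action]:
    """Toy inverse dynamics model: name the action that explains a frame pair."""
    delta = after - before
    if delta == 1:
        return "inc"
    if delta == -1:
        return "dec"
    return None  # no known action explains this pair


def annotate_pairs(
    pairs: List[Tuple[State, State]],
) -> List[Tuple[State, Action, State]]:
    """Synthetic annotation: turn unlabeled pairs into (state, action, next_state)."""
    triples = []
    for before, after in pairs:
        action = inverse_dynamics(before, after)
        if action is not None:  # drop pairs the inverse model cannot explain
            triples.append((before, action, after))
    return triples


def predict_with_verification(
    state: State,
    action: Action,
    propose: Callable[[State, Action, int], List[State]],
    num_candidates: int = 4,
) -> Optional[State]:
    """Inference-time verification: return the first candidate whose
    inferred action matches the requested one, or None if none does."""
    for candidate in propose(state, action, num_candidates):
        if inverse_dynamics(state, candidate) == action:
            return candidate
    return None


def noisy_proposer(state: State, action: Action, n: int) -> List[State]:
    """Hypothetical forward model that returns several guesses, some wrong."""
    offsets = [3, -1, 1, 0]
    return [state + offset for offset in offsets[:n]]


triples = annotate_pairs([(0, 1), (5, 4), (2, 7)])
prediction = predict_with_verification(10, "inc", noisy_proposer)
print(triples)
print(prediction)
```

In this sketch the annotated triples would serve as training data for a forward model, while the verification loop filters that model's candidate outputs using the same inverse dynamics check.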