AI · Neutral · arXiv – CS AI · 9h ago · 6/10
🧠
Learning Visual Feature-Based World Models via Residual Latent Action
Researchers introduce Residual Latent Action (RLA), a new latent action representation learned from DINO visual features. Built on it, the RLA-WM world model predicts future visual features rather than raw pixels, making it more efficient and accurate than pixel-space alternatives. RLA-WM outperforms existing feature-based and video-diffusion approaches while being orders of magnitude faster, with applications to robot learning from offline video demonstrations.
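To make the core idea concrete, here is a minimal sketch of a feature-space world model with a residual latent action. This is not the paper's actual architecture: the linear maps `F` and `E`, the dimensions, and the pseudoinverse decoder are all illustrative stand-ins for learned networks. The latent action is inferred as an encoding of the residual between the observed next feature and an action-free prediction, and the world model adds the decoded residual back:

```python
import numpy as np

rng = np.random.default_rng(0)
D, A = 16, 4  # feature dim (e.g. DINO-style features) and latent action dim

# Hypothetical linear stand-ins for learned networks:
F = rng.normal(scale=0.1, size=(D, D))  # action-free dynamics: z_t -> base prediction
E = rng.normal(size=(A, D))             # encoder: feature residual -> latent action
G = np.linalg.pinv(E)                   # decoder: latent action -> feature residual

def infer_latent_action(z_t, z_next):
    """Latent action = encoded residual between z_next and the base prediction."""
    residual = z_next - F @ z_t
    return E @ residual

def predict_next(z_t, a):
    """World model: base prediction plus the decoded residual action."""
    return F @ z_t + G @ a

# Given two consecutive feature frames, infer the action and roll forward.
z_t = rng.normal(size=D)
z_next = rng.normal(size=D)
a = infer_latent_action(z_t, z_next)
err_with_action = np.linalg.norm(predict_next(z_t, a) - z_next)
err_without = np.linalg.norm(predict_next(z_t, np.zeros(A)) - z_next)
```

Because the latent action captures (a projection of) the true residual, conditioning on it should reduce prediction error relative to the action-free rollout; operating on a 16-dimensional feature vector instead of raw pixels is also what makes this kind of model cheap to roll out.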