AI · Neutral · arXiv – CS AI · 18h ago · 6/10
What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction
Researchers demonstrate that temporal video pretraining, not pixel reconstruction quality, drives action-relevant structure in video world model latent spaces. Across diverse encoder architectures, video-pretrained self-supervised models consistently outperform reconstruction-based approaches in recovering action information, with implications for developing more effective embodied AI systems.