Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision
Researchers demonstrate that VAE-based world models develop organized spatial semantic representations through physical exploration alone, without linguistic input. The geometric structure of the physical world emerges as the primary organizing principle, with prediction performance and semantic alignment improving together across training, suggesting a shared underlying mechanism.
This research addresses a fundamental question in machine learning: what organizational principles shape the representations a neural network learns from raw sensory data? The findings indicate that world models spontaneously develop spatially coherent latent spaces that mirror physical geometry, representing position 6.6x more accurately than a random-encoder baseline. The experimental design is particularly compelling because of its double-knockout approach: standard KL regularization is used to deliberately disrupt geometric structure, and prediction and semantic alignment then collapse together by step 50,000. Reducing the beta hyperparameter from 0.1 to 0.001 restores access to geometric structure and recovers both capabilities together, providing strong causal evidence for the shared-driver hypothesis.
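To make the role of beta concrete, here is a minimal sketch of a beta-weighted VAE objective in NumPy. It is not the study's implementation: the squared-error reconstruction term, the standard-normal prior, the function name `beta_vae_loss`, and the random toy inputs are all assumptions chosen for illustration. Only the two beta values (0.1 and 0.001) come from the summary above.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta):
    """Hypothetical beta-VAE objective: reconstruction error + beta * KL."""
    # Squared reconstruction error, summed over features, averaged over the batch
    # (assumed form; the study's reconstruction/prediction loss may differ).
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions.
    kl = np.mean(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1))
    return recon + beta * kl, recon, kl

# Toy inputs, invented for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                   # batch of observations
x_recon = x + 0.1 * rng.normal(size=(8, 16))   # imperfect reconstructions
mu = rng.normal(size=(8, 4))                   # posterior means
log_var = np.zeros((8, 4))                     # posterior log-variances

loss_strong, recon, kl = beta_vae_loss(x, x_recon, mu, log_var, beta=0.1)
loss_weak, _, _ = beta_vae_loss(x, x_recon, mu, log_var, beta=0.001)
print(f"recon={recon:.3f}  kl={kl:.3f}")
print(f"loss at beta=0.1: {loss_strong:.3f}  loss at beta=0.001: {loss_weak:.3f}")
```

The KL term pulls each posterior toward the same standard-normal prior, so a larger beta penalizes latents that spread out to encode position. With beta at 0.001 that pressure is 100 times weaker, which is one plausible reading of why the lower setting leaves room for geometric structure.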
This work extends ongoing efforts to understand unsupervised representation learning in embodied AI systems. Previous research established that physical interaction generates rich environmental structure; this study quantifies how that structure manifests in learned representations and argues that it is necessary for downstream performance. The reported negative Spearman correlation (r = -0.61) between the prediction measure and semantic alignment across temporal checkpoints, negative presumably because prediction is scored as an error that falls as alignment rises, is consistent with these capabilities developing interdependently rather than separately.
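As an illustration of how a checkpoint-level rank correlation of this kind can be computed, the sketch below implements Spearman's coefficient in NumPy. The per-checkpoint numbers are invented and are not the study's data, and treating the prediction measure as an error (lower is better) is an assumption made here to show how a negative coefficient can arise while both capabilities improve. The helper `spearman` is hypothetical and assumes no tied values.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for 1-D sequences without ties."""
    # argsort of argsort turns values into ranks 0..n-1 (valid when there are no ties).
    ranks_a = np.argsort(np.argsort(a)).astype(float)
    ranks_b = np.argsort(np.argsort(b)).astype(float)
    # Spearman's coefficient is the Pearson correlation of the ranks.
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

# Hypothetical measurements at six training checkpoints (illustration only).
prediction_error = np.array([0.90, 0.70, 0.75, 0.50, 0.40, 0.45])
semantic_alignment = np.array([0.10, 0.30, 0.20, 0.60, 0.35, 0.50])

rho = spearman(prediction_error, semantic_alignment)
print(f"Spearman correlation across checkpoints: {rho:.2f}")
```

A negative value here means that checkpoints with lower prediction error tend to rank higher on alignment. A correlation across checkpoints shows co-variation only, which is why the knockout experiment carries the causal argument.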
For embodied AI and robotics development, these findings suggest that geometric grounding can emerge from prediction objectives without explicit semantic supervision. This has practical implications for training agents in physical environments: developers may be able to prioritize prediction loss rather than separately engineering spatial representations. The results also inform broader debates about inductive biases and whether interaction-based learning is sufficient to produce semantically meaningful models, positioning physical geometry as a fundamental organizing principle, comparable to the way humans develop spatial understanding through embodied exploration.
- World models develop spatially organized semantic representations from physical exploration alone, achieving 6.6x better position accuracy than random encoders
- Prediction performance and semantic alignment co-improve during training, supporting a shared-driver hypothesis linking these capabilities
- KL regularization disrupts geometric structure and collapses prediction and semantic alignment together, supporting their causal interdependence
- Physical world geometry emerges as the primary organizing principle of learned representations, independent of linguistic supervision
- The findings may enable more efficient training of embodied agents by prioritizing prediction objectives without explicit semantic engineering