AIBullisharXiv โ CS AI ยท 9h ago6/10
๐ง
Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition
Researchers propose CroBo, a new visual state representation learning framework that helps robotic agents better understand dynamic environments by encoding both semantic identities and spatial locations of scene elements. The framework uses a global-to-local reconstruction method that compresses observations into compact tokens, achieving state-of-the-art performance on robot policy learning benchmarks.