AI Bullish · arXiv – CS AI · 9h ago · 7/10
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Researchers introduce OneWM-VLA, a vision-language-action (VLA) approach whose world model compresses the visual input to a single token per frame while maintaining or improving long-horizon task performance. The method reports strong results on robotics benchmarks, including 61.3% success on MetaWorld MT50 and 60% on a real-world cloth-folding task, suggesting that the high visual bandwidth typically allocated to world models may be unnecessary.
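The summary does not specify how the compression is done; one common way to reduce a frame's patch tokens to a single token is learned attention pooling with a single query vector. A minimal NumPy sketch under that assumption (the function names, dimensions, and the pooling choice are illustrative, not the authors' actual method):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def pool_frame(patch_tokens, query):
    """Attention-pool a frame's patch tokens (P, d) into one token (d,)."""
    d = patch_tokens.shape[1]
    scores = patch_tokens @ query / np.sqrt(d)   # (P,) similarity to the query
    weights = softmax(scores)                    # attention weights sum to 1
    return weights @ patch_tokens                # weighted average -> (d,)

rng = np.random.default_rng(0)
T, P, d = 8, 196, 256                 # frames, patches per frame, embed dim
video = rng.standard_normal((T, P, d))
query = rng.standard_normal(d)        # would be a learned parameter in practice

# Compress T x P patch tokens down to T tokens: one per frame.
frame_tokens = np.stack([pool_frame(f, query) for f in video])
print(frame_tokens.shape)  # (8, 256)
```

The downstream policy then conditions on these T single-frame tokens instead of the full T x P patch grid, which is the bandwidth reduction the title refers to.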