One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Researchers introduce OneWM-VLA, a new approach to vision-language-action (VLA) models that compresses visual input to a single token per frame while maintaining or improving long-horizon task performance. The method posts notable gains on robotics benchmarks, including 61.3% success on MetaWorld MT50 and 60% on a real-world cloth-folding task, suggesting that much of the visual bandwidth consumed by current world models may be unnecessary.
OneWM-VLA addresses a fundamental design inefficiency in current vision-language-action models used for robotic control and long-horizon planning. Rather than processing full-resolution visual streams through world modules, the approach uses Adaptive Attention Pooling to compress each frame into a single semantic token, reducing computational overhead while preserving task-relevant information. This challenges a common assumption in embodied AI: that world models need dense visual representations to plan effectively.
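To make the compression step concrete, here is a minimal sketch of attention pooling a frame's patch tokens down to one token, in the spirit of the Adaptive Attention Pooling described above. The module structure, dimensions, and the use of a single learned query are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SingleTokenAttentionPool(nn.Module):
    """Compress N patch tokens from one frame into a single token via a
    learned query attending over the patch sequence.
    (Illustrative sketch; names and sizes are assumptions.)"""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim), e.g. 256 ViT patch embeddings
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.norm(pooled)  # (batch, 1, dim): one token per frame

# Usage: a batch of 8 frames, 256 patch tokens each -> 8 single frame tokens
pool = SingleTokenAttentionPool(dim=1024)
frames = torch.randn(8, 256, 1024)
frame_tokens = pool(frames)  # shape (8, 1, 1024)
```

The key property is that the output length is fixed at one token regardless of input resolution, so downstream world-model rollouts cost the same per frame no matter how many patches the vision encoder emits.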
The technical innovation emerged from observing that existing world-model-augmented VLAs operate under constrained adaptation budgets when built on frozen pretrained backbones. Previous architectures treated world-model rollouts as secondary to action prediction, leaving the two optimization objectives misaligned. OneWM-VLA unifies them through flow matching, yielding tighter coupling between latent state generation and action trajectories. This architectural coherence likely explains performance improvements beyond what reduced visual bandwidth alone would suggest.
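A rough sketch of what a joint flow-matching objective could look like, assuming straight-line (rectified-flow-style) interpolation paths and a single velocity network over concatenated latent-state and action targets. `velocity_net` and the conditioning scheme are hypothetical; the paper's exact formulation may differ.

```python
import torch

def flow_matching_loss(velocity_net, state_target, action_target, cond):
    """Joint conditional flow-matching loss over latent states and actions.
    Sketch only: velocity_net and the conditioning are assumptions. Along
    the straight path x_t = (1 - t) * x0 + t * x1, the network regresses
    the constant velocity (x1 - x0)."""
    x1 = torch.cat([state_target, action_target], dim=-1)  # joint target
    x0 = torch.randn_like(x1)                               # noise sample
    t = torch.rand(x1.size(0), 1, device=x1.device)         # per-sample time
    xt = (1 - t) * x0 + t * x1                              # linear interpolant
    v_target = x1 - x0                                      # straight-path velocity
    v_pred = velocity_net(xt, t, cond)                      # shared-backbone features
    return torch.mean((v_pred - v_target) ** 2)
```

Because one network regresses a single velocity field over both targets, gradients from future-state prediction and action prediction flow through the same parameters, which is exactly the coupling a unified objective is after.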
The empirical results carry significant implications for robotics and embodied AI development. Achieving 61.3% success on MetaWorld MT50 (up from 47.9%) and 95.6% on LIBERO-Long demonstrates that the approach scales across diverse benchmarks, and the real-world validation, 60% success on cloth folding versus a 20% baseline, suggests practical viability beyond simulation. With only 14.71M trainable LoRA parameters on a 2B-parameter backbone, OneWM-VLA also keeps the adaptation budget small enough to matter for deployment.
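For context on why the 14.71M figure is small, a LoRA adapter adds only rank * (in_features + out_features) trainable parameters per adapted weight matrix while the base weights stay frozen. The sketch below illustrates the counting; the rank, dimensions, and target layers are assumptions, not OneWM-VLA's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank update W + (alpha/r) * B @ A.
    Sketch for illustrating the small trainable budget; rank and dimensions
    here are assumptions, not the paper's configuration."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained backbone frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(2048, 2048), rank=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536 = rank * (in_features + out_features)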
This research may shape how developers approach future VLA scaling and fine-tuning. Organizations building robotics systems could adopt similar token-compression strategies to cut inference cost and memory requirements without sacrificing performance, potentially enabling deployment on resource-constrained robotic platforms.
- OneWM-VLA compresses visual input to a single token per frame without compromising long-horizon robotic task performance
- Achieves 61.3% success on MetaWorld MT50 and 60% on real-world cloth folding with only 14.71M LoRA parameters
- Unified flow-matching objective between latent state and action prediction improves upon previous separate-decoder approaches
- Demonstrates that visual bandwidth reduction may enable more efficient deployment of VLA models on resource-constrained platforms
- Real-world validation on a Piper robot arm suggests practical viability beyond simulation environments