Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models
Researchers demonstrate that vision-language-action (VLA) models can generate robot actions effectively in a single step by simply biasing training toward high-noise states, eliminating the need for complex multi-step diffusion techniques borrowed from image generation. The approach matches ten-step decoding on standard benchmarks and reaches a 95.6% success rate on LIBERO-Long with a 1.4B parameter model.
This research challenges the conventional wisdom that diffusion-based models require sophisticated multi-step generation processes. The authors identify a fundamental difference between image generation and action prediction: while images benefit from iterative refinement across many steps, robot actions operate in a lower-dimensional space conditioned on rich contextual information. By recognizing this structural asymmetry, they develop a surprisingly elegant solution that removes unnecessary complexity from the training pipeline.
The findings fit a broader trend toward simplification in machine learning, in which researchers question assumptions inherited from adjacent domains. VLA models adopted image generation's iterative denoising approach without examining whether that computational overhead was necessary for action spaces. This work demonstrates that strong performance can emerge from standard diffusion training combined with a noise schedule biased toward high-noise states, rather than from additional model components or specialized distillation techniques.
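The summary does not give the exact schedule, so the following is only a rough sketch of what biasing training toward high-noise states could look like. It draws noise levels from a Beta distribution skewed toward t = 1 (pure noise) and builds a flow-matching-style training pair. The Beta form, the `high_noise_bias` parameter, the function names, the interpolation path, and the action-chunk shape are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def sample_noise_levels(batch_size, rng, high_noise_bias=3.0):
    """Draw noise levels t in [0, 1], where t = 1 means pure noise.

    high_noise_bias = 1.0 gives the uniform baseline; larger values shift
    probability mass toward t = 1. The Beta form is an assumption of this
    sketch, not the schedule used in the paper.
    """
    return rng.beta(high_noise_bias, 1.0, size=batch_size)


def make_training_pair(actions, rng, high_noise_bias=3.0):
    """Build (noisy input, noise level, regression target) for one batch."""
    t = sample_noise_levels(actions.shape[0], rng, high_noise_bias)
    noise = rng.standard_normal(actions.shape)
    # Reshape t so it broadcasts over every non-batch action dimension.
    t_b = t.reshape(-1, *([1] * (actions.ndim - 1)))
    noisy = (1.0 - t_b) * actions + t_b * noise  # linear path from action to noise
    target_velocity = noise - actions            # derivative of the path w.r.t. t
    return noisy, t, target_velocity


rng = np.random.default_rng(0)
# Hypothetical batch: 4096 chunks of 16 timesteps with 7 action dimensions.
actions = rng.standard_normal((4096, 16, 7))
noisy, t, target = make_training_pair(actions, rng)
```

With `high_noise_bias=3.0` the average sampled noise level sits near 0.75 instead of the 0.5 of a uniform schedule, so most training examples start close to pure noise, which is also where a one-step sampler starts at inference time.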
For the robotics and embodied AI industry, the main gain is efficiency. Generating an action in one step instead of ten reduces latency in robotic systems and lowers compute requirements during deployment. A 1.4B parameter model reaching a 95.6% success rate on the long-horizon LIBERO-Long manipulation tasks makes real-world robot applications more practical. The real-robot validation across different architectures suggests the method generalizes beyond benchmark environments.
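To make the latency argument concrete, the sketch below counts model forward passes for one-step and ten-step Euler decoding. `CountingToyPolicy` is a hypothetical stand-in for a VLA action head, and its velocity field is the idealized case in which the context pins down a single target action. That idealization is assumed here purely for illustration: it makes both samplers land on the same action by construction and says nothing about how a trained policy behaves.

```python
import numpy as np


def euler_decode(velocity_fn, noise, num_steps):
    """Integrate from t = 1 (pure noise) to t = 0 (action) with Euler steps."""
    x = noise.copy()
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t_now, t_next in zip(ts[:-1], ts[1:]):
        # One model forward pass per integration step.
        x = x + (t_next - t_now) * velocity_fn(x, t_now)
    return x


class CountingToyPolicy:
    """Hypothetical stand-in for a VLA action head; counts forward passes."""

    def __init__(self, target_action):
        self.target_action = target_action
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        # Exact velocity on a straight noise-to-action path with a single target.
        return (x - self.target_action) / t


rng = np.random.default_rng(0)
target_action = rng.standard_normal((16, 7))  # hypothetical 16 x 7 action chunk
noise = rng.standard_normal((16, 7))

one_step_policy = CountingToyPolicy(target_action)
ten_step_policy = CountingToyPolicy(target_action)
one_step = euler_decode(one_step_policy, noise, num_steps=1)
ten_step = euler_decode(ten_step_policy, noise, num_steps=10)
```

After running this, `one_step_policy.calls` is 1 and `ten_step_policy.calls` is 10. In a real system each call is a full pass through the action head, so the step count translates fairly directly into per-action latency.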
The implications extend to model development efficiency. Removing teacher models, distillation stages, and auxiliary objectives simplifies the training pipeline while maintaining or improving performance, which lowers the compute barrier for groups building VLA models. Future work should explore how far single-step generation scales with model size and whether similar principles apply to other conditional generation tasks beyond robotics.
- One-step action generation in VLA models matches ten-step decoding performance when training incorporates high-noise bias schedules
- The approach requires no teacher models, distillation stages, or auxiliary objectives, simplifying the standard diffusion training pipeline
- A 1.4B parameter model achieved a 95.6% success rate on LIBERO-Long, demonstrating scalability of the method
- Real-robot bimanual experiments validate that the sampler trend generalizes beyond simulation benchmarks
- The work challenges the assumption that image generation techniques should directly transfer to lower-dimensional action prediction tasks