CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
CrossVLA presents a comprehensive empirical study optimizing Vision-Language-Action models across different architectural paradigms, introducing a flow-matching log-probability estimator that enables Direct Preference Optimization on continuous-action models. The research demonstrates significant performance improvements using DoRA over LoRA, achieving up to 20% gains on specific benchmarks, while revealing inference-time bottlenecks that constrain acceleration potential to 21%.
CrossVLA addresses a gap in Vision-Language-Action (VLA) model optimization by extending Direct Preference Optimization (DPO), a post-training technique established for language models, to continuous-action flow-matching architectures. The core contribution is a surrogate log-probability estimator that avoids integrating the probability-flow ODE, which makes DPO practical for non-autoregressive models. As a result, preference alignment becomes available across the growing range of VLA architectures rather than on a single paradigm only.
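To make the idea concrete, here is a minimal sketch of how a surrogate log-probability could be plugged into the standard DPO objective. The summary does not give the form of the paper's estimator, so the surrogate below (the negative squared flow-matching error at one shared noise sample and time step) is an assumption chosen for illustration. The names `fm_surrogate_logp`, `dpo_loss`, `policy_velocity`, and `reference_velocity` are hypothetical, as is the linear interpolation path.

```python
import math

def fm_surrogate_logp(velocity_fn, action, noise, t):
    """Hypothetical surrogate log-probability (an assumption, not the paper's
    estimator): negative squared flow-matching error at one (noise, t) sample.
    Uses the linear path x_t = (1 - t) * noise + t * action."""
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target = [a - n for n, a in zip(noise, action)]  # target velocity on this path
    pred = velocity_fn(x_t, t)
    return -sum((p - g) ** 2 for p, g in zip(pred, target))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Toy velocity fields standing in for real networks (illustrative only).
GOAL = [1.0, 0.0]

def policy_velocity(x_t, t):
    # Points straight at GOAL, so an action equal to GOAL has zero error.
    return [(g - x) / (1.0 - t) for g, x in zip(GOAL, x_t)]

def reference_velocity(x_t, t):
    # An uninformed reference that always predicts zero velocity.
    return [0.0 for _ in x_t]

chosen, rejected = [1.0, 0.0], [0.0, 1.0]
noise, t = [0.0, 0.0], 0.5  # the same noise and t are reused for every term

loss = dpo_loss(
    fm_surrogate_logp(policy_velocity, chosen, noise, t),
    fm_surrogate_logp(policy_velocity, rejected, noise, t),
    fm_surrogate_logp(reference_velocity, chosen, noise, t),
    fm_surrogate_logp(reference_velocity, rejected, noise, t),
)
print(f"DPO loss: {loss:.4f}")
```

Because every term is a single forward pass of the velocity network, no ODE solve is needed. A real implementation would likely average the surrogate over several noise samples and time steps to reduce variance.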
The empirical findings matter for robotics and embodied AI. DoRA consistently outperforms LoRA as a parameter-efficient fine-tuning method, with a mean improvement of 10.4 percentage points across the LIBERO benchmarks and zero variance on the Object manipulation tasks across three random seeds. Gains of this size and consistency suggest more robust behavior in downstream robotic applications. The inference-time analysis reveals hard limits: the denoise loop accounts for 78.6% of latency, while prefix-K/V caching yields only a 21% acceleration, indicating that further speedups will require architectural changes rather than optimization-layer tweaks.
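One way to read the two latency figures together is as an Amdahl's-law bound. The sketch below assumes that prefix-K/V caching can only shrink the work outside the denoise loop; that assumption is ours, not a claim from the source. The summary also does not say whether "21%" refers to a latency reduction or a throughput gain, so both quantities are computed. The function name `max_gain_from_removing` is hypothetical.

```python
def max_gain_from_removing(fraction):
    """Upper bounds if a stage that takes `fraction` of total latency became free.

    Returns (latency_reduction, speedup), following Amdahl's law.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    latency_reduction = fraction          # share of wall-clock time saved
    speedup = 1.0 / (1.0 - fraction)      # old latency / new latency
    return latency_reduction, speedup

denoise_share = 0.786                     # reported share of the denoise loop
non_denoise_share = 1.0 - denoise_share   # everything caching could affect (assumed)

reduction, speedup = max_gain_from_removing(non_denoise_share)
print(f"max latency reduction: {reduction:.1%}")
print(f"max speedup: {speedup:.2f}x")
```

Under this reading, even a perfect cache cannot remove more than about 21.4% of latency, which is in line with the reported 21% figure and with the conclusion that larger gains must come from the denoise loop itself.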
For the AI development community, CrossVLA signals that preference-alignment methods are portable across architectural boundaries, reducing the engineering effort required to improve model behavior in each paradigm. The public release of code, checkpoints, and training logs makes CrossVLA a reproducible foundation for follow-up research. The multi-view temporal projection head, which reaches 99.5% retrieval recall, is immediately useful for data-efficient training strategies.
- DoRA parameter-efficient tuning achieves a 10.4pp mean improvement over the OpenVLA baseline, with zero variance on object manipulation tasks
- A surrogate flow-matching log-probability estimator enables Direct Preference Optimization on continuous-action models without ODE integration
- The denoise loop accounts for 78.6% of inference latency while prefix-K/V caching caps out at 21% acceleration, indicating architectural bottlenecks
- The multi-view temporal projection head achieves 99.5% retrieval recall for same-task initialization from 6000 LIBERO frames
- Full reproducibility, with open-sourced code, checkpoints, and training logs at github.com/lz-googlefycy/vla-lab
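
For readers unfamiliar with the DoRA method named in the first takeaway, the sketch below shows its core reparameterization: the pretrained weight is split into a per-column magnitude and a direction, and the low-rank LoRA update is applied to the direction only. This is a plain-Python illustration of the general DoRA formulation, not CrossVLA's training code; the helper names and the toy matrices are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_norms(w):
    """L2 norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]

def dora_merge(w0, b, a, magnitude):
    """DoRA merged weight: magnitude * (W0 + B A) / ||W0 + B A||, column-wise."""
    delta = matmul(b, a)                                  # low-rank LoRA update
    v = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))]
         for i in range(len(w0))]                         # updated direction
    norms = column_norms(v)
    return [[magnitude[j] * v[i][j] / norms[j] for j in range(len(v[0]))]
            for i in range(len(v))]

w0 = [[3.0, 0.0], [4.0, 2.0]]     # toy pretrained weight (2 x 2)
a = [[0.5, -0.5]]                 # LoRA "A" matrix (rank 1 x 2)
b_init = [[0.0], [0.0]]           # LoRA "B" starts at zero
magnitude = column_norms(w0)      # magnitude starts at the column norms of W0

merged_init = dora_merge(w0, b_init, a, magnitude)        # equals w0 at init
merged_trained = dora_merge(w0, [[1.0], [0.0]], a, magnitude)
print(merged_init)
print(column_norms(merged_trained))
```

At initialization the merged weight equals the pretrained weight. After the low-rank factors change, the column norms of the merged weight are still set by the magnitude vector alone, which is what distinguishes DoRA from plain LoRA.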