ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies
ReCoVLA introduces a framework that enhances vision-language-action (VLA) policies by using external vision-language models (VLMs) to identify failures and guide residual policy training for recovery. The approach keeps the pretrained VLA policy frozen and compiles structured rewards for correction, achieving 66.7% success in simulation, compared to 36.7% for the baseline, and 61.7% in zero-shot real-world deployment.
ReCoVLA addresses a critical limitation in current vision-language-action policies: their brittleness when encountering off-nominal or failure states during robotic manipulation tasks. Rather than retraining entire models or relying on direct VLM-generated actions, the framework strategically decouples high-level semantic understanding from low-level motor control. This architectural choice enables better generalization across different VLA architectures while maintaining computational efficiency through selective reward compilation.
The research builds on growing interest in hybrid approaches that combine large pretrained models with targeted fine-tuning. As robotics increasingly depends on foundation models for language understanding, the challenge shifts from initial task performance to robust failure recovery. ReCoVLA's use of external VLMs as semantic reward selectors rather than direct action generators represents a pragmatic middle ground, reducing the burden on language models while leveraging their strengths in contextual understanding.
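The division of labor described above can be illustrated with a minimal sketch. It is not the authors' implementation: the reward-term library, the keyword-based `mock_vlm_select` stand-in for a real VLM query, the summed reward, and the toy `residual_policy` are all assumptions made for illustration. The sketch only shows the shape of the idea, in which a VLM picks which reward terms apply to a diagnosed failure, those terms are compiled into one reward function, and a residual correction is added to the output of a frozen base policy.

```python
from typing import Callable, Dict, List, Sequence

State = Dict[str, float]
Action = List[float]
RewardFn = Callable[[State], float]

# Hypothetical reward-term library; names and formulas are illustrative only.
REWARD_TERMS: Dict[str, RewardFn] = {
    "reach_object": lambda s: -s["gripper_to_object_dist"],
    "align_grasp": lambda s: -abs(s["grasp_angle_error"]),
}


def mock_vlm_select(failure_description: str) -> List[str]:
    """Stand-in for an external VLM call: maps a failure description to term names."""
    text = failure_description.lower()
    selected = []
    if "dropped" in text or "missed" in text:
        selected.append("reach_object")
    if "misaligned" in text or "rotated" in text:
        selected.append("align_grasp")
    return selected


def compile_reward(term_names: Sequence[str]) -> RewardFn:
    """Combine the selected terms into one reward function (assumed here: a plain sum)."""
    terms = [REWARD_TERMS[name] for name in term_names]  # KeyError on unknown names
    return lambda state: sum(term(state) for term in terms)


def frozen_vla_policy(state: State) -> Action:
    """Placeholder for the pretrained VLA policy; it is never updated."""
    return [0.10, 0.00, -0.05]


def residual_policy(state: State, gain: float) -> Action:
    """Toy trainable correction: nudges along x in proportion to the distance error."""
    return [gain * state["gripper_to_object_dist"], 0.0, 0.0]


def recovery_action(state: State, gain: float = 0.5) -> Action:
    """Final command = frozen base action + residual correction."""
    base = frozen_vla_policy(state)
    delta = residual_policy(state, gain)
    return [b + d for b, d in zip(base, delta)]


state = {"gripper_to_object_dist": 0.2, "grasp_angle_error": -0.1}
reward_fn = compile_reward(mock_vlm_select("object dropped and gripper misaligned"))
print(reward_fn(state))        # compiled reward for this state
print(recovery_action(state))  # base action plus residual correction
```

In this sketch the VLM never emits motor commands. It only chooses reward terms, so in a real system only the residual policy would need training against the compiled reward while the base policy stays untouched.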
The performance improvements are substantial. In simulation, the success rate rises from 36.7% with baseline fine-tuning to 66.7%, nearly double, and the 61.7% real-world result is obtained without additional physical training. The zero-shot sim-to-real transfer matters particularly for robotics applications where real-world data collection is expensive. This approach could influence how teams develop robotic systems by emphasizing modular failure recovery over monolithic policy learning, potentially reducing development timelines and data requirements for production deployments.
- ReCoVLA achieves 66.7% success in simulation compared to 36.7% baseline by using VLM-guided reward compilation for failure recovery
- The framework decouples semantic understanding from motor control by using vision-language models as reward selectors rather than action generators
- Zero-shot sim-to-real transfer achieves 61.7% success, enabling deployment without additional physical robot training
- The approach remains compatible with different VLA architectures by keeping pretrained policies frozen and training only residual recovery policies
- Modular failure recovery design reduces development complexity and data requirements compared to full policy retraining