AI · Bullish · arXiv – CS AI · 15h ago · 7/10
FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies
Researchers introduce FineVLA, a framework that enhances Vision-Language-Action (VLA) models for robotics by adding fine-grained instruction supervision on top of simple goal-level commands. The work combines 972,247 trajectories into a curated dataset of 47,159 fine-grained trajectories. It shows that training on a mix of fine-grained and coarse instructions raises the real-world robot manipulation success rate to 62.7%, compared with 49.9% for goal-level instructions alone.