AI · Bullish · arXiv – CS AI · 7h ago · 7/10
DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation
Researchers introduce DeMaVLA, a Vision-Language-Action foundation model designed to enable robots to generalize deformable-object manipulation across diverse household tasks without requiring category-specific training. The model combines a VLM backbone with an efficient action expert using flow matching and is trained on 5,000 hours of real-world demonstrations plus corrective learning from robot failures, achieving strong performance on folding benchmarks.
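The summary does not say how the action expert is implemented, so the following is only a generic sketch of flow-matching sampling, not DeMaVLA's code. It assumes a linear noise-to-action path and replaces the learned network with a hypothetical closed-form velocity field (`toy_velocity`) that points at one made-up target action. The names `toy_velocity`, `sample_action`, and `target_action` are illustrative and do not come from the paper.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned action expert.

    For a single target action and the linear path
    x_t = (1 - t) * x0 + t * x1, the ideal velocity field
    is (target - x) / (1 - t).
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Made-up 3-dimensional action, for illustration only.
target_action = [0.2, -0.5, 0.1]
action = sample_action(target_action)
```

In a real system the velocity would presumably come from a network conditioned on the VLM backbone's features rather than from a known target. The few-step integration loop is what makes this family of samplers comparatively cheap at inference time.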