Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis
Researchers introduce a standardized real-world benchmark for evaluating Vision-Language-Action (VLA) models on the affordable SO-101 robotic platform. The study benchmarks multiple VLA and imitation learning policies, finding that execution instability is the dominant failure mode and that recovery capabilities vary significantly across architectures. The results highlight the gap between simulation-based evaluations and real-world robotic deployment.
The robotics and AI research community has long relied on simulation environments and expensive robotic platforms for evaluating embodied AI systems, creating a significant blind spot regarding performance on affordable hardware. This research addresses that gap by systematically evaluating multiple Vision-Language-Action models on the low-cost SO-101 platform, providing the first standardized real-world benchmark that moves beyond simplistic success-rate metrics.
The study reveals critical insights about VLA model robustness under realistic deployment constraints. While stronger pretrained VLA models generally outperform imitation learning baselines, their advantage diminishes substantially when facing real-world embodiment uncertainty and hardware limitations. Execution instability, rather than planning or perception errors, emerges as the primary failure source, suggesting that architectural choices significantly influence robustness to physical variability. The structured failure taxonomy and recovery-aware evaluation metrics introduced here represent a methodological advance, enabling researchers to identify specific failure modes and understand how different architectures handle error recovery.
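To make the idea of recovery-aware evaluation concrete, the following is a minimal sketch of how per-episode failure labels could be aggregated into metrics that go beyond a binary success rate. It is not the study's implementation: the three failure categories are taken from the summary above, while the `FailureEvent`, `Episode`, and `summarize` names, the definition of recovery rate (recovered failures divided by all observed failures), and the example episodes are assumptions made purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class FailureMode(Enum):
    # Categories named in the summary; the paper's full taxonomy may differ.
    PERCEPTION = "perception"
    PLANNING = "planning"
    EXECUTION = "execution"


@dataclass
class FailureEvent:
    mode: FailureMode
    recovered: bool  # hypothetical label: did the policy get back on track?


@dataclass
class Episode:
    success: bool
    failures: list[FailureEvent] = field(default_factory=list)


def summarize(episodes):
    """Aggregate success rate, failure-mode shares, and recovery rate."""
    if not episodes:
        raise ValueError("need at least one episode")
    events = [ev for ep in episodes for ev in ep.failures]
    by_mode = Counter(ev.mode for ev in events)
    return {
        "success_rate": sum(ep.success for ep in episodes) / len(episodes),
        # Share of all observed failures attributed to each mode.
        "failure_share": (
            {m.value: by_mode[m] / len(events) for m in FailureMode}
            if events else {}
        ),
        # Assumed definition: recovered failures / all failures.
        "recovery_rate": (
            sum(ev.recovered for ev in events) / len(events) if events else None
        ),
    }


# Illustrative, made-up episodes (not results from the study).
episodes = [
    Episode(True),
    Episode(True, [FailureEvent(FailureMode.EXECUTION, recovered=True)]),
    Episode(False, [FailureEvent(FailureMode.EXECUTION, recovered=False)]),
    Episode(False, [FailureEvent(FailureMode.PERCEPTION, recovered=False)]),
]
summary = summarize(episodes)
print(summary)
```

In this toy data, two policies with the same success rate could still differ in which failure mode dominates and in how often they recover, which is the kind of distinction a binary success metric hides.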
This work has implications for democratizing robotics development and deployment. As robot costs decrease and more developers access affordable platforms, understanding real-world performance becomes increasingly valuable. The benchmark establishes practical expectations for what current VLA models can achieve outside controlled environments, guiding investment decisions in robotics startups and informing research priorities.
Future research should investigate why execution instability dominates and how architectural modifications can improve physical robustness. The SO-101 benchmark may become a standard evaluation tool similar to ImageNet for computer vision, driving innovation toward more reliable embodied AI systems suitable for cost-conscious applications.
- The SO-101 benchmark establishes the first standardized real-world evaluation for VLA models on affordable robotic hardware, revealing performance gaps between simulation and physical deployment.
- Execution instability emerges as the dominant failure source, not planning or perception errors, indicating architectural choices critically influence real-world robustness.
- Stronger pretrained VLA models outperform imitation learning baselines, but task-dependent performance variations remain significant under low-cost robotic constraints.
- Recovery capability varies substantially across VLA architectures, suggesting model design choices have outsized impact on handling error conditions.
- Structured failure analysis and recovery metrics provide deeper insights than binary success rates, establishing methodology for more rigorous embodied AI evaluation.