FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning
FiberTune is a new training methodology for vision-language-action (VLA) policies that prevents visual feature collapse during fine-tuning by preserving action-invariant visual information. The approach demonstrates consistent improvements across simulation benchmarks and physical robot tasks without adding computational overhead at inference time.
FiberTune addresses a fundamental challenge in fine-tuning vision-language-action models: action-supervised training learns task-specific behaviors effectively, but it can inadvertently destroy visual structure in the state representation that does not directly affect action predictions. This phenomenon, termed residual visual collapse along action fibers, creates brittle policies that fail in situations that are visually novel but behaviorally equivalent. The researchers formalize this problem and propose filtering out the action-predictive feature directions and aligning the remaining residuals to a frozen teacher model, which preserves the broader visual understanding learned during pretraining.
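The filtering-and-alignment idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the summary does not specify the exact loss. It assumes a linear probe whose weight rows span the action-predictive directions, student and teacher features of the same dimension, and a cosine-distance alignment term. The function names (`action_subspace_basis`, `filter_residual`, `residual_alignment_loss`) are hypothetical.

```python
import numpy as np

def action_subspace_basis(probe_weights, tol=1e-10):
    """Orthonormal basis (d, r) for the directions a linear probe uses to predict actions.

    probe_weights: (action_dim, d) weights of a hypothetical linear action probe.
    """
    _, s, vt = np.linalg.svd(probe_weights, full_matrices=False)
    keep = s > tol * s.max()          # drop numerically null directions
    return vt[keep].T

def filter_residual(features, basis):
    """Remove the component of each feature vector inside the action-predictive subspace."""
    return features - (features @ basis) @ basis.T

def residual_alignment_loss(student, teacher, basis, eps=1e-8):
    """Mean cosine distance between filtered student and frozen-teacher features.

    Assumption: both feature sets are filtered with the same basis.
    """
    rs = filter_residual(student, basis)
    rt = filter_residual(teacher, basis)
    num = np.sum(rs * rt, axis=1)
    den = np.linalg.norm(rs, axis=1) * np.linalg.norm(rt, axis=1) + eps
    return float(np.mean(1.0 - num / den))
```

Under these assumptions, changing the student features only along action-predictive directions leaves the loss unchanged, so the term penalizes drift in the residual alone and leaves the action-supervised objective free to shape the rest.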
The empirical validation spans multiple architectures, including pi_0.5 and OpenVLA-OFT, on CALVIN and physical robot benchmarks, showing consistent gains of 5-10 percentage points in success rate. On real robot pick-place tasks, success improved from 72.7% to 78.1% (a 5.4-point gain), demonstrating practical relevance beyond simulation. This work builds on the growing recognition that modern deep learning models often contain more information than strictly necessary for immediate task objectives, and that preserving this auxiliary structure improves generalization and robustness.
For the robotics and embodied AI community, FiberTune offers a practical regularization technique that integrates seamlessly into existing training pipelines without inference-time costs. The approach is particularly valuable as VLA models scale to increasingly complex tasks and diverse environments where visual generalization becomes critical. The diagnostic validation showing correlation between performance gains and preserved residual rank provides interpretability often missing in robotics research, offering practitioners concrete insights into why the method works and how to diagnose similar issues in their own systems.
- FiberTune prevents visual feature collapse in vision-language-action fine-tuning by filtering action-predictive directions and preserving residual visual structure
- Consistent improvements of 5-10 percentage points across simulation benchmarks and physical robot tasks without inference-time overhead
- The method aligns probe-filtered residuals to a frozen teacher while regularizing effective rank to maintain visual generalization
- Physical robot experiments show task success improvement from 72.7% to 78.1% on pick-place tasks using this training-time regularization
- Residual diagnostics confirm performance gains correlate with increased teacher alignment and rank, validating the action-fiber hypothesis
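For readers who want to run a similar rank diagnostic on their own features, the sketch below computes one common definition of effective rank: the exponential of the Shannon entropy of the normalized singular values. Whether FiberTune uses exactly this definition is an assumption; the summary mentions effective rank without defining it. The function name `effective_rank` is hypothetical.

```python
import numpy as np

def effective_rank(residuals, eps=1e-12):
    """Entropy-based effective rank of a (num_samples, d) feature matrix.

    Assumption: features are mean-centered before the SVD, so the value
    reflects the spread of variance across directions.
    """
    centered = residuals - residuals.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    p = s / (s.sum() + eps)           # normalize singular values to a distribution
    p = p[p > eps]                    # ignore numerically zero entries
    return float(np.exp(-np.sum(p * np.log(p))))
```

A value near 1 indicates that the residual features have collapsed onto a single direction, while a value near the feature dimension indicates that variance is spread evenly. Tracking this number over fine-tuning steps is one way to check for the collapse described above.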