y0news
🧠 AI | 🟢 Bullish | Importance 6/10

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

arXiv – CS AI | Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du
🤖 AI Summary

FiberTune is a new training methodology for vision-language-action (VLA) policies that prevents visual feature collapse during fine-tuning by preserving action-invariant visual information. The approach demonstrates consistent improvements across simulation benchmarks and physical robot tasks without adding computational overhead at inference time.

Analysis

FiberTune addresses a fundamental challenge in fine-tuning vision-language-action models: action-supervised training learns task-specific behaviors effectively, but it can inadvertently destroy the parts of the visual state representation that do not directly affect action predictions. This phenomenon, termed residual visual collapse along action fibers, produces brittle policies that fail in situations that are visually novel but behaviorally equivalent. The researchers formalize the problem and propose filtering out action-predictive feature directions and aligning the remaining residuals to a frozen teacher model, which preserves the broader visual understanding learned during pretraining.
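The filter-then-align idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a linear least-squares probe, student and teacher features of the same dimension, and a cosine-based alignment term. The function names `action_subspace`, `filter_residual`, and `alignment_loss` are hypothetical and chosen only for illustration.

```python
import numpy as np

def action_subspace(features, actions):
    """Orthonormal basis for the directions a linear probe uses to predict actions.

    Assumption: a least-squares linear probe stands in for whatever probe
    the paper actually uses.
    """
    # Fit W so that features (n, d) @ W (d, k) approximates actions (n, k).
    W, *_ = np.linalg.lstsq(features, actions, rcond=None)
    # Orthonormalize the probe's columns; Q has shape (d, k) when d >= k.
    Q, _ = np.linalg.qr(W)
    return Q

def filter_residual(features, Q):
    """Project out the action-predictive component, keeping the residual."""
    return features - (features @ Q) @ Q.T

def alignment_loss(student_res, teacher_res, eps=1e-8):
    """Mean (1 - cosine similarity) between student and teacher residuals."""
    s = student_res / (np.linalg.norm(student_res, axis=1, keepdims=True) + eps)
    t = teacher_res / (np.linalg.norm(teacher_res, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# Synthetic stand-ins for real features (assumption: shared feature width of 16).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(256, 16))                 # frozen teacher features
student = teacher + 0.1 * rng.normal(size=(256, 16)) # fine-tuned student features
actions = student[:, :4] @ rng.normal(size=(4, 4))   # actions depend on 4 directions

Q = action_subspace(student, actions)
loss = alignment_loss(filter_residual(student, Q), filter_residual(teacher, Q))
print("residual alignment loss:", loss)
```

In a training loop, a term like this would be added to the action loss as a regularizer and dropped at inference, which is consistent with the summary's claim of no inference-time overhead.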

The empirical validation spans multiple architectures, including pi_0.5 and OpenVLA-OFT, across CALVIN and physical robot benchmarks, showing consistent gains of 5-10 percentage points in success rate. On real robot pick-place tasks, success improved from 72.7% to 78.1% (a 5.4-point gain), demonstrating practical relevance beyond simulation. The work builds on the growing recognition that modern deep learning models often carry more information than an immediate task objective strictly requires, and that preserving this auxiliary structure improves generalization and robustness.

For the robotics and embodied AI community, FiberTune offers a practical regularization technique that integrates seamlessly into existing training pipelines without inference-time costs. The approach is particularly valuable as VLA models scale to increasingly complex tasks and diverse environments where visual generalization becomes critical. The diagnostic validation showing correlation between performance gains and preserved residual rank provides interpretability often missing in robotics research, offering practitioners concrete insights into why the method works and how to diagnose similar issues in their own systems.
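A rank diagnostic of the kind mentioned above can be sketched as follows. The summary does not say how the paper measures residual rank, so this assumes one common definition of effective rank: the exponential of the Shannon entropy of the normalized singular values. The function name `effective_rank` is hypothetical.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """Effective rank: exp of the Shannon entropy of normalized singular values.

    Assumption: this entropy-based definition is used for illustration; the
    paper may define its rank diagnostic differently.
    """
    # Center the features so the mean direction does not dominate the spectrum.
    centered = features - features.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero singular values before taking logs
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 32))                              # spread over many directions
collapsed = rng.normal(size=(512, 1)) @ rng.normal(size=(1, 32))  # one direction only

print("healthy residuals:  ", effective_rank(healthy))
print("collapsed residuals:", effective_rank(collapsed))
```

A practitioner could track this value on filtered residuals before and after fine-tuning; a sharp drop would be one sign of the collapse the article describes.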

Key Takeaways
  • FiberTune prevents visual feature collapse in vision-language-action fine-tuning by filtering action-predictive directions and preserving residual visual structure
  • Consistent improvements of 5-10 percentage points across simulation benchmarks and physical robot tasks without inference-time overhead
  • The method aligns probe-filtered residuals to a frozen teacher while regularizing effective rank to maintain visual generalization
  • Physical robot experiments show task success improvement from 72.7% to 78.1% on pick-place tasks using this training-time regularization
  • Residual diagnostics confirm performance gains correlate with increased teacher alignment and rank, validating the action-fiber hypothesis