When Does a Video-Language Model Stop Watching? Reward Strength Controls the Formation and Reversal of Visual Shortcuts in Multimodal RLVR
Researchers demonstrate that visual shortcuts in vision-language models trained with reinforcement learning emerge sharply and can be controlled through regularization strength. The study reveals a critical intervention window where penalties applied early prevent shortcut formation, but the same penalties become less effective after the model has consolidated these shortcuts.
This research addresses a reliability problem in multimodal AI systems: vision-language models trained via reinforcement learning increasingly ignore visual input in favor of exploiting language patterns alone. The central finding is that visual shortcuts form abruptly rather than gradually, which has direct implications for training protocols. The researchers varied a regularization parameter (lambda) and observed three dynamics. First, shortcuts emerge in a narrow, reproducible optimization window. Second, increasing the penalty strength progressively suppresses reliance on shortcuts, but acquisition and removal follow asymmetric patterns. Third, timing matters: intervention before consolidation prevents shortcuts entirely, while the same penalty applied after consolidation is less effective. This hysteresis-like effect suggests that the model undergoes something like a phase transition during training that locks in the problematic behavior.
For the AI development community, these findings turn visual-shortcut collapse from an unexplained failure mode into a controllable, time-dependent phenomenon. The work bears directly on how teams should structure RLVR (reinforcement learning with verifiable rewards) training: early monitoring and intervention become core infrastructure rather than optional safeguards. The asymmetry between formation and reversal suggests that once a model commits to linguistic shortcuts, the cost of correcting it rises, which favors preventive measures over post-hoc fixes.
The research could support more robust deployment of vision-language models in applications that require genuine multimodal understanding, from autonomous systems to content verification. Understanding these dynamics lets practitioners design training protocols that maintain visual grounding from the start rather than attempting expensive corrections later.
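The early monitoring mentioned above could look something like the following minimal Python sketch. It is not the study's method: it assumes, purely for illustration, that each checkpoint is evaluated twice (once with the real visual input and once with it blanked) and that a shrinking accuracy gap signals the model has stopped relying on what it sees. The function names, the `threshold` of 0.05, and the `patience` of 2 are all hypothetical.

```python
from typing import Optional, Sequence


def visual_reliance_gap(acc_with_visual: float, acc_without_visual: float) -> float:
    """Accuracy lost when the visual input is blanked (assumed proxy metric).

    A gap near zero means the model answers about as well without
    looking, which is the signature of a language-only shortcut.
    """
    return acc_with_visual - acc_without_visual


def find_shortcut_onset(
    gaps: Sequence[float],
    threshold: float = 0.05,  # hypothetical cutoff, not from the study
    patience: int = 2,        # consecutive low-gap checkpoints required
) -> Optional[int]:
    """Return the index of the first checkpoint in a run of `patience`
    consecutive checkpoints whose gap is below `threshold`, else None."""
    run = 0
    for i, gap in enumerate(gaps):
        if gap < threshold:
            run += 1
            if run >= patience:
                return i - patience + 1
        else:
            run = 0  # a single recovered checkpoint resets the run
    return None
```

Requiring several consecutive low-gap checkpoints is a design choice to avoid reacting to one noisy evaluation; because the reported onset is abrupt, checkpoints would need to be evaluated frequently for a monitor like this to fire before consolidation.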
- Visual shortcuts in vision-language models emerge abruptly over narrow training windows, not gradually
- Regularization strength (lambda) controls both formation and reversal of shortcut reliance with dose-dependent effects
- A critical intervention window exists where early penalties prevent shortcut formation entirely, while late intervention is less effective
- Hysteresis-like asymmetry means removing consolidated shortcuts requires stronger intervention than preventing their formation
- Timing of regularization application during training is as important as its magnitude for maintaining multimodal model integrity
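To make the interplay of penalty magnitude and timing concrete, here is a minimal Python sketch of a reward with a scheduled penalty. It is an illustration under stated assumptions, not the formulation used in the study: the summary does not say how the penalty is defined, so `shortcut_score` (a number in [0, 1] estimating how much an answer relies on language alone), `start_step`, and the simple on/off schedule are hypothetical.

```python
def penalty_weight(step: int, start_step: int, lam: float) -> float:
    """Penalty coefficient at a training step (assumed step schedule).

    The penalty is off before `start_step` and equal to `lam` from
    then on, so `start_step` models *when* the intervention begins
    and `lam` models *how strong* it is.
    """
    return lam if step >= start_step else 0.0


def shaped_reward(
    task_reward: float,
    shortcut_score: float,  # hypothetical estimate of language-only reliance
    step: int,
    start_step: int,
    lam: float,
) -> float:
    """Task reward minus the scheduled shortcut penalty."""
    return task_reward - penalty_weight(step, start_step, lam) * shortcut_score


# Same lambda, different timing: an early start penalizes the step,
# a late start leaves the reward untouched at this point in training.
early = shaped_reward(1.0, 0.5, step=100, start_step=50, lam=0.4)
late = shaped_reward(1.0, 0.5, step=100, start_step=500, lam=0.4)
```

In this framing, the reported hysteresis would mean that a given `lam` with a small `start_step` can prevent shortcuts, while the same `lam` with a `start_step` after consolidation may not be enough to remove them.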