Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models
Researchers identify reference-frame dominance as the cause of static motion in image-to-video models and propose DyMoS, a training-free method that rebalances attention mechanisms to improve motion dynamics while preserving image fidelity. The approach requires no model retraining and introduces a single controllable parameter for motion strength adjustment.
Image-to-video (I2V) generation has long produced less dynamic motion than its text-to-video counterparts. The limitation stems from an architectural issue: the reference image's influence propagates too heavily through the generated sequence, constraining the model's ability to create natural inter-frame variation. Prior solutions deliberately weakened the image conditioning signal, but they either demanded additional training or compromised visual fidelity to the source image.
The research identifies a specific mechanism, excessive self-attention allocated to reference-frame tokens, that explains why generated frames remain overly constrained by the initial image. DyMoS addresses this by rebalancing attention during the early denoising stages, decoupling reference fidelity from motion generation without architectural modifications. Because the method is training-free and operates entirely at inference time, it can be applied to deployed systems and existing model variants without updating model weights, and a single scalar parameter controls motion strength.
For developers building video generation applications, this offers a practical fix for a persistent quality limitation. Users seeking more dynamic outputs from I2V systems gain tunable motion strength without sacrificing their input image's visual characteristics. The research reports consistent improvements across multiple state-of-the-art backbones, suggesting broad applicability. The work shows how understanding an architectural bottleneck can yield a practical solution that enhances model capabilities without introducing computational overhead or a training burden.
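
To make the idea of rebalancing attention away from reference-frame tokens more concrete, here is a minimal sketch in plain Python. It is not the DyMoS implementation, whose exact formulation is not given in this summary. It assumes one simple way to rebalance: subtracting a scalar from the attention logits of reference-frame keys, and doing so only during an early fraction of the denoising steps. The names `rebalanced_attention`, `reference_mass`, `strength_at_step`, `strength`, and `early_fraction` are all hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rebalanced_attention(logits, is_reference, strength):
    """Attention weights for one query after down-weighting reference keys.

    logits       -- raw attention scores of one query against every key
    is_reference -- True where the key belongs to the reference frame
    strength     -- hypothetical scalar >= 0; 0.0 leaves attention unchanged

    Assumption: the rebalancing is an additive logit bias, which scales the
    unnormalized weight of each reference key by exp(-strength).
    """
    if strength < 0:
        raise ValueError("strength must be non-negative")
    biased = [s - strength if ref else s
              for s, ref in zip(logits, is_reference)]
    return softmax(biased)


def reference_mass(weights, is_reference):
    """Total attention probability assigned to reference-frame keys."""
    return sum(w for w, ref in zip(weights, is_reference) if ref)


def strength_at_step(step, num_steps, strength, early_fraction=0.3):
    """Enable rebalancing only in the early denoising steps.

    The 0.3 cutoff is an illustrative assumption, not a value from the source.
    """
    return strength if step < early_fraction * num_steps else 0.0


if __name__ == "__main__":
    # Two reference-frame keys with high scores, four keys from other frames.
    logits = [2.0, 2.0, 0.5, 0.5, 0.5, 0.5]
    is_ref = [True, True, False, False, False, False]
    for step in (0, 5):  # one early step, one later step out of 10
        s = strength_at_step(step, num_steps=10, strength=1.5)
        weights = rebalanced_attention(logits, is_ref, s)
        print(step, s, round(reference_mass(weights, is_ref), 3))
```

In this sketch a larger `strength` moves more attention probability from the reference frame to the other frames, and a value of 0.0 recovers the unmodified attention, which is one way a single scalar could act as a continuous motion control.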
- Reference-frame dominance caused by excessive self-attention to reference tokens suppresses motion generation in I2V models
- DyMoS provides training-free, model-agnostic motion improvement through attention pathway rebalancing
- The method maintains visual fidelity and image consistency while enabling dynamic motion control
- Single scalar parameter allows continuous, user-adjustable control over motion strength without retraining
- Results show consistent improvements across multiple state-of-the-art I2V architectures