ScalingAttention: Discovering Intrinsic Sparse Attention Topology for Video Diffusion Transformers
Researchers introduce ScalingAttention, a training-free framework that accelerates video diffusion transformers by using stable, sparse attention patterns encoded in the model weights instead of searching for them dynamically at runtime. The method achieves up to a 1.90× end-to-end speedup while preserving, and reportedly improving, video generation fidelity, addressing a critical computational bottleneck in AI-generated video production.
Video diffusion transformers have emerged as powerful tools for high-fidelity generation, but their reliance on 3D full attention mechanisms creates a fundamental efficiency problem: computational complexity scales quadratically with sequence length. ScalingAttention addresses this by making a crucial observation about how attention patterns function. Rather than treating attention masks as purely input-dependent variables that must be recomputed for each prompt, the researchers discovered that the high-attention regions for individual heads stabilize into predictable, prompt-agnostic topologies that can be extracted from model weights offline.
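To make the quadratic scaling concrete, the following is a minimal sketch that counts the query-key pairs scored by full 3D attention over a latent video grid, and the count that remains when only a fraction of those pairs is kept. The function names, the grid dimensions, and the `keep_ratio` parameter are illustrative assumptions, not values taken from the ScalingAttention work.

```python
def token_count(frames: int, height: int, width: int) -> int:
    """Tokens in a latent video grid (one token per spatio-temporal cell)."""
    return frames * height * width


def full_attention_pairs(frames: int, height: int, width: int) -> int:
    """Full 3D attention scores every token against every token: n * n pairs."""
    n = token_count(frames, height, width)
    return n * n


def sparse_attention_pairs(frames: int, height: int, width: int,
                           keep_ratio: float) -> int:
    """Pairs left when only a fraction `keep_ratio` of the full map is computed."""
    return int(full_attention_pairs(frames, height, width) * keep_ratio)


if __name__ == "__main__":
    # Hypothetical grid sizes, chosen only to show the growth rate.
    short_clip = full_attention_pairs(21, 30, 52)
    long_clip = full_attention_pairs(42, 30, 52)
    # Doubling the number of frames doubles the tokens and quadruples the pairs.
    print(long_clip / short_clip)
```

Because the pair count grows with the square of the token count, skipping low-attention regions reduces work in proportion to the fraction of the map that is dropped, which is why a stable sparse pattern is attractive.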
This insight enables a two-pronged approach. WEST (Weight-Encoded Sparse Topology) eliminates runtime search overhead by pre-computing optimal sparsity masks, while FAST (Fidelity-Aware Sensitivity Tuning) allows adaptive per-head sparsity adjustment during inference based on quality requirements. The training-free nature of the framework makes it immediately applicable to existing models without retraining costs, a significant practical advantage over dynamic pruning approaches that incur substantial memory fragmentation and computational overhead.
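The split between a fixed topology and a tunable sparsity level can be sketched with NumPy as below. This is a sketch under stated assumptions rather than the authors' implementation: the summary does not say how WEST derives masks from the weights, so the block scores here come from a calibration attention map as a stand-in, and `keep_ratio` is a hypothetical per-head knob standing in for the kind of adjustment FAST is described as making. The attention function is a dense reference that only demonstrates the masking semantics; it does not skip any computation.

```python
import numpy as np


def block_scores(attn: np.ndarray, block: int) -> np.ndarray:
    """Average an (n, n) attention map over block x block tiles."""
    nb = attn.shape[0] // block
    tiles = attn[:nb * block, :nb * block].reshape(nb, block, nb, block)
    return tiles.mean(axis=(1, 3))


def topk_block_mask(scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Offline step (assumed): keep the top-scoring key blocks per query block."""
    nb = scores.shape[1]
    k = max(1, int(round(keep_ratio * nb)))  # always keep at least one block
    idx = np.argsort(-scores, axis=1)[:, :k]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask


def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray, block: int) -> np.ndarray:
    """Dense reference: softmax attention restricted to the kept blocks."""
    token_mask = np.repeat(np.repeat(mask, block, axis=0), block, axis=1)
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = np.where(token_mask, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, block = 8, 4, 2
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    calib = np.exp(q @ k.T / np.sqrt(d))
    calib /= calib.sum(axis=-1, keepdims=True)
    mask = topk_block_mask(block_scores(calib, block), keep_ratio=0.5)
    print(mask.astype(int))
    print(masked_attention(q, k, v, mask, block).shape)
```

In this sketch the mask is built once and reused for every input, while `keep_ratio` can differ per head, which mirrors the decoupling described above without claiming to reproduce it.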
The 1.90X end-to-end speedup on Wan2.1 models combined with improved fidelity represents a meaningful advance for production video generation systems. This research directly impacts inference efficiency for content creators, streaming platforms, and enterprises deploying video AI, reducing both latency and computational resource requirements. Hardware co-design with bit-wise block-sparse kernels demonstrates attention to practical deployment constraints.
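The summary does not describe how the bit-wise block-sparse kernels encode their masks. One plausible reading, shown here purely as an assumption, is that each query block stores its kept key blocks as bits of an integer word, so that a kernel can test or enumerate kept blocks with cheap bit operations. The helper names `pack_row` and `kept_blocks` are hypothetical.

```python
from typing import Iterable, List


def pack_row(keep_flags: Iterable[bool]) -> int:
    """Pack one row of a block mask into an integer; bit j is set if key block j is kept."""
    word = 0
    for j, keep in enumerate(keep_flags):
        if keep:
            word |= 1 << j
    return word


def kept_blocks(word: int) -> List[int]:
    """Recover the indices of the kept key blocks from a packed word."""
    indices = []
    j = 0
    while word:
        if word & 1:
            indices.append(j)
        word >>= 1
        j += 1
    return indices


if __name__ == "__main__":
    row = [True, False, True, True]
    word = pack_row(row)
    print(bin(word), kept_blocks(word))
```

A packed representation of this kind keeps the mask small and contiguous in memory, which is one way a static topology could avoid the fragmentation attributed to dynamic pruning.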
Future developments should focus on how these topology insights transfer across different model architectures and whether similar patterns exist in other transformer-based generative tasks beyond video.
- ScalingAttention achieves up to a 1.90× speedup in video diffusion transformers by leveraging weight-encoded sparse attention patterns
- The framework is training-free, extracting stable attention topologies offline rather than computing them dynamically per input
- Decoupling topology discovery from sparsity control enables both runtime efficiency and adaptive fidelity optimization
- Hardware-aligned sparse kernels turn the theoretical savings into practical acceleration
- The method addresses the quadratic computational bottleneck that has limited practical deployment of high-quality video generation