Rank-Factorized Implicit Neural Bias: Scaling Super-Resolution Transformer with FlashAttention
Researchers propose Rank-Factorized Implicit Neural Bias (RIB), a positional encoding method that replaces relative positional bias in Super-Resolution Transformers and makes them compatible with FlashAttention hardware acceleration. The reported results include 35.63 dB PSNR on Urban100×2, with training and inference time reduced by 2.1× and 2.9× respectively, addressing a critical scalability bottleneck in SR model development.
Super-Resolution Transformers have struggled with a fundamental architectural constraint: their reliance on relative positional bias (RPB) makes them incompatible with efficient attention kernels like FlashAttention, creating a severe computational penalty that prevents scaling. This paper addresses that gap by introducing RIB, which encodes positional information through low-rank implicit neural representations concatenated with pixel tokens rather than added to attention scores. This architectural shift transforms what was previously an element-wise bias operation into a dot-product operation, enabling hardware-accelerated attention computation.
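The identity behind this shift can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes the positional bias has already been factorized as B = U Vᵀ, uses random numbers as stand-ins for the factors an implicit neural network would produce, and leaves out the softmax scaling factor. All function and variable names are hypothetical.

```python
import random

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def scores_with_added_bias(Q, K, U, V):
    """Logits q_i . k_j + B[i][j], with the bias B = U V^T added element-wise."""
    bias = [[dot(u, v) for v in V] for u in U]  # explicit N x N bias matrix
    return [[dot(q, k) + bias[i][j] for j, k in enumerate(K)]
            for i, q in enumerate(Q)]

def scores_with_concatenated_factors(Q, K, U, V):
    """The same logits from a single dot product over concatenated vectors."""
    Qc = [q + u for q, u in zip(Q, U)]  # list concatenation: [q_i ; u_i]
    Kc = [k + v for k, v in zip(K, V)]  # list concatenation: [k_j ; v_j]
    return [[dot(q, k) for k in Kc] for q in Qc]

def random_matrix(rows, cols, rng):
    """Rows x cols matrix of uniform values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, head_dim, rank = 6, 8, 2  # illustrative sizes, not from the paper
Q = random_matrix(n_tokens, head_dim, rng)
K = random_matrix(n_tokens, head_dim, rng)
U = random_matrix(n_tokens, rank, rng)  # stand-in positional factors (queries)
V = random_matrix(n_tokens, rank, rng)  # stand-in positional factors (keys)

added = scores_with_added_bias(Q, K, U, V)
concat = scores_with_concatenated_factors(Q, K, U, V)
max_gap = max(abs(a - c) for ra, rc in zip(added, concat) for a, c in zip(ra, rc))
print(f"largest difference between the two formulations: {max_gap:.2e}")
```

Because the concatenated form is an ordinary query-key dot product, no N×N bias matrix has to be built or added, which is the property a fused attention kernel needs. A real kernel also scales the logits before the softmax, so the positional factors would have to be rescaled to compensate.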
The problem reflects a broader challenge in deep learning: domain-specific architectural constraints often diverge from hardware optimization opportunities. While natural language processing and other vision tasks have benefited widely from Transformer scaling, image super-resolution remained limited to small receptive fields because of computational overhead. RIB breaks this bottleneck by enabling 96×96 attention windows, substantially larger than previous SR Transformers could achieve, while jointly scaling training patch sizes and dataset sizes.
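A quick calculation shows why window size matters so much here. This sketch counts the pairwise attention scores for one head in one square window. The 16×16 comparison size is an assumption chosen only for illustration and is not taken from the paper.

```python
def score_matrix_entries(window_side):
    """Number of pairwise attention scores for one square window and one head."""
    tokens = window_side * window_side  # a 96x96 window holds 9216 tokens
    return tokens * tokens

large = score_matrix_entries(96)
small = score_matrix_entries(16)  # hypothetical comparison window
print(large, small, large // small)
```

A 96×96 window gives 84,934,656 score entries per head, 1296 times the count for the 16×16 window. An element-wise bias needs a matrix of that same size, so avoiding its explicit construction becomes more valuable as windows grow.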
The performance metrics show benefits beyond speed. The 35.63 dB PSNR result on Urban100×2 represents state-of-the-art performance, while the 2.1× training speedup and 2.9× inference speedup offer practical advantages for practitioners. The introduction of convolutional local attention and cyclic window strategies further maximizes the benefits of expanded receptive fields. For AI researchers and practitioners in image processing, this work removes a significant practical barrier to scaling SR models, making larger, more capable architectures economically viable for both research and deployment.
- RIB replaces relative positional bias with low-rank implicit neural representations, enabling FlashAttention compatibility in SR Transformers
- Achieves 2.1× faster training and 2.9× faster inference while improving image quality metrics
- Enables 96×96 attention windows, substantially expanding the receptive field of SR models
- Addresses a fundamental scalability bottleneck that has kept Super-Resolution Transformers from matching the architectural advantages seen in other domains
- Combines architectural innovation with hardware-level optimization, demonstrating synergy between algorithm and accelerator design