Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
Researchers present RTPurbo, a method that transforms standard full-attention language models into efficient sparse models within just hundreds of training steps. By leveraging the observation that LLMs are intrinsically sparse, the approach achieves up to 9.36× speedup during prefill and 2.01× during decode at 1M context length while maintaining near-lossless accuracy.
RTPurbo addresses a critical bottleneck in deploying large language models at scale: the quadratic computational cost of full attention during long-context inference. Rather than redesigning models from scratch with sparse attention, the researchers discovered that standard full-attention models already exhibit sparse patterns that can be exploited with minimal additional training. This finding bridges the gap between training efficiency and inference performance, two typically competing objectives in LLM optimization.
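To make the quadratic cost concrete, the sketch below counts how many query-key pairs a causal attention head scores over a sequence, first with full attention and then with a local sliding window. The function names and the window size of 4,096 are illustrative assumptions, not values from the paper. The counts describe attention work only, so the ratio is not comparable to the reported end-to-end speedups.

```python
def causal_pairs_full(n):
    """Query-key pairs scored by causal full attention over n tokens."""
    # Token i (1-indexed) attends to itself and the i - 1 tokens before it.
    return n * (n + 1) // 2


def causal_pairs_window(n, window):
    """Pairs scored when each query sees itself and at most window - 1 earlier tokens."""
    if n <= window:
        return causal_pairs_full(n)
    # The first `window` tokens behave like full attention; every later token sees `window` keys.
    return causal_pairs_full(window) + (n - window) * window


if __name__ == "__main__":
    n = 1_000_000   # context length discussed in the article
    window = 4096   # assumed local window, chosen only for illustration
    full = causal_pairs_full(n)
    local = causal_pairs_window(n, window)
    print(f"full: {full:,} pairs, windowed: {local:,} pairs, ratio: {full / local:.1f}x")
```

The full count grows with the square of the sequence length, while the windowed count grows linearly, which is why heads that only need local context are cheap to serve at long context lengths.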
The approach builds on three empirical observations about how attention behaves in pretrained models. First, most attention heads operate locally and do not need the full context. Second, the information needed for long-range token retrieval concentrates in a low-dimensional subspace, so tokens can be indexed with just 16 dimensions. Third, the number of tokens a query actually needs varies from query to query, which makes adaptive selection superior to a fixed budget. Together, these observations suggest that running full attention in every head, or giving every query the same token budget, is more conservative than it needs to be.
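The second and third observations can be illustrated with a minimal sketch: score keys in a 16-dimensional index space, then keep the smallest set of tokens whose approximate attention mass reaches a threshold. This is not the authors' implementation. The projection matrix `proj`, the `mass` threshold of 0.95, and all function names are hypothetical and chosen for illustration; only the 16-dimensional index width comes from the article.

```python
import math

INDEX_DIM = 16  # index width reported in the article; everything else here is assumed


def project(vec, proj):
    """Map a full-width vector to INDEX_DIM dims using a (hypothetical) projection matrix."""
    return [sum(v * w for v, w in zip(vec, row)) for row in proj]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(query, keys, proj, mass=0.95):
    """Return indices of the smallest key set whose approximate attention mass reaches `mass`."""
    q_low = project(query, proj)
    scale = 1.0 / math.sqrt(INDEX_DIM)
    # Cheap approximate scores computed in the low-dimensional index space.
    scores = [scale * sum(a * b for a, b in zip(q_low, project(k, proj))) for k in keys]
    probs = softmax(scores)
    # Add keys from most to least probable until the threshold is covered.
    order = sorted(range(len(keys)), key=lambda i: probs[i], reverse=True)
    chosen, covered = [], 0.0
    for i in order:
        chosen.append(i)
        covered += probs[i]
        if covered >= mass:
            break
    return sorted(chosen)


if __name__ == "__main__":
    # Identity projection keeps the demo easy to follow (assumes a head width of 16).
    identity = [[1.0 if i == j else 0.0 for j in range(INDEX_DIM)] for i in range(INDEX_DIM)]
    keys = [[0.0] * INDEX_DIM for _ in range(10)]
    keys[3] = [8.0] + [0.0] * (INDEX_DIM - 1)
    focused_query = [8.0] + [0.0] * (INDEX_DIM - 1)
    diffuse_query = [0.0] * INDEX_DIM
    print(len(select_tokens(focused_query, keys, identity)))
    print(len(select_tokens(diffuse_query, keys, identity)))
```

In this toy setup a query that matches one key strongly ends up with a very small selected set, while a query with no clear match needs every key to reach the threshold. A fixed top-k budget would either waste work on the first query or drop needed tokens on the second.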
For the AI infrastructure and development community, RTPurbo offers substantial practical value. Practitioners can convert existing full-attention models without expensive retraining from scratch, reducing both computational cost and carbon footprint. The speedups, particularly the 9.36× prefill improvement, translate directly into lower latency and reduced inference cost, improving the economics of long-context applications such as document analysis, code understanding, and long-form reasoning.
The implications extend beyond individual applications. If sparse attention can be obtained with minimal fine-tuning rather than native sparse pretraining, it supports a different workflow: make models efficient after pretraining instead of designing them for sparsity from the start. This could accelerate adoption of efficient inference across already-deployed models and encourage further investigation into the hidden efficiency properties of standard architectures.
- RTPurbo achieves up to 9.36× prefill speedup and 2.01× decode speedup at 1M context with only hundreds of training steps.
- Full-attention LLMs contain intrinsic sparsity patterns that can be exploited without native sparse pretraining.
- Only a small subset of attention heads requires full context processing; others can operate with sparse, dynamically-selected tokens.
- Long-range retrieval operates in low-dimensional subspaces, enabling efficient 16-dimensional token indexing.
- The method preserves near-lossless accuracy while delivering substantial efficiency gains for long-context inference.