AI · Bullish · arXiv – CS AI · 18h ago · 7/10
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
Researchers present RTPurbo, a method that converts standard full-attention language models into efficient sparse-attention models within just hundreds of training steps. Building on the observation that attention in LLMs is intrinsically sparse, the approach achieves up to 9.36× speedup during prefill and 2.01× during decode at 1M context length while maintaining near-lossless accuracy.