#inference-speedup News & Analysis

2 articles tagged with #inference-speedup. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Researchers present RTPurbo, a method that converts standard full-attention language models into efficient sparse models within just hundreds of training steps. Building on the observation that LLMs are intrinsically sparse, the approach achieves up to 9.36× speedup during prefill and 2.01× during decode at a 1M-token context length while maintaining near-lossless accuracy.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

Researchers identify dimensional misalignment as a critical bottleneck in compressed large language models, where parameter reduction fails to improve GPU performance due to hardware-incompatible tensor dimensions. They propose GAC (GPU-Aligned Compression), a new optimization method that achieves up to 1.5× speedup while maintaining model quality by ensuring hardware-friendly dimensions.

🧠 Llama
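
To make "hardware-friendly dimensions" concrete, here is a minimal sketch of snapping a compressed layer width to the nearest multiple of a fixed alignment. It is not the GAC algorithm from the paper: the helper name `round_to_multiple`, the alignment of 64, and the sample widths are all assumptions chosen for illustration.

```python
def round_to_multiple(dim: int, multiple: int = 64) -> int:
    """Round dim to the nearest positive multiple of `multiple` (ties round up).

    The default of 64 is an assumed alignment, not a value from the paper.
    """
    if dim <= 0 or multiple <= 0:
        raise ValueError("dim and multiple must be positive")
    rounded = ((dim + multiple // 2) // multiple) * multiple
    # Never collapse a small dimension to zero.
    return max(multiple, rounded)


# Hypothetical layer widths left over after pruning or low-rank compression.
compressed_dims = [4000, 2731, 1365]
aligned_dims = [round_to_multiple(d) for d in compressed_dims]

for before, after in zip(compressed_dims, aligned_dims):
    print(f"{before} -> {after}")
```

Rounding to the nearest multiple keeps the parameter count close to the compressed target; which alignment actually helps depends on the GPU and kernel library, so the value should be measured rather than assumed.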