AI · Bullish · arXiv – CS AI · 8h ago · 7/10
DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
Researchers introduce DyLLM, a training-free inference framework that accelerates diffusion language model decoding by up to 9.6x by selectively computing only salient tokens rather than processing entire sequences at each step. The approach identifies important tokens through attention context similarity and reuses cached activations for stable tokens, maintaining baseline accuracy across benchmarks.
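The selection idea in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `select_salient`, `decode_step`), the use of cosine similarity as the "attention context similarity" measure, the 0.99 threshold, and the `recompute` callback are all assumptions made for illustration. The sketch marks a token as salient when its attention context has drifted from the previous step, recomputes only those tokens, and reuses cached activations for the rest.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_salient(prev_ctx, curr_ctx, threshold=0.99):
    """Return indices of tokens whose attention context changed since the last step.

    Assumption: a token counts as salient when the similarity between its
    previous and current context vectors falls below `threshold`.
    """
    return [
        i
        for i, (prev, curr) in enumerate(zip(prev_ctx, curr_ctx))
        if cosine(prev, curr) < threshold
    ]


def decode_step(cache, prev_ctx, curr_ctx, recompute, threshold=0.99):
    """Recompute activations for salient tokens only; reuse the cache for stable ones.

    `recompute` is a hypothetical callback standing in for the full per-token
    computation (for example, the remaining layers of the model).
    """
    salient = select_salient(prev_ctx, curr_ctx, threshold)
    new_cache = list(cache)  # stable tokens keep their cached activations
    for i in salient:
        new_cache[i] = recompute(i)
    return new_cache, salient


# Toy usage: only the middle token's context changed between steps.
prev_ctx = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_ctx = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
cache = ["act0", "act1", "act2"]
new_cache, salient = decode_step(cache, prev_ctx, curr_ctx, lambda i: f"fresh{i}")
```

In this toy case only one of three tokens is recomputed. Any real speedup depends on how many tokens stay stable per denoising step, which the sketch does not model.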