#parallel-decoding News & Analysis

15 articles tagged with #parallel-decoding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 25 · 7/10

Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding

Researchers introduce Streaming-dLLM, a training-free optimization framework that accelerates Diffusion Language Models by up to 68.2x through spatial suffix pruning and dynamic temporal decoding. The approach maintains generation quality while addressing inefficiencies inherent in block-wise diffusion decoding, making parallel decoding models more computationally practical.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

Researchers propose Unified Energy (Uni-E), a novel approach to improve parallel text generation in Diffusion Language Models by addressing token dependency and invariance issues. The method achieves exact computation without sampling-based estimation and demonstrates effectiveness across various model scales, narrowing the performance gap with traditional auto-regressive decoding.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Researchers introduce LocateAnything, a new vision-language model framework that uses Parallel Box Decoding to detect and localize objects simultaneously rather than sequentially, improving both inference speed and accuracy. The team curated a 138-million-sample dataset and demonstrated significant performance improvements across multiple benchmarks.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Regulating Branch Parallelism in LLM Serving

Researchers introduce TAPER, an admission controller for managing parallel branch execution in LLM serving systems. The system dynamically regulates how many concurrent decoding branches are allowed per request step, balancing throughput gains against degradation to co-batched requests, achieving 1.77x improvement in goodput over conservative baselines.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Researchers introduce FS-DFM, a discrete flow-matching model that generates long text 128x faster than standard diffusion models while maintaining quality parity. The breakthrough uses few-step sampling with teacher guidance distillation, achieving in 8 steps what previously required 1,024 evaluations.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

Researchers propose ADAS, a training-free reranking algorithm that improves parallel token decoding in masked diffusion language models by using attention weights as soft penalties, so that correlated predictions are not committed in the same step. The method achieves 9-10 percentage point improvements on benchmarks such as GSM8K and HumanEval with minimal computational overhead.
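To make the idea of an attention-based soft penalty concrete, here is a minimal sketch of one way such a reranker could work. It is not the ADAS algorithm from the paper: the function name `select_positions`, the `penalty` parameter, and the scoring rule (confidence minus the strongest attention link to an already chosen position) are all assumptions made for illustration.

```python
from typing import List, Sequence


def select_positions(
    confidence: Sequence[float],
    attention: Sequence[Sequence[float]],
    k: int,
    penalty: float = 1.0,
) -> List[int]:
    """Greedily pick up to k masked positions to commit in one decoding step.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    score(i) = confidence[i] - penalty * (strongest attention link between
    position i and any position already chosen in this step).
    """
    chosen: List[int] = []
    remaining = set(range(len(confidence)))

    while remaining and len(chosen) < k:

        def score(i: int) -> float:
            # Strongest link in either direction to an already chosen position.
            link = max(
                (max(attention[i][j], attention[j][i]) for j in chosen),
                default=0.0,
            )
            return confidence[i] - penalty * link

        # Iterating in sorted order breaks ties toward the lowest index.
        best = max(sorted(remaining), key=score)
        chosen.append(best)
        remaining.remove(best)

    return chosen


# Toy example: positions 0 and 1 are both confident but attend strongly
# to each other, so the penalized selection prefers position 2 over 1.
conf = [0.9, 0.85, 0.6]
attn = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
print(select_positions(conf, attn, k=2))               # [0, 2]
print(select_positions(conf, attn, k=2, penalty=0.0))  # [0, 1]
```

With `penalty=0.0` the sketch reduces to plain top-k selection by confidence, which is the behavior the soft penalty is meant to improve on when two confident positions depend on each other.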

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Supportive Token Revealing for Fast Diffusion Language Model Decoding

Researchers introduce AXON, a training-free module that improves parallel decoding efficiency in discrete diffusion language models by intelligently selecting which confident tokens to reveal first, reducing computational steps while maintaining or improving output quality.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

FLARE: Diffusion for Hybrid Language Model

Researchers introduce FLARE, a conversion framework that enables large language models with hybrid attention mechanisms to function as both autoregressive and diffusion models, addressing a key limitation in parallel decoding while maintaining model capability. The approach demonstrates competitive performance with existing diffusion language models while delivering throughput gains in concurrent serving scenarios.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

SimSD: Simple Speculative Decoding in Diffusion Language Models

Researchers propose SimSD, a novel speculative decoding algorithm that enables diffusion language models to achieve up to 7.46x faster inference speeds while maintaining generation quality. By introducing a plug-and-play masking strategy, SimSD addresses the fundamental incompatibility between diffusion models' bidirectional attention and token-level speculative verification, a technique proven effective for autoregressive models.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

Researchers introduce Chatterbox-Flash, a zero-shot text-to-speech model combining block-diffusion decoding with streaming capabilities. The system addresses token distribution bias through prior-calibrated scoring and early-decoding schedules, achieving high-fidelity speech synthesis with low latency comparable to autoregressive systems.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding

Researchers introduce COVER, a new verification technique for diffusion language models that eliminates inefficient token oscillations during parallel decoding. By using KV cache overrides to preserve context while selectively verifying tokens in a single forward pass, COVER accelerates inference while maintaining output quality.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference

Researchers introduce DepCap, a training-free framework that optimizes diffusion language model (DLM) inference through adaptive block-wise parallel decoding. The method achieves up to 5.63× speedup by using cross-step signals to determine block boundaries and identifying conflict-free token subsets for safe parallel execution, maintaining quality while significantly accelerating inference.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

Researchers evaluated eight large Masked Diffusion Language Models (MDLMs, up to 100B parameters) and found they still underperform comparable autoregressive models despite the promise of parallel token generation. The study finds that MDLMs exhibit task-dependent decoding behavior and proposes a Generate-then-Edit paradigm to improve performance while retaining the efficiency of parallel decoding.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6

MetaState: Persistent Working Memory for Discrete Diffusion Language Models

Researchers introduce MetaState, a recurrent augmentation for discrete diffusion language models (dLLMs) that adds persistent working memory to improve text generation quality. The system addresses the 'Information Island' problem where intermediate representations are discarded between denoising steps, achieving improved accuracy on LLaDA-8B and Dream-7B models with minimal parameter overhead.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4

AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size

Researchers introduce AdaBlock-dLLM, a training-free optimization technique for diffusion-based large language models that adaptively adjusts block sizes during inference based on semantic structure. The method addresses limitations in conventional fixed-block semi-autoregressive decoding, achieving up to 5.3% accuracy improvements under the same throughput budget.