#speculative-decoding News & Analysis

42 articles tagged with #speculative-decoding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference

Researchers develop a delay-adaptive algorithm for optimizing speculative decoding in distributed LLM inference across edge-cloud systems. The study proves optimal draft length follows a finite threshold policy and introduces UCB-SpecStop, an online control algorithm that reduces per-token latency by up to 22.4% compared to existing methods while adapting to varying network conditions.

🧠 Llama
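
To illustrate what online control of draft length can look like, here is a minimal sketch that treats each candidate draft length as a bandit arm and picks one with the textbook UCB1 rule. It is not the paper's UCB-SpecStop algorithm: the class name `DraftLengthUCB`, the reward definition (negative per-token latency), and the exploration constant `c` are all assumptions made for illustration.

```python
import math


class DraftLengthUCB:
    """Hypothetical UCB1 controller over draft lengths 1..max_len."""

    def __init__(self, max_len: int, c: float = 1.0):
        self.arms = list(range(1, max_len + 1))
        self.counts = [0] * max_len          # pulls per draft length
        self.mean_reward = [0.0] * max_len   # running mean of -latency
        self.c = c
        self.t = 0                           # total observations so far

    def select(self) -> int:
        # Try every draft length once before using confidence bounds.
        for i, n in enumerate(self.counts):
            if n == 0:
                return self.arms[i]
        scores = [
            m + self.c * math.sqrt(2.0 * math.log(self.t) / n)
            for m, n in zip(self.mean_reward, self.counts)
        ]
        return self.arms[scores.index(max(scores))]

    def update(self, draft_len: int, latency_per_token: float) -> None:
        # Lower latency is better, so the reward is the negated latency.
        i = draft_len - 1
        self.t += 1
        self.counts[i] += 1
        reward = -latency_per_token
        self.mean_reward[i] += (reward - self.mean_reward[i]) / self.counts[i]


# Toy usage with made-up, fixed latencies per draft length (assumed values).
toy_latency = {1: 1.0, 2: 0.5, 3: 0.8}
ctl = DraftLengthUCB(max_len=3)
for _ in range(500):
    k = ctl.select()
    ctl.update(k, toy_latency[k])
```

In a real edge-cloud deployment the latency fed to `update` would be measured per round and would vary with network delay, which is what makes an online rule attractive in the first place.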

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

SafeSpec is a new speculative inference framework that integrates safety guardrails directly into LLM decoding acceleration without sacrificing speed gains. The method uses a lightweight safety head to detect unsafe outputs and applies reflective sampling to recover safe continuations, achieving a 15% reduction in attack success rates while maintaining a 2.06x speedup on standard workloads.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Researchers propose VIA-SD, a multi-tier verification framework for speculative decoding that uses a lightweight slim-verifier to handle medium-confidence tokens instead of always invoking full model verification. The approach reduces rejection rates by 10-22% and achieves 10-20% speedup improvements over existing speculative decoding methods while maintaining compatibility with current frameworks.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

WhiFlash introduces a novel speculative decoding method that combines autoregressive and diffusion-based drafting models through token-level routing, achieving up to 69.6% throughput improvements over existing approaches. The system uses lightweight controllers to dynamically switch between drafting paradigms based on per-token conditions, addressing a key bottleneck in LLM inference efficiency.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

Researchers introduce Speculative Thinking, a training-free framework that leverages larger AI models to guide smaller ones during inference, improving reasoning accuracy while reducing output length. The method achieves a 6.2% accuracy boost on mathematical reasoning tasks for a 1.5B parameter model with 15.7% shorter outputs, demonstrating efficiency gains without costly retraining.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

SSSD: Simply-Scalable Speculative Decoding

Researchers introduce SSSD, a training-free method for accelerating Large Language Model inference that reduces latency by up to 2.9x through n-gram matching and hardware-aware speculation. The approach matches performance of existing trained methods while eliminating deployment complexity, data preparation, and maintenance overhead.
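
As a rough illustration of training-free n-gram drafting, the sketch below proposes draft tokens by finding the most recent earlier occurrence of the last `n` tokens in the context and copying what followed it. This is a generic lookup-style drafter, not SSSD's actual implementation; the function names `ngram_draft` and `accepted_prefix` and the parameters `n` and `max_draft` are hypothetical, and the hardware-aware part of the method is not modeled here.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], n: int = 3, max_draft: int = 4) -> List[int]:
    """Copy the continuation of the latest earlier match of the last n tokens."""
    if len(tokens) <= n:
        return []
    suffix = list(tokens[-n:])
    # Scan backwards so the most recent match wins; the starting index
    # guarantees at least one token follows the matched n-gram.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == suffix:
            return list(tokens[start + n:start + n + max_draft])
    return []


def accepted_prefix(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count leading draft tokens that agree with the target model's tokens."""
    k = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        k += 1
    return k


context = [1, 2, 3, 4, 5, 1, 2, 3]
draft = ngram_draft(context)  # the earlier "1 2 3" was followed by 4, 5, 1, 2
```

The draft costs no model call to produce, so even a modest acceptance rate can pay off; the target model still verifies every drafted token in a single forward pass.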

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

Researchers introduce TAPS, a target-aware prefix selection method that improves speculative decoding by optimizing which prefixes of a diffusion-drafted tree are sent for verification. The technique achieves up to 7.9x speedup over standard autoregressive decoding and outperforms competing methods by 1.36-1.74x, addressing a fundamental inefficiency where existing approaches verify unreachable token sequences.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

SENSE is a new retrieval-based speculative decoding method that accelerates LLM inference by using semantic embeddings instead of lexical matching to retrieve candidate tokens. The approach achieves up to 3.26x speedup while maintaining generation quality, outperforming existing methods on LLaMA and Qwen models.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

BudgetDraft is a new training method for sparse-KV speculative decoding that enables faster language model inference under memory constraints. By training drafters to handle multiple KV cache budgets simultaneously, the technique achieves up to 6.55x speedup on mid-to-long context inference while maintaining acceptance rates and reducing GPU memory usage.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

HiSpec: Hierarchical Speculative Decoding for LLMs

Researchers introduce HiSpec, a hierarchical speculative decoding framework that accelerates large language model inference by using early-exit models for intermediate verification, achieving up to 2.01× throughput improvements without sacrificing accuracy.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

SPECTRE is a new LLM serving framework that improves inference efficiency by repurposing underutilized smaller models as remote drafters for heavily loaded large models through parallel speculative decoding. The system achieves up to 2.28× speedup on large models like Qwen3-235B with minimal interference to the smaller models' native workloads.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

PARD-2 introduces a dual-mode speculative decoding framework that accelerates large language model inference by up to 6.94× through improved draft model training aligned with token acceptance rather than prediction accuracy. The advancement uses Confidence-Adaptive Token optimization to enable single draft models to operate in both target-dependent and target-independent modes, significantly outperforming existing methods like EAGLE-3.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 12 · 7/10

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

Researchers introduce BubbleSpec, a framework that optimizes reinforcement learning training for large language models by exploiting idle GPU time during synchronous rollouts. The method uses speculative decoding to pre-generate draft outputs during wait periods, achieving a 50% reduction in decoding steps and up to 1.8x throughput improvement while maintaining mathematical exactness.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

CASCADE: Context-Aware Relaxation for Speculative Image Decoding

Researchers have developed CASCADE, a novel speculative decoding technique that accelerates autoregressive image generation by up to 3.6x through identifying and exploiting redundancies in neural network representations. The method addresses a critical bottleneck in image synthesis by reducing draft token rejection rates without requiring model retraining, advancing the efficiency of text-to-image AI systems.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

Parallel Prefix Verification for Speculative Generation

Researchers introduce PARSE, a speculative generation framework that accelerates large language model inference by verifying multiple prefix candidates in parallel rather than sequentially. The method achieves 1.25x to 4.3x throughput improvements over baseline models and up to 4.5x gains when combined with existing techniques like EAGLE-3, with minimal accuracy loss.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

SpecBranch introduces a novel speculative decoding framework that leverages branch parallelism to accelerate large language model inference, achieving 1.8x to 4.5x speedups over standard auto-regressive decoding. The technique addresses serialization bottlenecks in existing speculative decoding methods by implementing parallel drafting branches with adaptive token lengths and rollback-aware orchestration.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

Researchers introduce SpecMoE, a new inference system that applies speculative decoding to Mixture-of-Experts language models to improve computational efficiency. The approach achieves up to 4.30x throughput improvements while reducing memory and bandwidth requirements without requiring model retraining.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Researchers introduce SPEED-Bench, a comprehensive benchmark suite for evaluating Speculative Decoding (SD) techniques that accelerate LLM inference. The benchmark addresses critical gaps in existing evaluation methods by offering diverse semantic domains, throughput-oriented testing across multiple concurrency levels, and integration with production systems like vLLM and TensorRT-LLM, enabling more accurate real-world performance measurement.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

When Drafts Evolve: Speculative Decoding Meets Online Learning

Researchers introduce OnlineSpec, a framework that uses online learning to continuously improve draft models in speculative decoding for large language model inference acceleration. The approach leverages verification feedback to evolve draft models dynamically, achieving up to 24% speedup improvements across seven benchmarks and three foundation models.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios

Researchers introduce MoE-SpAc, a new framework for efficient Mixture-of-Experts model inference on edge devices that achieves a 42% improvement over existing baselines. The system uses speculative decoding as a memory management tool and demonstrates a 4.04x average speedup across benchmarks.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation

Researchers introduce Efficient Draft Adaptation (EDA), a framework that reduces the cost of adapting draft models for speculative decoding when target LLMs are fine-tuned. EDA combines a decoupled architecture, data regeneration, and sample selection, and requires substantially fewer training resources than full retraining.

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

SpecFuse: Ensembling Large Language Models via Next-Segment Prediction

Researchers introduce SpecEM, a new training-free framework for ensembling large language models that dynamically adjusts each model's contribution based on real-time performance. The system uses speculative decoding principles and online feedback mechanisms to improve collaboration between different LLMs, showing consistent performance improvements across multiple benchmark datasets.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving

Nightjar is a new adaptive speculative decoding framework for large language models that dynamically adjusts to system load conditions. It achieves 27.29% higher throughput and up to 20.18% lower latency by intelligently enabling or disabling speculation based on workload demands.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding

Researchers introduce Group Tree Optimization (GTO), a new training method that improves speculative decoding for large language models by aligning draft model training with actual decoding policies. GTO achieves 7.4% better acceptance length and 7.7% additional speedup over existing state-of-the-art methods across multiple benchmarks and LLMs.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding

Researchers have developed Hierarchical Speculative Decoding (HSD), a lossless method that speeds up LLM inference by addressing the joint intractability problem in verification. The technique shows over 12% performance gains when integrated with existing frameworks like EAGLE-3, setting a new state of the art in efficiency.
