#inference-acceleration News & Analysis

21 articles tagged with #inference-acceleration. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

21 articles

AIBullisharXiv – CS AI · Jun 27/10

🧠

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

SENSE is a new retrieval-based speculative decoding method that accelerates LLM inference by using semantic embeddings instead of lexical matching to retrieve candidate tokens. The approach achieves up to 3.26x speedup while maintaining generation quality, outperforming existing methods on LLaMA and Qwen models.

AIBullisharXiv – CS AI · Jun 17/10

🧠

DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

Researchers propose DARTS, a novel approach to accelerate large language model reinforcement learning by reshaping the rollout distribution toward conciseness and certainty, reducing computational inefficiencies caused by long-tail response lengths. The method achieves up to 1.77x speedup through distribution-aware trajectory sampling without sacrificing model performance.

AIBullishCrypto Briefing · May 287/10

🧠

Groq raises $750M Series E at $6.9B valuation to scale AI inference hardware

Groq has secured $750 million in Series E funding at a $6.9 billion valuation, demonstrating sustained investor confidence in AI infrastructure development. The capital will support scaling of the company's specialized AI inference hardware, reflecting broader market momentum toward dedicated AI acceleration solutions.

AIBullisharXiv – CS AI · May 77/10

🧠

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

EdgeRazor introduces a lightweight quantization framework that compresses large language models to 1.88-bit precision while maintaining performance superior to existing 3-bit methods. The approach combines mixed-precision quantization with knowledge distillation and achieves up to 15.1× faster decoding with 80% storage reduction, requiring significantly lower computational training budgets than comparable techniques.

AIBullisharXiv – CS AI · Apr 157/10

🧠

OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

Researchers present OSC, a hardware-efficient framework that addresses the challenge of deploying Large Language Models with 4-bit quantization by intelligently separating activation outliers into a high-precision processing path while maintaining low-precision computation for standard values. The technique achieves 1.78x speedup over standard 8-bit approaches while limiting accuracy degradation to under 2.2% on state-of-the-art models.

AINeutralarXiv – CS AI · Mar 177/10

🧠

Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache

Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.

AIBullisharXiv – CS AI · Mar 177/10

🧠

Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning

Researchers introduce MARVAL, a distillation framework that accelerates masked auto-regressive diffusion models by compressing inference into a single step while enabling practical reinforcement learning applications. The method achieves 30x speedup on ImageNet with comparable quality, making RL post-training feasible for the first time with these models.

AIBullisharXiv – CS AI · Mar 127/10

🧠

ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping

Researchers developed ES-dLLM, a training-free inference acceleration framework that speeds up diffusion large language models by selectively skipping tokens in early layers based on importance scoring. The method achieves 5.6x to 16.8x speedup over vanilla implementations while maintaining generation quality, offering a promising alternative to autoregressive models.

🏢 Nvidia

AIBullisharXiv – CS AI · Mar 37/104

🧠

Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding

Researchers introduce Group Tree Optimization (GTO), a new training method that improves speculative decoding for large language models by aligning draft model training with actual decoding policies. GTO achieves 7.4% better acceptance length and 7.7% additional speedup over existing state-of-the-art methods across multiple benchmarks and LLMs.

AIBullisharXiv – CS AI · Feb 277/105

🧠

AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression

Tencent Hunyuan team introduces AngelSlim, a comprehensive toolkit for large model compression featuring quantization, speculative decoding, and pruning techniques. The toolkit includes the first industrially viable 2-bit large model (HY-1.8B-int2) and achieves 1.8x to 2.0x throughput gains while maintaining output quality.

AINeutralarXiv – CS AI · Jun 236/10

🧠

On the Expressive Power of Weight Quantization in Large Language Models

Researchers establish theoretical limits on weight quantization in large language models, identifying 1.58-bit as the minimum precision threshold before expressive collapse occurs. The study demonstrates that model performance degrades polynomially as quantization bits decrease, providing theoretical foundations for optimizing model compression and inference acceleration techniques.

AINeutralarXiv – CS AI · Jun 236/10

🧠

On the Sparsity-Storage-Accuracy Tradeoff in Parsimoniously Activated Dictionary Learning

Researchers present a theoretical framework for parsimoniously activated dictionary learning (PADL) that constrains the number of active dictionary atoms rather than using traditional element-wise sparsity. The work establishes a probabilistic interpretation of PADL, derives analytical tradeoffs between sparsity, storage, and accuracy, and demonstrates practical improvements in vision and vision-language model inference.

AINeutralarXiv – CS AI · Jun 96/10

🧠

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Researchers demonstrate that pretrained biomedical language models fail catastrophically at cross-domain discrimination, assigning high similarity scores (0.76-0.92) to unrelated concepts. They propose BODHI, a contrastive learning approach that improves domain separation 2.3x while maintaining correlation accuracy, and show that optimized inference achieves 133x latency reduction on specialized hardware.

AINeutralarXiv – CS AI · Jun 56/10

🧠

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

Researchers have identified a structural property in Multimodal Large Language Models called functional sparsity, discovering specialized attention heads (CoRe heads) that efficiently extract relevant visual information from complex contexts. This mechanistic insight demonstrates that only the top 5% of these heads are critical for multimodal reasoning, suggesting significant potential for model optimization and inference acceleration without performance loss.

AIBullisharXiv – CS AI · Jun 26/10

🧠

Consistency Deep Equilibrium Models

Researchers introduce Consistency Deep Equilibrium Models (C-DEQ), a novel framework that accelerates inference in Deep Equilibrium Models by leveraging consistency distillation to achieve 2-20× accuracy improvements under few-step inference budgets. This advancement addresses a critical bottleneck in DEQs—their slow inference speed—while maintaining the memory efficiency that makes them attractive for deep learning applications.

AIBullisharXiv – CS AI · Apr 66/10

🧠

Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

Researchers have developed Efficient3D, a framework that accelerates 3D Multimodal Large Language Models (MLLMs) while maintaining accuracy through adaptive token pruning. The system uses a Debiased Visual Token Importance Estimator and Adaptive Token Rebalancing to reduce computational overhead without sacrificing performance, showing +2.57% CIDEr improvement on benchmarks.

AIBullisharXiv – CS AI · Mar 176/10

🧠

OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism

Researchers propose OxyGen, a unified KV cache management system for Vision-Language-Action Models that enables efficient multi-task parallelism in embodied AI agents. The system achieves up to 3.7x speedup by sharing computational resources across tasks and eliminating redundant processing of shared observations.

AIBullisharXiv – CS AI · Mar 37/107

🧠

Attn-QAT: 4-Bit Attention With Quantization-Aware Training

Researchers introduce Attn-QAT, the first systematic approach to 4-bit quantization-aware training for attention mechanisms in AI models. The method enables stable FP4 computation on emerging GPUs and delivers up to 1.5x speedup on RTX 5090 while maintaining model quality across diffusion and language models.

AIBullisharXiv – CS AI · Mar 37/1010

🧠

TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading

TriMoE introduces a novel GPU-CPU-NDP architecture that optimizes large Mixture-of-Experts model inference by strategically mapping hot, warm, and cold experts to their optimal compute units. The system leverages AMX-enabled CPUs and includes bottleneck-aware scheduling, achieving up to 2.83x performance improvements over existing solutions.

AIBullisharXiv – CS AI · Mar 37/107

🧠

LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Researchers propose Likelihood-Free Policy Optimization (LFPO), a new framework for improving Diffusion Large Language Models by bypassing likelihood computation issues that plague existing methods. LFPO uses geometric velocity rectification to optimize denoising logits directly, achieving better performance on code and reasoning tasks while reducing inference time by 20%.

AIBullishHugging Face Blog · Jan 305/104

🧠

Accelerate StarCoder with 🤗 Optimum Intel on Xeon: Q8/Q4 and Speculative Decoding

The article discusses optimizing StarCoder performance on Intel Xeon processors using Hugging Face's Optimum Intel library. It covers quantization techniques (Q8/Q4) and speculative decoding methods to accelerate inference speed for the code generation model.