y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#llm-inference News & Analysis

48 articles tagged with #llm-inference. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

48 articles
AIBullisharXiv – CS AI · 4d ago7/10
🧠

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

Researchers propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for compressing KV caches in large language models using quaternion mathematics. The technique achieves 5x compression with minimal perplexity loss, matching full-precision performance at ~5 bits while outperforming existing quantization methods across five major model architectures.

🧠 Llama
AIBullisharXiv – CS AI · 4d ago7/10
🧠

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Researchers introduce Thinking as Compression (TaC), a novel approach that leverages language model reasoning traces as a natural context compression mechanism without requiring dedicated compression modules. The method demonstrates significant performance gains, outperforming existing compression baselines by 17-23% across long-context QA benchmarks at high compression ratios.

AIBullisharXiv – CS AI · 4d ago7/10
🧠

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Researchers present a systematic study of Attention-FFN Disaggregation (AFD), a technique that separates attention and expert layers across different GPU groups to optimize inference serving for Mixture-of-Experts language models. The framework demonstrates that AFD enables 4k tokens/s throughput on DeepSeek-V3.2 under strict latency constraints where traditional disaggregation approaches fail, providing design principles for scaling LLM infrastructure.

AIBullisharXiv – CS AI · 5d ago7/10
🧠

HiSpec: Hierarchical Speculative Decoding for LLMs

Researchers introduce HiSpec, a hierarchical speculative decoding framework that accelerates large language model inference by using early-exit models for intermediate verification, achieving up to 2.01× throughput improvements without sacrificing accuracy.

AIBullisharXiv – CS AI · 5d ago7/10
🧠

Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

Researchers introduce Qrita, an efficient algorithm for Top-k and Top-p sampling in large language models that uses pivot-based truncation instead of sorting. The method achieves 1.4x throughput improvements with 50% less memory usage while maintaining identical output to traditional sorting approaches, and has been adopted as the default sampler in vLLM.

AIBullisharXiv – CS AI · May 127/10
🧠

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

PARD-2 introduces a dual-mode speculative decoding framework that accelerates large language model inference by up to 6.94× through improved draft model training aligned with token acceptance rather than prediction accuracy. The advancement uses Confidence-Adaptive Token optimization to enable single draft models to operate in both target-dependent and target-independent modes, significantly outperforming existing methods like EAGLE-3.

🧠 Llama
AIBullisharXiv – CS AI · May 127/10
🧠

Entropy-informed Decoding: Adaptive Information-Driven Branching

Researchers introduce Entropy-informed Decoding (EDEN), a novel framework that optimizes how large language models generate text by dynamically adjusting computational effort based on output uncertainty. The method matches or exceeds the performance of traditional beam search while using fewer computational expansions, particularly improving results on complex tasks like mathematical reasoning and code generation.

AIBullisharXiv – CS AI · May 127/10
🧠

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

SPECTRE is a new LLM serving framework that improves inference efficiency by repurposing underutilized smaller models as remote drafters for heavily-loaded large models through parallel speculative decoding. The system achieves up to 2.28× speedup on large models like Qwen3-235B while maintaining minimal interference to smaller models' native workloads.

AIBullisharXiv – CS AI · May 117/10
🧠

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

Dooly is a new profiling framework that optimizes LLM inference simulation by reducing redundant profiling across different hardware and software configurations. By leveraging structural insights about operation dependencies, the system cuts profiling costs by over 56% while maintaining simulation accuracy within 5-8% error margins, addressing a critical bottleneck in LLM deployment optimization.

AIBullisharXiv – CS AI · May 117/10
🧠

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

Researchers introduce MISA, an optimization technique that reduces computational costs in DeepSeek's sparse attention mechanism for large language models by treating indexer heads as a mixture-of-experts system. The method achieves 3.82x speedup on GPU inference while maintaining performance across benchmarks, addressing a key bottleneck in long-context LLM processing.

🏢 Nvidia
AINeutralarXiv – CS AI · May 117/10
🧠

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

Researchers introduce KVFundaBench to expose a critical gap in KV cache compression evaluation: while retrieval tasks remain robust under compression, reasoning tasks degrade severely due to disrupted Chain-of-Thought coherence. They propose ShotKV, which preserves semantic integrity by treating few-shot examples as indivisible units, achieving 9-18% accuracy improvements on long-context tasks while reducing latency by 11%.

AIBullisharXiv – CS AI · May 97/10
🧠

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

Researchers propose sparse prefix caching, a novel optimization technique for hybrid and recurrent LLM serving that stores exact states at checkpoint positions rather than caching entire token histories. The method uses dynamic programming to determine optimal checkpoint placement and demonstrates superior performance on real-world datasets while using fewer checkpoints than existing dense caching approaches.

AIBullisharXiv – CS AI · May 97/10
🧠

Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks

Litespark-Inference introduces custom SIMD kernels that enable efficient large language model inference on standard consumer CPUs by exploiting ternary neural networks (weights constrained to -1, 0, +1), replacing floating-point multiplication with simple addition and subtraction. The solution achieves dramatic performance improvements—9.2x faster latency and 52x higher throughput on Apple Silicon—making AI workloads accessible to billions of underutilized personal computers.

AIBullisharXiv – CS AI · May 77/10
🧠

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

Researchers introduce a queueing-theoretic framework that models LLM inference stability by accounting for both computational and GPU memory constraints from KV caching. The framework derives conditions for service stability and enables operators to calculate optimal cluster sizes for efficient GPU provisioning, with experimental validation showing predictions within 10% accuracy.

AIBullisharXiv – CS AI · May 77/10
🧠

Parallel Prefix Verification for Speculative Generation

Researchers introduce PARSE, a speculative generation framework that accelerates large language model inference by verifying multiple prefix candidates in parallel rather than sequentially. The method achieves 1.25x to 4.3x throughput improvements over baseline models and up to 4.5x gains when combined with existing techniques like EAGLE-3, with minimal accuracy loss.

AIBullisharXiv – CS AI · May 47/10
🧠

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

SAGA is a new distributed GPU scheduler that treats entire AI agent workflows as atomic units rather than individual inference calls, reducing task completion time by 1.64x compared to existing solutions. The system achieves this through workflow-aware scheduling, KV cache optimization, and fairness mechanisms, though with a tradeoff of 30% lower peak throughput suitable for latency-sensitive interactive deployments.

🏢 Meta
AIBullisharXiv – CS AI · Apr 207/10
🧠

OjaKV: Context-Aware Online Low-Rank KV Cache Compression

OjaKV introduces a novel framework for compressing key-value caches in large language models through online low-rank projection, addressing a critical memory bottleneck in long-context inference. The method combines selective full-rank storage for important tokens with adaptive compression for intermediate tokens, maintaining accuracy while reducing memory consumption without requiring model fine-tuning.

🧠 Llama
AIBullisharXiv – CS AI · Apr 207/10
🧠

Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective

Researchers present a CPU-centric analysis of agentic AI systems, identifying bottlenecks in heterogeneous CPU-GPU architectures where most orchestration occurs on CPU. Two optimization methods—CPU-Aware Overlapped Micro-Batching and Mixed Agentic Scheduling—demonstrate significant latency reductions, addressing a critical infrastructure gap as agentic AI moves toward production deployment.

AIBullisharXiv – CS AI · Apr 157/10
🧠

SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

SpecBranch introduces a novel speculative decoding framework that leverages branch parallelism to accelerate large language model inference, achieving 1.8x to 4.5x speedups over standard auto-regressive decoding. The technique addresses serialization bottlenecks in existing speculative decoding methods by implementing parallel drafting branches with adaptive token lengths and rollback-aware orchestration.

AIBullisharXiv – CS AI · Apr 147/10
🧠

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Researchers introduce SPEED-Bench, a comprehensive benchmark suite for evaluating Speculative Decoding (SD) techniques that accelerate LLM inference. The benchmark addresses critical gaps in existing evaluation methods by offering diverse semantic domains, throughput-oriented testing across multiple concurrency levels, and integration with production systems like vLLM and TensorRT-LLM, enabling more accurate real-world performance measurement.

AIBullisharXiv – CS AI · Apr 147/10
🧠

IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.

AIBullisharXiv – CS AI · Apr 137/10
🧠

CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference

Researchers introduce CSAttention, a training-free sparse attention method that accelerates LLM inference by 4.6x for long-context applications. The technique optimizes the offline-prefill/online-decode workflow by precomputing query-centric lookup tables, enabling faster token generation without sacrificing accuracy even at 95% sparsity levels.

AIBullisharXiv – CS AI · Apr 107/10
🧠

Inference-Time Code Selection via Symbolic Equivalence Partitioning

Researchers propose Symbolic Equivalence Partitioning, a novel inference-time selection method for code generation that uses symbolic execution and SMT constraints to identify correct solutions without expensive external verifiers. The approach improves accuracy on HumanEval+ by 10.3% and on LiveCodeBench by 17.1% at N=10 without requiring additional LLM inference.

AIBullisharXiv – CS AI · Apr 107/10
🧠

Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple

Researchers present a new approach to General Matrix Multiplication (GEMM) using Space Filling Curves that automatically optimizes data movement across memory hierarchies without requiring platform-specific tuning. The method achieves up to 5.5x speedups over vendor libraries and demonstrates significant performance gains in LLM inference and distributed computing applications.

Page 1 of 2Next →