AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for compressing KV caches in large language models using quaternion mathematics. The technique achieves 5x compression with minimal perplexity loss, matching full-precision performance at ~5 bits while outperforming existing quantization methods across five major model architectures.
🧠 Llama
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce Thinking as Compression (TaC), a novel approach that leverages language model reasoning traces as a natural context compression mechanism without requiring dedicated compression modules. The method demonstrates significant performance gains, outperforming existing compression baselines by 17-23% across long-context QA benchmarks at high compression ratios.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers present a systematic study of Attention-FFN Disaggregation (AFD), a technique that separates attention and expert layers across different GPU groups to optimize inference serving for Mixture-of-Experts language models. The framework demonstrates that AFD enables 4k tokens/s throughput on DeepSeek-V3.2 under strict latency constraints where traditional disaggregation approaches fail, providing design principles for scaling LLM infrastructure.
AI · Bullish · arXiv – CS AI · 5d ago · 7/10
🧠Researchers introduce HiSpec, a hierarchical speculative decoding framework that accelerates large language model inference by using early-exit models for intermediate verification, achieving up to 2.01× throughput improvements without sacrificing accuracy.
AI · Bullish · arXiv – CS AI · 5d ago · 7/10
🧠Researchers introduce Qrita, an efficient algorithm for Top-k and Top-p sampling in large language models that uses pivot-based truncation instead of sorting. The method achieves 1.4x throughput improvements with 50% less memory usage while maintaining identical output to traditional sorting approaches, and has been adopted as the default sampler in vLLM.
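To illustrate the general idea of replacing a full sort with pivot-based truncation, here is a minimal sketch of Top-k filtering built on a quickselect-style partition. It is a generic illustration only, not Qrita's algorithm or vLLM's sampler; the function names `top_k_threshold` and `top_k_mask` are hypothetical.

```python
import random


def top_k_threshold(logits, k, seed=0):
    """Return the k-th largest logit using pivot partitioning (no full sort).

    Assumes 1 <= k <= len(logits).
    """
    rng = random.Random(seed)
    candidates = list(logits)
    need = k  # how many of the largest remaining values we still have to pass
    while True:
        pivot = rng.choice(candidates)
        greater = [v for v in candidates if v > pivot]
        equal_count = sum(1 for v in candidates if v == pivot)
        if need <= len(greater):
            # The k-th largest lies strictly above the pivot.
            candidates = greater
        elif need <= len(greater) + equal_count:
            # The pivot itself is the k-th largest value.
            return pivot
        else:
            # Discard the pivot and everything above it, then keep searching below.
            need -= len(greater) + equal_count
            candidates = [v for v in candidates if v < pivot]


def top_k_mask(logits, k):
    """Keep logits at or above the k-th largest; mask the rest to -inf.

    Ties at the threshold are all kept, so more than k entries can survive.
    """
    threshold = top_k_threshold(logits, k)
    return [v if v >= threshold else float("-inf") for v in logits]


logits = [2.0, -1.0, 0.5, 3.0, 1.5]
masked = top_k_mask(logits, k=2)
```

Each partition pass discards part of the candidate set, so the expected work is linear in the vocabulary size rather than the O(n log n) of a full sort, while the surviving set matches what a sort-based filter would keep.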
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠PARD-2 introduces a dual-mode speculative decoding framework that accelerates large language model inference by up to 6.94× by training draft models for token acceptance rather than prediction accuracy. It uses Confidence-Adaptive Token optimization so that a single draft model can operate in both target-dependent and target-independent modes, significantly outperforming existing methods like EAGLE-3.
🧠 Llama
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce Entropy-informed Decoding (EDEN), a novel framework that optimizes how large language models generate text by dynamically adjusting computational effort based on output uncertainty. The method matches or exceeds the performance of traditional beam search while using fewer computational expansions, particularly improving results on complex tasks like mathematical reasoning and code generation.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠SPECTRE is a new LLM serving framework that improves inference efficiency by repurposing underutilized smaller models as remote drafters for heavily loaded large models through parallel speculative decoding. The system achieves up to 2.28× speedup on large models like Qwen3-235B with minimal interference with the smaller models' native workloads.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Dooly is a new profiling framework that optimizes LLM inference simulation by reducing redundant profiling across different hardware and software configurations. By leveraging structural insights about operation dependencies, the system cuts profiling costs by over 56% while maintaining simulation accuracy within 5-8% error margins, addressing a critical bottleneck in LLM deployment optimization.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce MISA, an optimization technique that reduces computational costs in DeepSeek's sparse attention mechanism for large language models by treating indexer heads as a mixture-of-experts system. The method achieves 3.82x speedup on GPU inference while maintaining performance across benchmarks, addressing a key bottleneck in long-context LLM processing.
🏢 Nvidia
AI · Neutral · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce KVFundaBench to expose a critical gap in KV cache compression evaluation: while retrieval tasks remain robust under compression, reasoning tasks degrade severely due to disrupted Chain-of-Thought coherence. They propose ShotKV, which preserves semantic integrity by treating few-shot examples as indivisible units, achieving 9-18% accuracy improvements on long-context tasks while reducing latency by 11%.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers propose sparse prefix caching, a novel optimization technique for hybrid and recurrent LLM serving that stores exact states at checkpoint positions rather than caching entire token histories. The method uses dynamic programming to determine optimal checkpoint placement and demonstrates superior performance on real-world datasets while using fewer checkpoints than existing dense caching approaches.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Litespark-Inference introduces custom SIMD kernels that enable efficient large language model inference on standard consumer CPUs by exploiting ternary neural networks (weights constrained to -1, 0, +1), replacing floating-point multiplication with simple addition and subtraction. On Apple Silicon, the solution achieves 9.2x lower latency and 52x higher throughput, making AI workloads practical on billions of underutilized personal computers.
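The claim that ternary weights remove the need for multiplication can be seen in a small scalar sketch. This is an illustration of the arithmetic only, under the assumption that weights are stored as plain integers in {-1, 0, +1}; it is not Litespark-Inference's SIMD kernel, and `ternary_matvec` is a hypothetical name.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, +1}) by a vector
    using only additions and subtractions."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: the activation contributes nothing and is skipped
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
y = ternary_matvec(W, x)  # [2.0, 0.0]
```

A production kernel would pack the weights into a few bits each and process many activations per instruction; the sketch only shows why no multiply is required.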
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduce a queueing-theoretic framework that models LLM inference stability by accounting for both computational and GPU memory constraints from KV caching. The framework derives conditions for service stability and enables operators to calculate optimal cluster sizes for efficient GPU provisioning, with experimental validation showing predictions accurate to within 10%.
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduce PARSE, a speculative generation framework that accelerates large language model inference by verifying multiple prefix candidates in parallel rather than sequentially. The method achieves 1.25x to 4.3x throughput improvements over baseline models and up to 4.5x gains when combined with existing techniques like EAGLE-3, with minimal accuracy loss.
AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠SAGA is a new distributed GPU scheduler that treats entire AI agent workflows as atomic units rather than individual inference calls, reducing task completion time by a factor of 1.64 compared to existing solutions. The system achieves this through workflow-aware scheduling, KV cache optimization, and fairness mechanisms, at the cost of 30% lower peak throughput, a tradeoff suited to latency-sensitive interactive deployments.
🏢 Meta
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠OjaKV introduces a novel framework for compressing key-value caches in large language models through online low-rank projection, addressing a critical memory bottleneck in long-context inference. The method combines selective full-rank storage for important tokens with adaptive compression for intermediate tokens, maintaining accuracy while reducing memory consumption without requiring model fine-tuning.
🧠 Llama
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers present a CPU-centric analysis of agentic AI systems, identifying bottlenecks in heterogeneous CPU-GPU architectures where most orchestration occurs on CPU. Two optimization methods—CPU-Aware Overlapped Micro-Batching and Mixed Agentic Scheduling—demonstrate significant latency reductions, addressing a critical infrastructure gap as agentic AI moves toward production deployment.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠SpecBranch introduces a novel speculative decoding framework that leverages branch parallelism to accelerate large language model inference, achieving 1.8x to 4.5x speedups over standard auto-regressive decoding. The technique addresses serialization bottlenecks in existing speculative decoding methods by implementing parallel drafting branches with adaptive token lengths and rollback-aware orchestration.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce SpecMoE, a new inference system that applies speculative decoding to Mixture-of-Experts language models to improve computational efficiency. The approach achieves up to 4.30x throughput improvements while reducing memory and bandwidth requirements without requiring model retraining.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce SPEED-Bench, a comprehensive benchmark suite for evaluating Speculative Decoding (SD) techniques that accelerate LLM inference. The benchmark addresses critical gaps in existing evaluation methods by offering diverse semantic domains, throughput-oriented testing across multiple concurrency levels, and integration with production systems like vLLM and TensorRT-LLM, enabling more accurate real-world performance measurement.
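For context on what such benchmarks measure, the basic draft-and-verify loop behind speculative decoding can be sketched as below. This is a simplified greedy variant under toy assumptions, not the method of any system listed here: `target_next` and `draft_next` are hypothetical callables that map a token list to the next token, and verification is written as a loop, whereas a real system would score all drafted positions in a single target-model pass.

```python
def speculative_greedy_decode(target_next, draft_next, prompt,
                              max_new_tokens, draft_len=4):
    """Greedy speculative decoding: a cheap draft model proposes tokens and
    the target model keeps the longest prefix it agrees with."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # Draft phase: propose up to draft_len tokens autoregressively.
        draft = []
        for _ in range(min(draft_len, limit - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # Verify phase: accept drafted tokens until the first disagreement.
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if proposed != expected:
                tokens.extend(draft[:i])  # keep the agreed prefix
                tokens.append(expected)   # replace the rejected token
                break
        else:
            tokens.extend(draft)          # every drafted token was accepted
    return tokens


def greedy_decode(target_next, prompt, max_new_tokens):
    """Reference: plain autoregressive greedy decoding with the target."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


# Toy models over tokens 0..4: the target counts upward modulo 5;
# the draft agrees except that it guesses 0 instead of 4 after a 3.
def target_next(ctx):
    return (ctx[-1] + 1) % 5


def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 5


spec = speculative_greedy_decode(target_next, draft_next, [0], 8)
ref = greedy_decode(target_next, [0], 8)
```

In this greedy form the output is identical to decoding with the target alone; any speedup comes from how many drafted tokens are accepted per verification step, which is the acceptance behavior that throughput-oriented benchmarks try to capture.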
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers introduce CSAttention, a training-free sparse attention method that accelerates LLM inference by 4.6x for long-context applications. The technique optimizes the offline-prefill/online-decode workflow by precomputing query-centric lookup tables, enabling faster token generation without sacrificing accuracy even at 95% sparsity levels.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers propose Symbolic Equivalence Partitioning, a novel inference-time selection method for code generation that uses symbolic execution and SMT constraints to identify correct solutions without expensive external verifiers. The approach improves accuracy on HumanEval+ by 10.3% and on LiveCodeBench by 17.1% at N=10 without requiring additional LLM inference.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers present a new approach to General Matrix Multiplication (GEMM) using Space Filling Curves that automatically optimizes data movement across memory hierarchies without requiring platform-specific tuning. The method achieves up to 5.5x speedups over vendor libraries and demonstrates significant performance gains in LLM inference and distributed computing applications.