AI · Bullish · arXiv – CS AI · Apr 6 · 7/10
🧠Researchers analyzed data movement patterns in large-scale Mixture of Experts (MoE) language models (200B-1000B parameters) to optimize inference performance. Their findings led to architectural modifications achieving 6.6x speedups on wafer-scale GPUs and up to 1.25x improvements on existing systems through better expert placement algorithms.
🏢 Hugging Face
AI · Bullish · arXiv – CS AI · Mar 27 · 7/10
🧠Researchers developed Model2Kernel, a system that automatically detects memory safety bugs in CUDA kernels used for large language model inference. The system discovered 353 previously unknown bugs across popular platforms like vLLM and Hugging Face with only nine false positives.
🏢 Hugging Face
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers introduce OnlineSpec, a framework that uses online learning to continuously improve draft models in speculative decoding for large language model inference acceleration. The approach uses verification feedback to evolve draft models dynamically, achieving speedups of up to 24% across seven benchmarks and three foundation models.
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers conducted comprehensive benchmarks of LLM inference on AMD Instinct MI325X GPUs, testing models from 235B to 1 trillion parameters. The study reveals that architecture-aware optimization is critical, with different model types requiring specific configurations for optimal performance on AMD hardware.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed Zipage, a new high-concurrency inference engine for large language models that uses Compressed PagedAttention to solve memory bottlenecks. The system achieves 95% of the performance of full-KV inference engines while delivering over 2.1x speedup on mathematical reasoning tasks.
AI · Bullish · arXiv – CS AI · Mar 6 · 7/10
🧠Researchers developed a memory management system for multi-agent AI on edge devices that reduces memory requirements by 4x through 4-bit quantization and eliminates redundant computation by persisting KV caches to disk. The solution reduces time-to-first-token by up to 136x with minimal impact on model quality across three major language model architectures.
🏢 Perplexity · 🧠 Llama
AI × Crypto · Bullish · arXiv – CS AI · Mar 5 · 6/10
🤖Researchers developed a multi-dimensional quality scoring framework for decentralized LLM inference networks that evaluates outputs on dimensions such as semantic quality and query-output alignment. The framework integrates with Proof of Quality (PoQ) mechanisms to provide better incentive alignment and defense against adversarial attacks in distributed AI compute networks.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠UniScale introduces a unified framework that combines model routing and test-time scaling to optimize large language model inference, balancing quality and computational cost. The system uses online learning via contextual multi-armed bandits to adapt inference policies dynamically, achieving fine-grained performance improvements over existing decoupled approaches.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠A technical study reveals that batch-1 LLM inference on edge devices and robots is constrained by GPU launch overhead rather than memory bandwidth alone, with faster GPUs like the H100 achieving only 27% of theoretical peak bandwidth compared to 81% on slower L4 GPUs. Quantization techniques show inconsistent speedups, suggesting that hardware improvements don't automatically translate to latency gains without addressing software bottlenecks in physical AI deployments.
$BNB · $ADA · 🏢 Nvidia
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers present methods for improving how large language models generate diverse pools of creative ideas during parallel inference without relying on seed examples. Their findings show that semantic direction stratification—organizing generations across different semantic directions with a single planning call—outperforms anchor-dependent baselines while maintaining quality and computational efficiency.
AI · Bullish · arXiv – CS AI · 5d ago · 6/10
🧠Researchers propose LaneRoPE, a novel technique that enables multiple parallel language model sequences to coordinate and share information during generation, improving reasoning accuracy without significant architectural changes or inference overhead.
AI · Bullish · arXiv – CS AI · 5d ago · 6/10
🧠EvoSpec introduces a dynamic framework for accelerating Large Language Model inference through real-time adaptation of vocabulary and parameters in speculative decoding. By addressing the vocabulary bottleneck that causes performance degradation in specialized domains, EvoSpec achieves a 1.13x speedup over static baselines while reducing memory overhead by 27%.
AI · Bearish · arXiv – CS AI · 5d ago · 6/10
🧠A research study reveals that NPUs (Neural Processing Units) on mobile devices don't consistently accelerate LLM inference as expected, with CPUs outperforming NPUs on compute-intensive prefill operations and NPUs providing only marginal speedups on memory-bound decode stages. The findings challenge assumptions about heterogeneous mobile computing and suggest current NPU designs require architectural improvements for on-device AI workloads.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers propose SelfJudge, a new method for accelerating large language model inference through self-supervised judge verification that eliminates the need for human annotations. The approach trains verifiers to assess whether token substitutions preserve semantic meaning, enabling faster inference without sacrificing accuracy across diverse NLP tasks.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers propose a RAG-based framework leveraging Large Language Models to detect and mitigate Carpet-Bombing DDoS attacks in Software-Defined Networks. The system achieves high detection accuracy without traditional supervised training, addressing a critical vulnerability in SDN's centralized architecture through intelligent traffic behavior classification.
AI · Bullish · arXiv – CS AI · 6d ago · 6/10
🧠Researchers propose PIPO (Pair-In, Pair-Out), a novel technique that combines input compression and multi-token prediction to accelerate large language model inference. The method eliminates expensive verification steps while achieving up to a 2.64x speedup in first-token latency and demonstrating significant improvements on reasoning benchmarks.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers present a diagnostic framework for evaluating KV cache eviction selectors in large language models, identifying three failure modes and demonstrating that value-aware ranking combined with evidence recovery achieves 72.6% accuracy on positive-margin test cases. The work addresses a critical bottleneck in long-context LLM inference by revealing why compression strategies succeed or fail.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers present KV-RM, a runtime optimization that manages KV-cache memory movement in static-graph LLM decoders, achieving better throughput and reduced latency variability without sacrificing the predictability benefits of static graph execution. The approach decouples logical KV histories from physical storage through a block pager and merge-staged transport mechanism, demonstrating practical improvements on multi-GPU systems.
🏢 Nvidia
AI · Bullish · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose a reinforcement learning-based policy for routing intermediate reasoning steps across language models of varying sizes, reducing inference costs while maintaining accuracy on math benchmarks. The method uses threshold calibration to balance performance and efficiency without requiring large process reward models, outperforming handcrafted routing strategies.
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠A technical study comparing Nvidia and Apple Silicon for running large language models locally reveals fundamental architectural trade-offs: Nvidia achieves higher throughput through specialized quantization but faces memory constraints requiring aggressive model compression, while Apple's unified memory architecture scales more efficiently with superior energy performance. The research highlights ecosystem fragmentation as a major barrier for consumer adoption of datacenter-scale AI inference.
🏢 Nvidia
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce PAD-Rec, a lightweight module that optimizes speculative decoding for LLM-based recommendation systems by incorporating position-aware embeddings. The approach achieves up to 3.1x speedup in inference while preserving recommendation quality, addressing the latency bottleneck in generative list-wise recommendations.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠ConfigSpec introduces a profiling-based framework for optimizing distributed LLM inference across edge-cloud systems using speculative decoding. The research reveals that no single configuration can simultaneously optimize throughput, cost efficiency, and energy efficiency—requiring dynamic, device-aware configuration selection rather than fixed deployments.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠A-IO addresses critical memory-bound bottlenecks in LLM deployment on NPU platforms like Ascend 910B by tackling the 'Model Scaling Paradox' and limitations of current speculative decoding techniques. The research reveals that static single-model deployment strategies and kernel synchronization overhead significantly constrain inference performance on heterogeneous accelerators.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose Outcome-Aware Tool Selection (OATS), a method to improve tool selection in LLM inference gateways by interpolating tool embeddings toward successful query centroids without adding latency. The approach improves tool selection accuracy on benchmarks while maintaining single-digit millisecond CPU processing times.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 6
🧠Researchers have developed AloePri, the first privacy-preserving LLM inference method designed for industrial applications. The system uses collaborative obfuscation to protect input/output data while maintaining 96.5-100% accuracy and resisting state-of-the-art attacks, successfully tested on a 671B parameter model.