18 articles tagged with #llm-inference. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AIBullisharXiv β CS AI Β· 2d ago7/10
π§ SpecBranch introduces a novel speculative decoding framework that leverages branch parallelism to accelerate large language model inference, achieving 1.8x to 4.5x speedups over standard auto-regressive decoding. The technique addresses serialization bottlenecks in existing speculative decoding methods by implementing parallel drafting branches with adaptive token lengths and rollback-aware orchestration.
AIBullisharXiv β CS AI Β· 3d ago7/10
π§ Researchers introduce SpecMoE, a new inference system that applies speculative decoding to Mixture-of-Experts language models to improve computational efficiency. The approach achieves up to 4.30x throughput improvements while reducing memory and bandwidth requirements without requiring model retraining.
AIBullisharXiv β CS AI Β· 3d ago7/10
π§ Researchers introduce SPEED-Bench, a comprehensive benchmark suite for evaluating Speculative Decoding (SD) techniques that accelerate LLM inference. The benchmark addresses critical gaps in existing evaluation methods by offering diverse semantic domains, throughput-oriented testing across multiple concurrency levels, and integration with production systems like vLLM and TensorRT-LLM, enabling more accurate real-world performance measurement.
AIBullisharXiv β CS AI Β· 3d ago7/10
π§ IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.
AIBullisharXiv β CS AI Β· 4d ago7/10
π§ Researchers introduce CSAttention, a training-free sparse attention method that accelerates LLM inference by 4.6x for long-context applications. The technique optimizes the offline-prefill/online-decode workflow by precomputing query-centric lookup tables, enabling faster token generation without sacrificing accuracy even at 95% sparsity levels.
AIBullisharXiv β CS AI Β· Apr 107/10
π§ Researchers propose Symbolic Equivalence Partitioning, a novel inference-time selection method for code generation that uses symbolic execution and SMT constraints to identify correct solutions without expensive external verifiers. The approach improves accuracy on HumanEval+ by 10.3% and on LiveCodeBench by 17.1% at N=10 without requiring additional LLM inference.
AIBullisharXiv β CS AI Β· Apr 107/10
π§ Researchers present a new approach to General Matrix Multiplication (GEMM) using Space Filling Curves that automatically optimizes data movement across memory hierarchies without requiring platform-specific tuning. The method achieves up to 5.5x speedups over vendor libraries and demonstrates significant performance gains in LLM inference and distributed computing applications.
AIBullisharXiv β CS AI Β· Apr 67/10
π§ Researchers analyzed data movement patterns in large-scale Mixture of Experts (MoE) language models (200B-1000B parameters) to optimize inference performance. Their findings led to architectural modifications achieving 6.6x speedups on wafer-scale GPUs and up to 1.25x improvements on existing systems through better expert placement algorithms.
π’ Hugging Face
AIBullisharXiv β CS AI Β· Mar 277/10
π§ Researchers developed Model2Kernel, a system that automatically detects memory safety bugs in CUDA kernels used for large language model inference. The system discovered 353 previously unknown bugs across popular platforms like vLLM and Hugging Face with only nine false positives.
π’ Hugging Face
AIBullisharXiv β CS AI Β· Mar 167/10
π§ Researchers introduce OnlineSpec, a framework that uses online learning to continuously improve draft models in speculative decoding for large language model inference acceleration. The approach leverages verification feedback to evolve draft models dynamically, achieving up to 24% speedup improvements across seven benchmarks and three foundation models.
AINeutralarXiv β CS AI Β· Mar 127/10
π§ Researchers conducted comprehensive benchmarks of LLM inference on AMD Instinct MI325X GPUs, testing models from 235B to 1 trillion parameters. The study reveals that architecture-aware optimization is critical, with different model types requiring specific configurations for optimal performance on AMD hardware.
π§ Llama
AIBullisharXiv β CS AI Β· Mar 117/10
π§ Researchers have developed Zipage, a new high-concurrency inference engine for large language models that uses Compressed PagedAttention to solve memory bottlenecks. The system achieves 95% performance of full KV inference engines while delivering over 2.1x speedup on mathematical reasoning tasks.
AIBullisharXiv β CS AI Β· Mar 67/10
π§ Researchers developed a memory management system for multi-agent AI systems on edge devices that reduces memory requirements by 4x through 4-bit quantization and eliminates redundant computation by persisting KV caches to disk. The solution reduces time-to-first-token by up to 136x while maintaining minimal impact on model quality across three major language model architectures.
π’ Perplexityπ§ Llama
AI Γ CryptoBullisharXiv β CS AI Β· Mar 56/10
π€Researchers developed a multi-dimensional quality scoring framework for decentralized LLM inference networks that evaluates output quality across multiple dimensions including semantic quality and query-output alignment. The framework integrates with Proof of Quality (PoQ) mechanisms to provide better incentive alignment and defense against adversarial attacks in distributed AI compute networks.
AINeutralarXiv β CS AI Β· 3d ago6/10
π§ ConfigSpec introduces a profiling-based framework for optimizing distributed LLM inference across edge-cloud systems using speculative decoding. The research reveals that no single configuration can simultaneously optimize throughput, cost efficiency, and energy efficiencyβrequiring dynamic, device-aware configuration selection rather than fixed deployments.
AINeutralarXiv β CS AI Β· 3d ago6/10
π§ A-IO addresses critical memory-bound bottlenecks in LLM deployment on NPU platforms like Ascend 910B by tackling the 'Model Scaling Paradox' and limitations of current speculative decoding techniques. The research reveals that static single-model deployment strategies and kernel synchronization overhead significantly constrain inference performance on heterogeneous accelerators.
AIBullisharXiv β CS AI Β· Mar 176/10
π§ Researchers propose Outcome-Aware Tool Selection (OATS), a method to improve tool selection in LLM inference gateways by interpolating tool embeddings toward successful query centroids without adding latency. The approach improves tool selection accuracy on benchmarks while maintaining single-digit millisecond CPU processing times.
AIBullisharXiv β CS AI Β· Mar 37/106
π§ Researchers have developed AloePri, the first privacy-preserving LLM inference method designed for industrial applications. The system uses collaborative obfuscation to protect input/output data while maintaining 96.5-100% accuracy and resisting state-of-the-art attacks, successfully tested on a 671B parameter model.