#llm-serving News & Analysis

9 articles tagged with #llm-serving. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

A Policy-Driven Runtime Layer for Agentic LLM Serving

Researchers propose a new runtime layer architecture for serving multi-agent LLM systems, positioned between application frameworks and inference engines. The approach enables unified policy management for cross-cutting concerns like caching and fairness, with CacheSage demonstrating 13-37% improvements in cache hit rates and 12-29% reductions in time-to-first-token latency.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast

FlashSVD v1.5 addresses a critical gap between theoretical and practical performance gains in SVD-compressed transformer inference, delivering up to 2.55x speedup through runtime optimization rather than algorithmic improvements alone. The work demonstrates that low-rank compression benefits require co-designed inference systems to translate parameter reduction into actual serving speed improvements.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Regulating Branch Parallelism in LLM Serving

Researchers introduce TAPER, an admission controller for parallel branch execution in LLM serving systems. It dynamically regulates how many concurrent decoding branches each request may run per step, weighing the throughput gain against the slowdown imposed on co-batched requests, and achieves a 1.77x improvement in goodput over conservative baselines.
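The idea of capping branches per step can be illustrated with a minimal sketch. This is not TAPER's actual policy, which the summary does not describe; it is a fixed-budget heuristic, and the names `admit_branches`, `slot_budget`, and `active_sequences` are hypothetical.

```python
def admit_branches(requested: int, active_sequences: int, slot_budget: int) -> int:
    """Decide how many parallel decoding branches one request may run this step.

    Assumption (not from the paper): the batch has a fixed number of sequence
    slots per step, and every co-batched sequence already occupies one slot.
    """
    if requested < 1:
        raise ValueError("a request needs at least one branch")
    free_slots = slot_budget - active_sequences
    # Grant at most the free slots, but never fewer than one branch,
    # so the request always makes progress.
    return max(1, min(requested, free_slots))


if __name__ == "__main__":
    # 16 slots, 10 already used by other requests: 6 of 8 branches admitted.
    print(admit_branches(requested=8, active_sequences=10, slot_budget=16))
```

A real controller would presumably adapt the limit to measured load instead of using a static budget.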

AI · Bullish · arXiv – CS AI · May 1 · 7/10

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

Researchers present a unified system for optimizing KV cache memory management in large-scale GPU inference, addressing three critical inefficiencies through architecture-aware sizing, multi-tier memory hierarchy spanning CPU to NVMe storage, and predictive eviction policies. The approach achieves 70-84% cache hit rates and projects 1.4-2.1x improvements in latency and 1.7-2.9x throughput gains while reducing costs by 47% compared to existing solutions.
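A multi-tier cache with demotion on eviction can be sketched roughly as follows. This is an illustration only: it uses plain LRU in place of the paper's predictive eviction policy, and the class name, tier names, and entry-count capacities are assumptions rather than details from the source.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy multi-tier cache: entries evicted from a fast tier drop to the next one."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.hits = 0
        self.lookups = 0

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the entry is dropped
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)
        if len(store) > cap:
            # LRU stand-in for a predictive policy: demote the oldest entry.
            old_key, old_value = store.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        for _, _, store in self.tiers:
            store.pop(key, None)  # remove any stale copy first
        self._insert(0, key, value)

    def get(self, key):
        self.lookups += 1
        for _, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self.hits += 1
                self._insert(0, key, value)  # promote to the fastest tier
                return value
        return None

    def tier_of(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0
```

With tiers such as `[("gpu", 1), ("cpu", 1), ("nvme", 1)]`, a fourth insert pushes the oldest entry out of the hierarchy entirely, while a hit in a slow tier promotes that entry back to the fastest one.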

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

ICaRus: Identical Cache Reuse for Efficient Multi Model Inference

ICaRus introduces a novel architecture enabling multiple AI models to share identical Key-Value (KV) caches, addressing memory explosion issues in multi-model inference systems. The solution achieves up to 11.1x lower latency and 3.8x higher throughput by allowing cross-model cache reuse while maintaining comparable accuracy to task-specific fine-tuned models.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

Researchers propose SUN (Shared Use of Next-token Prediction), an approach to multi-LLM serving that lets models share decode execution by decomposing transformers into separate prefill and decode modules. The system achieves up to 2.0x higher throughput per GPU while maintaining accuracy comparable to full fine-tuning, and a quantized version (QSUN) provides an additional 45% speedup.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7

LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure

Researchers have released LLMServingSim 2.0, a unified simulator that models the complex interactions between heterogeneous hardware and disaggregated software in large language model serving infrastructures. The simulator achieves 0.97% average error compared to real deployments while maintaining 10-minute simulation times for complex configurations.

AI · Bullish · arXiv – CS AI · May 9 · 6/10

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

VibeServe introduces an AI-driven approach to LLM serving infrastructure that automatically generates specialized system stacks for different workloads rather than relying on single general-purpose designs. The system matches vLLM performance in standard deployment scenarios while significantly outperforming existing solutions in non-standard cases, suggesting a paradigm shift toward generation-time specialization in infrastructure software.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

Coral is a new multi-LLM serving system that optimizes resource allocation across heterogeneous cloud GPUs to reduce inference costs by up to 2.79x. The system uses a two-stage decomposition algorithm that maintains optimal performance while reducing optimization time from hours to seconds, enabling dynamic adaptation to changing demand and resource availability.