#serving-efficiency News & Analysis

2 articles tagged with #serving-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

QCFuse introduces a compressed-view, query-aware selector for retrieval-augmented generation (RAG) systems that accelerates LLM serving by reusing cached key-value computations. The technique achieves a 1.7x speedup over full prefill and a 1.5x speedup over existing baselines while maintaining full-prefill quality, addressing a critical bottleneck in RAG deployment.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

INFRAMIND is a new framework that optimizes multi-agent LLM orchestration by making real-time infrastructure state (queue depths, cache pressure, latencies) central to routing and scheduling decisions. Using reinforcement learning, the system dynamically adjusts model selection and pipeline topology based on GPU cluster load, achieving up to 7.6% accuracy gains and 7x latency reduction while maintaining 99.9% SLO compliance under high load.