🧠 AI · Neutral · Importance: 6/10

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

arXiv – CS AI | Yixuan Mei, Zikun Li, Zixuan Chen, Shiqi Pan, Mengdi Wu, Xupeng Miao, Zhihao Jia, K. V. Rashmi
🤖 AI Summary

Coral is a new multi-LLM serving system that optimizes resource allocation across heterogeneous cloud GPUs to reduce inference costs by up to 2.79x. The system uses a two-stage decomposition algorithm that maintains optimal performance while reducing optimization time from hours to seconds, enabling dynamic adaptation to changing demand and resource availability.

Analysis

Coral addresses a critical infrastructure challenge facing cloud AI providers: efficiently serving multiple large language models simultaneously across diverse and older-generation GPU hardware. The fragmentation of the LLM market means providers must support numerous models concurrently, yet traditional optimization approaches struggle with the computational complexity of jointly allocating resources across heterogeneous hardware while maintaining service quality. This creates substantial operational overhead and cost inefficiency.
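
As a rough illustration of why the joint allocation is hard (the notation below is ours, not the paper's), the core decision can be sketched as an integer program over replica counts x_{m,g} for each model m and GPU type g:

    \min_{x}\ \sum_{m}\sum_{g} c_g \, x_{m,g}
    \quad \text{s.t.} \quad
    \sum_{g} t_{m,g} \, x_{m,g} \ge d_m \;\; \forall m, \qquad
    \sum_{m} x_{m,g} \le N_g \;\; \forall g, \qquad
    x_{m,g} \in \mathbb{Z}_{\ge 0}

where c_g is the hourly price of GPU type g, t_{m,g} the throughput of one replica of model m on that GPU, d_m the model's forecast demand, and N_g the pool size. The real problem layers request routing and service-quality targets on top of this, which is part of why a naive joint solve is slow.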

The system's significance stems from its practical approach to a real infrastructure bottleneck. Rather than requiring top-tier, expensive GPUs, Coral makes mid-tier and legacy hardware economically viable for LLM serving by intelligently routing requests and dynamically adjusting model replicas. The two-stage decomposition is particularly valuable: cutting solve time from hours to tens of seconds enables real-time adaptation to fluctuating demand and hardware availability, which is essential for production cloud environments.
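
To make the two-stage idea concrete, here is a minimal sketch of how such a decomposition could look. The GPU types, throughput and cost figures, and the greedy heuristic are illustrative assumptions, not Coral's actual algorithm or data.

    import math

    # Assumed per-GPU throughput (tokens/sec), hourly cost, and pool size per GPU type.
    GPU_TYPES = {
        "A100": {"tps": 2400.0, "cost": 3.00, "count": 8},
        "V100": {"tps": 900.0, "cost": 1.20, "count": 16},
    }

    # Forecast aggregate demand (tokens/sec) per served model (hypothetical numbers).
    MODELS = {"llama-70b": 4000.0, "mistral-7b": 2500.0, "qwen-14b": 1200.0}

    def stage1_assign_pools(models, gpus):
        """Stage 1: map each model to the GPU type with the lowest cost per token
        that still has capacity (a greedy stand-in for the global optimization)."""
        free = {g: spec["count"] for g, spec in gpus.items()}
        plan = {}
        for name, demand in sorted(models.items(), key=lambda kv: -kv[1]):
            candidates = [g for g in gpus if free[g] > 0]
            best = min(candidates, key=lambda g: gpus[g]["cost"] / gpus[g]["tps"])
            plan[name] = best
            free[best] -= math.ceil(demand / gpus[best]["tps"])
        return plan

    def stage2_size_replicas(models, plan, gpus):
        """Stage 2: within each model's chosen pool, size the replica count so
        aggregate throughput covers forecast demand."""
        return {m: math.ceil(models[m] / gpus[plan[m]]["tps"]) for m in plan}

    plan = stage1_assign_pools(MODELS, GPU_TYPES)
    replicas = stage2_size_replicas(MODELS, plan, GPU_TYPES)
    print(plan)      # e.g. {'llama-70b': 'A100', 'mistral-7b': 'A100', 'qwen-14b': 'A100'}
    print(replicas)  # e.g. {'llama-70b': 2, 'mistral-7b': 2, 'qwen-14b': 1}

Coral's actual solver handles far more (request routing, dynamic replica adjustment, service quality), but the general intuition is the same: separating the coarse placement decision from per-pool sizing shrinks the search space enough that the plan can be recomputed as demand and hardware availability shift.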

For the AI infrastructure market, Coral's results have immediate implications. Cloud providers can substantially reduce operational costs while improving resource utilization, translating to lower pricing for end users and improved margins for providers. This democratizes access to LLM serving infrastructure by making it economically feasible on cheaper commodity hardware. The 2.39x improvement in goodput under resource scarcity demonstrates particular value for edge deployments and regions with limited GPU availability.

The research indicates a broader shift toward optimizing AI inference economics rather than raw performance. As the LLM market matures and commoditizes, operational efficiency becomes a key competitive differentiator. Organizations should monitor whether cloud providers adopt similar heterogeneity-aware scheduling approaches and how this impacts pricing and service tier offerings across major platforms.

Key Takeaways
  • Coral reduces multi-LLM serving costs by up to 2.79x through optimized resource allocation on heterogeneous GPUs.
  • A two-stage decomposition algorithm cuts optimization time from hours to seconds, enabling real-time adaptation to demand shifts.
  • The system makes older and mid-tier GPU hardware economically viable for concurrent LLM inference, reducing hardware cost requirements.
  • Performance gains are most significant under resource scarcity conditions, improving goodput by up to 2.39x.
  • This infrastructure optimization addresses the practical challenge of serving fragmented LLM markets in production cloud environments.