y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#gpu-utilization News & Analysis

4 articles tagged with #gpu-utilization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

4 articles
AIBullisharXiv – CS AI · Jun 57/10
🧠

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

RedKnot is a new KV cache management system for large language models that optimizes memory efficiency by treating cache differently across attention heads rather than as a uniform block. This head-aware approach enables better resource utilization, higher serving concurrency, and improved scalability without requiring model retraining.

AINeutralarXiv – CS AI · Apr 146/10
🧠

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

Researchers present the first systematic study of performance-energy trade-offs in multi-request LLM inference workflows, using NVIDIA A100 GPUs and vLLM/Parrot serving systems. The study identifies batch size as the most impactful optimization lever, though effectiveness varies by workload type, and reveals that workflow-aware scheduling can reduce energy consumption under power constraints.

🏢 Nvidia
AIBullisharXiv – CS AI · Mar 96/10
🧠

MoEless: Efficient MoE LLM Serving via Serverless Computing

Researchers introduce MoEless, a serverless framework for serving Mixture-of-Experts Large Language Models that addresses expert load imbalance issues. The system reduces inference latency by 43% and costs by 84% compared to existing solutions by using predictive load balancing and optimized expert scaling strategies.