Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

arXiv – CS AI | Md. Monzurul Amin Ifath, Israat Haque

AI Summary

Researchers present the first systematic study of performance-energy trade-offs in multi-request LLM inference workflows, using NVIDIA A100 GPUs and vLLM/Parrot serving systems. The study identifies batch size as the most impactful optimization lever, though effectiveness varies by workload type, and reveals that workflow-aware scheduling can reduce energy consumption under power constraints.

Analysis

This research addresses a critical gap in LLM infrastructure optimization by moving beyond single-request benchmarks to examine real-world multi-request workflows where latency and energy costs compound significantly. As organizations deploy LLMs for document summarization, search copilots, and multi-agent systems, understanding how these interdependent requests interact becomes essential for cost-effective operations.

The study systematically evaluates how engineering decisions affect both performance and energy consumption across different workflow patterns. By testing on production-grade serving systems (vLLM and Parrot) rather than theoretical models, the researchers provide insights grounded in actual deployment constraints. Their finding that batch size effectiveness is workload-dependent challenges the common assumption that larger batches always improve efficiency: sequential summarization and multi-agent coding show diminishing returns despite the theoretical benefits of batching.
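Why batching helps some workflows and not others can be sketched with a toy latency model (purely illustrative; the numbers and the model are assumptions, not taken from the paper): a batch of requests sharing a prompt amortizes one prefill across all of them, while a sequential workflow whose each step depends on the previous output cannot be batched and pays prefill per step.

```python
def batch_latency(n_requests, prefill_ms, decode_ms, shared_prompt):
    """Toy latency model: a shared prompt lets one prefill serve the whole
    batch; independent prompts pay prefill per request; decode cost is
    per-request either way."""
    prefill = prefill_ms if shared_prompt else prefill_ms * n_requests
    return prefill + decode_ms * n_requests

# Shared-prompt workload: 8 requests batched together amortize the prefill.
shared = batch_latency(8, prefill_ms=100, decode_ms=50, shared_prompt=True)
# Sequential workload: each step waits on the previous output, so the 8
# requests run one after another, each paying its own prefill.
sequential = sum(batch_latency(1, 100, 50, shared_prompt=False)
                 for _ in range(8))
print(shared, sequential)  # 500 vs 1200 ms in this toy model
```

The same dependency structure that blocks batching in this sketch is what the paper's sequential-summarization and multi-agent patterns exhibit, which is why larger batches buy less there.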

For infrastructure operators and AI development teams, these findings have immediate practical implications. GPU power capping emerges as a reliable but modest lever for energy savings, while output length scales energy linearly and so offers little optimization potential. The comparative analysis between vLLM's engine-level optimizations and Parrot's workflow-aware scheduling provides a roadmap for choosing serving infrastructure based on the binding constraint: throughput prioritization versus strict power budgets.
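The arithmetic behind power capping is worth making explicit: energy is average power times time, so a cap saves energy whenever the slowdown it causes is less than proportional to the power reduction. The numbers below are hypothetical, chosen only to illustrate the mechanism, not measurements from the study.

```python
def energy_per_request(power_w, latency_s):
    """Energy in joules = average power draw (W) * time (s)."""
    return power_w * latency_s

# Hypothetical figures for illustration: capping draw from 400 W to 250 W
# (a 37.5% power cut) slows the request by only 20%, because LLM decoding
# is often memory-bound rather than compute-bound.
uncapped = energy_per_request(400, 1.0)   # 400 J
capped = energy_per_request(250, 1.2)     # ~300 J
savings = 1 - capped / uncapped           # ~25% less energy per request
print(uncapped, capped, savings)
```

On supported NVIDIA GPUs such a cap is applied with `nvidia-smi -pl <watts>`; the sublinear slowdown is what makes capping a "predictable but modest" lever in the paper's terms.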

As LLM deployment costs become increasingly competitive factors in model selection and serving strategy, this research equips developers with evidence-based guidance for system design. Future work should extend these findings to newer models, heterogeneous hardware environments, and dynamic request patterns reflecting real production scenarios.

Key Takeaways
  • Batch size optimization yields workflow-dependent benefits, excelling with shared prompts but providing limited gains for sequential and agentic patterns
  • GPU power capping delivers predictable but modest energy savings without proportional performance degradation
  • vLLM maintains superior GPU utilization for decode-heavy workloads while Parrot's scheduling achieves lower energy consumption under strict power constraints
  • Output length induces linear energy scaling with minimal efficiency gains, making it an unreliable optimization target
  • Multi-request workflow dependencies reveal optimization patterns invisible in single-request benchmarking studies