9 articles tagged with #throughput. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠 Researchers developed ODMA, a new memory allocation strategy that improves Large Language Model serving performance on memory-constrained accelerators by up to 27%. The technique addresses bandwidth limitations in LPDDR systems through adaptive bucket partitioning and dynamic generation-length prediction.
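A rough sketch of the bucket-partitioning idea described above: split a fixed pool of KV-cache blocks across requests in proportion to each request's predicted generation length. The function name and the proportional-split heuristic are assumptions for illustration, not ODMA's actual algorithm.

```python
def partition_buckets(total_blocks, predicted_lengths):
    """Split a fixed pool of KV-cache blocks into per-request buckets
    sized in proportion to each request's predicted generation length."""
    total_pred = sum(predicted_lengths)
    buckets = [max(1, total_blocks * p // total_pred) for p in predicted_lengths]
    # Give any rounding remainder (positive or negative) to the request
    # with the longest prediction, so the pool is exactly consumed.
    buckets[predicted_lengths.index(max(predicted_lengths))] += total_blocks - sum(buckets)
    return buckets
```

As length predictions update between generation steps, the partition can be recomputed, which is one way such a scheme could remain adaptive.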
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers conducted comprehensive benchmarks of LLM inference on AMD Instinct MI325X GPUs, testing models from 235B to 1 trillion parameters. The study reveals that architecture-aware optimization is critical, with different model types requiring specific configurations for optimal performance on AMD hardware.
🧠 Llama
AI · Bullish · MarkTechPost · Mar 11 · 7/10
🧠 NVIDIA has released Nemotron 3 Super, a 120 billion parameter open-source AI model designed for multi-agent applications. The hybrid Mamba-Attention MoE model delivers 5x higher throughput and bridges the gap between proprietary frontier models and transparent open-source alternatives.
🟢 Nvidia
AI · Bullish · arXiv – CS AI · Mar 6 · 7/10
🧠 Researchers introduce AMV-L, a new memory management framework for long-running LLM systems that uses utility-based lifecycle management instead of traditional time-based retention. The system improves throughput by 3.1x and reduces latency by up to 4.7x while maintaining retrieval quality by controlling memory working-set size rather than just retention time.
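The contrast between utility-based and time-based retention fits in a few lines. This is a toy illustration; AMV-L's actual utility model is not described in the summary above.

```python
import heapq

def evict_to_working_set(entries, max_items):
    """Keep the max_items highest-utility memories and evict the rest.
    A TTL policy would instead drop whatever is oldest, regardless of
    how useful it still is; here the working-set *size* is the control.
    entries: list of (utility, key) tuples."""
    if len(entries) <= max_items:
        return entries, []
    kept = heapq.nlargest(max_items, entries)
    kept_keys = {key for _, key in kept}
    evicted = [e for e in entries if e[1] not in kept_keys]
    return kept, evicted
```

Bounding the working set directly caps memory pressure on the serving path, which is one plausible source of the throughput and latency gains the summary reports.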
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2
🧠 Researchers propose SUN (Shared Use of Next-token Prediction), a novel approach for multi-LLM serving that enables cross-model sharing of decode execution by decomposing transformers into separate prefill and decode modules. The system achieves up to 2.0x throughput improvement per GPU while maintaining accuracy comparable to full fine-tuning, with a quantized version (QSUN) providing additional 45% speedup.
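The prefill/decode decomposition can be pictured as per-model prefill modules feeding one shared decode module. The class and function names below are invented for illustration; SUN's real interfaces are not given in the summary.

```python
class SharedDecodeServer:
    """Several models keep their own prefill modules but route decode
    steps through a single shared decode module."""

    def __init__(self, shared_decode):
        self.decode = shared_decode   # one decode fn serving all models
        self.prefills = {}            # model_id -> per-model prefill fn

    def register(self, model_id, prefill):
        self.prefills[model_id] = prefill

    def generate(self, model_id, prompt, steps):
        state = self.prefills[model_id](prompt)   # model-specific prefill
        out = []
        for _ in range(steps):
            token, state = self.decode(state)     # shared decode pass
            out.append(token)
        return out
```

Because the decode module is shared, its weights occupy GPU memory once rather than once per model, which is the kind of consolidation that could raise per-GPU throughput.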
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠 Nightjar is a new adaptive speculative decoding framework for large language models that dynamically adjusts to system load conditions. It achieves 27.29% higher throughput and up to 20.18% lower latency by intelligently enabling or disabling speculation based on workload demands.
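The load-adaptive toggle at the heart of this idea fits in one function. The thresholds and signal choices here are placeholders, not Nightjar's.

```python
def should_speculate(queue_depth, gpu_util, depth_limit=8, util_limit=0.85):
    """Enable speculative decoding only when the system is lightly loaded.
    Under heavy load, draft-model work steals compute the target model
    needs, so speculation is switched off."""
    return queue_depth < depth_limit and gpu_util < util_limit
```

A production system would likely add hysteresis so the decision does not flap when load hovers near the thresholds.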
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠 OrbitFlow is a new KV cache management system for long-context LLM serving that uses adaptive memory allocation and fine-grained optimization to improve performance. The system achieves up to 66% better SLO attainment and 3.3x higher throughput by dynamically managing GPU memory usage during token generation.
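A minimal sketch of on-demand, block-granular KV-cache growth during generation. The class and its policy are assumptions for illustration; OrbitFlow's actual mechanism is more sophisticated.

```python
import math

class KVAllocator:
    """Grow each request's KV cache block-by-block as tokens are
    generated, instead of reserving worst-case length up front."""

    def __init__(self, total_blocks, block_tokens=16):
        self.free = total_blocks
        self.block_tokens = block_tokens
        self.owned = {}  # req_id -> blocks currently held

    def grow(self, req_id, total_tokens):
        """Ensure req_id holds enough blocks for total_tokens generated
        so far. Returns False when the pool is exhausted, signalling the
        scheduler to preempt or swap."""
        need = math.ceil(total_tokens / self.block_tokens)
        extra = need - self.owned.get(req_id, 0)
        if extra > self.free:
            return False
        if extra > 0:
            self.free -= extra
            self.owned[req_id] = need
        return True

    def release(self, req_id):
        """Return a finished request's blocks to the shared pool."""
        self.free += self.owned.pop(req_id, 0)
```

Allocating only what generation has actually consumed lets more requests run concurrently on the same GPU memory, which is the general route to better SLO attainment under long contexts.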
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 17
🧠 Researchers developed a data-driven pipeline to optimize GPU efficiency for distributed LLM adapter serving, achieving sub-5% throughput estimation error while running 90x faster than full benchmarking. The system uses a Digital Twin, machine learning models, and greedy placement algorithms to minimize GPU requirements while serving hundreds of adapters concurrently.
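The greedy placement step can be approximated by first-fit-decreasing bin packing; whether the paper uses exactly this variant is an assumption, and the load numbers below are invented.

```python
def greedy_place(adapters, gpu_capacity):
    """First-fit-decreasing placement: sort adapters by estimated load,
    pack each onto the first GPU with headroom, and open a new GPU only
    when none fits, minimizing the GPUs needed to serve all adapters.
    adapters: list of (name, estimated_load) pairs."""
    gpus = []        # remaining capacity per GPU
    placement = {}   # adapter name -> GPU index
    for name, load in sorted(adapters, key=lambda a: -a[1]):
        for i, cap in enumerate(gpus):
            if load <= cap:
                gpus[i] -= load
                placement[name] = i
                break
        else:
            gpus.append(gpu_capacity - load)
            placement[name] = len(gpus) - 1
    return placement, len(gpus)
```

In the pipeline described above, the per-adapter load estimates would come from the learned throughput model rather than from full benchmarking.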
AI · Neutral · Google Research Blog · Feb 11 · 3/10 · 7
🧠 A research article on algorithmic optimization for scheduling systems with time-varying capacity constraints, addressing theoretical approaches to maximizing throughput in dynamic environments where system capacity changes over time.