y0news

#vllm News & Analysis

14 articles tagged with #vllm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

Lodestar: An Online-Learning LLM Inference Router

Researchers introduce Lodestar, a machine-learning-based request routing system that dynamically assigns large language model inference tasks to GPU instances in distributed clusters. By continuously learning routing strategies in real time, the system achieves up to 4.38x improvements in latency metrics compared to existing heuristics.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

GroundedCache proposes a safety-first framework for reusing cached answers in retrieval-augmented generation systems by validating four conditions before serving cached responses. The system achieves near-zero unsafe-served rates (0-1.5%) across benchmarks while maintaining minimal latency overhead, addressing critical vulnerabilities in current caching approaches that can serve incorrect answers.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

Researchers introduce Qrita, an efficient algorithm for Top-k and Top-p sampling in large language models that uses pivot-based truncation instead of sorting. The method achieves 1.4x throughput improvements with 50% less memory usage while maintaining identical output to traditional sorting approaches, and has been adopted as the default sampler in vLLM.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution

Researchers demonstrate that Mixture of Experts (MoE) models contain substantial underutilized sparsity within individual experts that can be exploited without modifying model parameters. By implementing intra-expert activation sparsity in vLLM, they achieve up to 2.5x speedup in MoE layer execution, offering a practical optimization path for efficient large language model deployment.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

Researchers introduced Ragged Paged Attention (RPA), a specialized inference kernel optimized for Google's TPUs that enables efficient large language model deployment. The innovation addresses the GPU-centric design of existing LLM serving systems by implementing fine-grained tiling and custom software pipelines, achieving up to 86% memory bandwidth utilization on TPU hardware.

🧠 Llama

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Researchers introduce SPEED-Bench, a comprehensive benchmark suite for evaluating Speculative Decoding (SD) techniques that accelerate LLM inference. The benchmark addresses critical gaps in existing evaluation methods by offering diverse semantic domains, throughput-oriented testing across multiple concurrency levels, and integration with production systems like vLLM and TensorRT-LLM, enabling more accurate real-world performance measurement.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

Model2Kernel: Model-Aware Symbolic Execution For Safe CUDA Kernels

Researchers developed Model2Kernel, a system that automatically detects memory safety bugs in CUDA kernels used for large language model inference. The system discovered 353 previously unknown bugs across popular platforms like vLLM and Hugging Face with only nine false positives.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Justitia: Fair and Efficient Scheduling of Task-parallel LLM Agents with Selective Pampering

Justitia is a new scheduling system for task-parallel LLM agents that optimizes GPU server performance through selective resource allocation based on completion order prediction. The system uses memory-centric cost quantification and virtual-time fair queuing to achieve both efficiency and fairness in LLM serving environments.

🏢 Meta

AI · Bullish · arXiv – CS AI · May 9 · 6/10

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

VibeServe introduces an AI-driven approach to LLM serving infrastructure that automatically generates specialized system stacks for different workloads rather than relying on single general-purpose designs. The system matches vLLM performance in standard deployment scenarios while significantly outperforming existing solutions in non-standard cases, suggesting a paradigm shift toward generation-time specialization in infrastructure software.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

Researchers have developed EasySteer, a unified framework for controlling large language model behavior at inference time that achieves a 10.8-22.3x speedup over existing frameworks. The system offers a modular architecture with pre-computed steering vectors for eight application domains, turning steering from a research technique into a production-ready capability.

AI · Bullish · Hugging Face Blog · Jun 3 · 6/10

No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL

The article discusses improving GPU efficiency by co-locating vLLM, an LLM inference and serving engine, with training in TRL (Transformer Reinforcement Learning). The approach aims to maximize GPU utilization and reduce idle compute during model training.

AI · Neutral · Hugging Face Blog · Oct 3 · 1/10

Very Large Language Models and How to Evaluate Them

The article title suggests a discussion about Very Large Language Models (VLLMs) and evaluation methodologies, but the article body appears to be empty or not provided.