
#vllm News & Analysis

9 articles tagged with #vllm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

Researchers introduced Ragged Paged Attention (RPA), a specialized inference kernel optimized for Google's TPUs that enables efficient large language model deployment. The innovation addresses the GPU-centric design of existing LLM serving systems by implementing fine-grained tiling and custom software pipelines, achieving up to 86% memory bandwidth utilization on TPU hardware.

🧠 Llama
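
The RPA kernel itself is TPU-specific, but the data layout it operates on is easy to sketch: KV tensors live in fixed-size pages, and each sequence owns a ragged list of pages plus an exact token count. Below is a minimal NumPy sketch of paged attention over ragged sequences; all names and sizes are hypothetical, and the real kernel fuses this gather-and-attend step with fine-grained tiling on TPU.

```python
import numpy as np

PAGE_SIZE, D = 16, 64                            # tokens per KV page, head dim
kv_pages_k = np.random.randn(128, PAGE_SIZE, D)  # global pool of key pages
kv_pages_v = np.random.randn(128, PAGE_SIZE, D)  # global pool of value pages

def paged_attention(q, page_table, seq_len):
    """Attend one query head over a sequence stored in non-contiguous pages.

    q: (D,) query vector; page_table: page ids owned by this sequence;
    seq_len: exact token count (the last page may be partially filled).
    """
    k = kv_pages_k[page_table].reshape(-1, D)[:seq_len]  # gather + trim ragged tail
    v = kv_pages_v[page_table].reshape(-1, D)[:seq_len]
    scores = k @ q / np.sqrt(D)
    weights = np.exp(scores - scores.max())              # stable softmax
    weights /= weights.sum()
    return weights @ v                                   # (D,) attention output

# Two sequences of different lengths sharing one page pool.
out_a = paged_attention(np.random.randn(D), page_table=[3, 7], seq_len=20)
out_b = paged_attention(np.random.randn(D), page_table=[5], seq_len=9)
```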
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Researchers introduce SPEED-Bench, a comprehensive benchmark suite for evaluating Speculative Decoding (SD) techniques that accelerate LLM inference. The benchmark addresses critical gaps in existing evaluation methods by offering diverse semantic domains, throughput-oriented testing across multiple concurrency levels, and integration with production systems like vLLM and TensorRT-LLM, enabling more accurate real-world performance measurement.
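
For readers new to the technique SPEED-Bench evaluates: speculative decoding has a small draft model propose several tokens cheaply, then the large target model verifies them, keeping only the prefix it agrees with. Here is a toy greedy version with stub models standing in for the draft and target; production systems verify the whole draft block in a single batched forward pass.

```python
def draft_next(ctx):   # stub: a cheap draft model (greedy)
    return (sum(ctx) * 7 + 3) % 50

def target_next(ctx):  # stub: the expensive target model (greedy, mostly agrees)
    return (sum(ctx) * 7 + 3) % 50 if sum(ctx) % 4 else (sum(ctx) + 1) % 50

def speculative_decode(ctx, n_tokens, gamma=4):
    out = list(ctx)
    while len(out) - len(ctx) < n_tokens:
        # 1) Draft gamma tokens autoregressively with the cheap model.
        proposal = []
        for _ in range(gamma):
            proposal.append(draft_next(out + proposal))
        # 2) Verify: the target re-scores each position; keep the matching
        #    prefix, then take the target's own token at the first mismatch.
        for tok in proposal:
            expected = target_next(out)
            out.append(expected)
            if expected != tok:   # rejection: discard the rest of the draft
                break
    return out[len(ctx):][:n_tokens]

print(speculative_decode([1, 2], n_tokens=10))
```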

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

Model2Kernel: Model-Aware Symbolic Execution For Safe CUDA Kernels

Researchers developed Model2Kernel, a system that automatically detects memory safety bugs in CUDA kernels used for large language model inference. The system discovered 353 previously unknown bugs across popular platforms like vLLM and Hugging Face with only nine false positives.

๐Ÿข Hugging Face
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Justitia: Fair and Efficient Scheduling of Task-parallel LLM Agents with Selective Pampering

Justitia is a new scheduling system for task-parallel LLM agents that optimizes GPU server performance through selective resource allocation based on completion order prediction. The system uses memory-centric cost quantification and virtual-time fair queuing to achieve both efficiency and fairness in LLM serving environments.

๐Ÿข Meta
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

Researchers have developed EasySteer, a unified framework for controlling large language model behavior at inference time that achieves a 10.8-22.3x speedup over existing frameworks. The system offers a modular architecture with pre-computed steering vectors for eight application domains, turning steering from a research technique into a production-ready capability.
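
Steering at inference time typically means adding a learned direction to a layer's hidden states during the forward pass. This is not EasySteer's actual API, just a minimal PyTorch sketch of the underlying mechanism via a forward hook; the model, layer choice, steering vector, and scale are all placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 32
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                      nn.Linear(d_model, d_model))  # stand-in for a transformer block

steer_vec = torch.randn(d_model)   # a pre-computed steering direction (placeholder)
alpha = 4.0                        # steering strength

def steering_hook(module, inputs, output):
    # Shift the layer's hidden states along the steering direction.
    return output + alpha * steer_vec

handle = model[0].register_forward_hook(steering_hook)

x = torch.randn(2, d_model)
steered = model(x)
handle.remove()                    # detach the hook to restore default behavior
baseline = model(x)
print((steered - baseline).abs().max())  # nonzero: outputs were steered
```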

AI · Bullish · Hugging Face Blog · Jun 3 · 6/10

No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL

The article discusses improving GPU efficiency by co-locating the vLLM inference engine with training in TRL (Transformer Reinforcement Learning), so that generation for online RL methods and training share the same GPUs instead of reserving dedicated inference hardware. The approach aims to maximize GPU utilization and reduce computational waste in AI model training and deployment.
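
In recent TRL releases, co-location is exposed through the trainer configuration rather than a separate vLLM server process. A sketch assuming TRL's GRPO trainer follows; exact parameter names vary across TRL versions, so treat this as illustrative rather than a definitive recipe.

```python
from trl import GRPOConfig

# Co-located mode runs the vLLM engine inside the trainer processes,
# so generation and training time-share the same GPUs instead of
# reserving dedicated inference GPUs (the "server" mode alternative).
config = GRPOConfig(
    output_dir="grpo-colocate",          # placeholder path
    use_vllm=True,
    vllm_mode="colocate",                # vs. "server" (external vLLM process)
    vllm_gpu_memory_utilization=0.3,     # cap vLLM's share so training still fits
)
```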

AI · Bullish · Hugging Face Blog · Jan 16 · 6/10

Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference

Text Generation Inference introduces multi-backend support for TensorRT-LLM and vLLM, expanding deployment options for AI text generation models. Developers keep the same TGI front end while choosing the serving backend best suited to their model and hardware.
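
One practical consequence is that the client-facing API stays the same regardless of which backend serves the model. Below is a sketch using the huggingface_hub client against a locally running TGI endpoint; the URL and prompt are placeholders.

```python
from huggingface_hub import InferenceClient

# The same request works whether TGI is running its default backend,
# TensorRT-LLM, or vLLM behind this endpoint.
client = InferenceClient(model="http://localhost:8080")  # placeholder TGI URL
reply = client.text_generation(
    "Explain paged attention in one sentence.",
    max_new_tokens=64,
)
print(reply)
```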

AI · Neutral · Hugging Face Blog · Oct 3 · 1/10

Very Large Language Models and How to Evaluate Them

Only the title of this post was available: it indicates a discussion of very large language models and methodologies for evaluating them, but no article body was provided for summarization.