9 articles tagged with #vllm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠 Researchers introduced Ragged Paged Attention (RPA), a specialized inference kernel optimized for Google's TPUs that enables efficient large language model deployment. The innovation addresses the GPU-centric design of existing LLM serving systems by implementing fine-grained tiling and custom software pipelines, achieving up to 86% memory bandwidth utilization on TPU hardware.
🧠 Llama
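The paper's RPA kernel is a TPU-specific implementation, but the underlying paged KV-cache idea can be sketched in plain Python: logical token positions are mapped through a per-sequence block table onto fixed-size physical blocks, so "ragged" batches of different-length sequences share one pool. The block size, class, and method names below are illustrative assumptions, not the paper's API.

```python
# Toy sketch of the paged KV-cache idea behind paged/ragged attention.
# Illustrative only: the RPA kernel in the paper is a TPU kernel, not Python;
# block size and all names here are assumptions.

BLOCK_SIZE = 4  # tokens per physical cache block (assumed for illustration)

class PagedKVCache:
    def __init__(self):
        self.blocks = []        # physical storage: fixed-size blocks
        self.block_tables = {}  # seq_id -> list of physical block indices
        self.lengths = {}       # seq_id -> number of tokens cached

    def append(self, seq_id, kv):
        """Append one token's KV entry, allocating a block on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                 # current block full (or none yet)
            self.blocks.append([None] * BLOCK_SIZE)
            table.append(len(self.blocks) - 1)  # map logical block -> physical
        block_idx = table[n // BLOCK_SIZE]
        self.blocks[block_idx][n % BLOCK_SIZE] = kv
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        """Read back a sequence's KV entries via its block table."""
        table = self.block_tables[seq_id]
        return [self.blocks[table[i // BLOCK_SIZE]][i % BLOCK_SIZE]
                for i in range(self.lengths[seq_id])]

cache = PagedKVCache()
for t in range(6):          # "ragged": sequences may have different lengths
    cache.append("seq-a", ("k%d" % t, "v%d" % t))
cache.append("seq-b", ("k0", "v0"))
print(cache.gather("seq-a")[4])  # -> ('k4', 'v4')
```

Because blocks are fixed-size and indirected through the table, variable-length sequences never need contiguous memory, which is what makes high bandwidth utilization possible when the kernel tiles over these blocks.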
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠 Researchers introduce SPEED-Bench, a comprehensive benchmark suite for evaluating Speculative Decoding (SD) techniques that accelerate LLM inference. The benchmark addresses critical gaps in existing evaluation methods by offering diverse semantic domains, throughput-oriented testing across multiple concurrency levels, and integration with production systems like vLLM and TensorRT-LLM, enabling more accurate real-world performance measurement.
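The draft-and-verify loop that such benchmarks exercise can be sketched with stand-in models. This is a simplified greedy variant under stated assumptions: the "models" are plain functions, and real systems accept or reject draft tokens probabilistically rather than by exact match.

```python
# Minimal greedy sketch of speculative decoding (illustrative; real SD uses
# probabilistic acceptance, and the stand-in "models" below are made up).

def speculative_decode(target_next, draft_next, prompt, k=4, max_new=8):
    """Draft model proposes k tokens per round; target model verifies them."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Cheap draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Target model checks each proposal; keep the agreeing prefix,
        #    then emit the target's own token at the first disagreement.
        accepted, ctx = [], list(out)
        for tok in proposal:
            expected = target_next(ctx)
            if tok != expected:
                accepted.append(expected)   # correct the draft and stop
                break
            accepted.append(tok)
            ctx.append(tok)
        out += accepted                     # >= 1 token per round: progress
    return out[len(prompt):][:max_new]

# Stand-in "models": target repeats a fixed pattern; draft agrees only on
# even-length contexts, so some rounds accept several tokens, others one.
target = lambda ctx: len(ctx) % 3
draft  = lambda ctx: len(ctx) % 3 if len(ctx) % 2 == 0 else 0
print(speculative_decode(target, draft, [0], k=4, max_new=6))
# -> [1, 2, 0, 1, 2, 0]
```

The key property a benchmark must preserve while measuring throughput is visible here: the output is identical to decoding with the target model alone, only the number of expensive target calls changes.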
AI · Bullish · arXiv – CS AI · Mar 27 · 7/10
🧠 Researchers developed Model2Kernel, a system that automatically detects memory safety bugs in CUDA kernels used for large language model inference. The system discovered 353 previously unknown bugs across popular platforms like vLLM and Hugging Face with only nine false positives.
🟢 Hugging Face
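The classic bug class such a detector hunts for is an out-of-bounds access when the launch grid overshoots the buffer and the kernel lacks a bounds guard. A minimal Python sketch of the check, under the assumption that we model the kernel's index expression as a function (the real system analyzes actual CUDA kernels; every name here is hypothetical):

```python
# Toy out-of-bounds check in the spirit of what a CUDA memory-safety detector
# does (illustrative; the real tool analyzes compiled kernels, not Python).

def check_launch(grid_dim, block_dim, buffer_len, index_fn, has_guard):
    """Report (block, thread, index) triples that would access out of bounds."""
    oob = []
    for block in range(grid_dim):
        for tid in range(block_dim):
            idx = index_fn(block, block_dim, tid)
            if has_guard and idx >= buffer_len:
                continue                      # kernel masks the access itself
            if not 0 <= idx < buffer_len:
                oob.append((block, tid, idx))
    return oob

# Classic bug: launching ceil(n / block_dim) blocks without an `if (i < n)`
# guard, so the last block's tail threads index past the buffer.
n = 1000
index = lambda block, bdim, tid: block * bdim + tid
bad  = check_launch(grid_dim=4, block_dim=256, buffer_len=n,
                    index_fn=index, has_guard=False)
good = check_launch(grid_dim=4, block_dim=256, buffer_len=n,
                    index_fn=index, has_guard=True)
print(len(bad), len(good))   # -> 24 0
```

With 4 x 256 = 1024 threads over a 1000-element buffer, the unguarded kernel has 24 offending threads; the guarded variant has none.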
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 Justitia is a new scheduling system for task-parallel LLM agents that optimizes GPU server performance through selective resource allocation based on completion order prediction. The system uses memory-centric cost quantification and virtual-time fair queuing to achieve both efficiency and fairness in LLM serving environments.
🟢 Meta
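Virtual-time fair queuing itself is a classic scheduling technique and can be sketched briefly: each tenant's virtual clock advances in proportion to the cost of work it submits, and the scheduler always serves the task with the smallest virtual finish time. The costs and names below are assumptions standing in for the paper's memory-centric cost model.

```python
# Toy virtual-time fair queue (illustrative; `cost` stands in for a
# memory-centric cost metric, and all names/weights are assumptions).
import heapq

class FairQueue:
    def __init__(self):
        self.heap = []      # (virtual finish time, submit order, tenant, task)
        self.vtime = {}     # per-tenant virtual time
        self.seq = 0

    def submit(self, tenant, task, cost, weight=1.0):
        # A tenant's virtual time advances by cost/weight, so a tenant that
        # floods the queue (or submits expensive tasks) falls behind others.
        start = self.vtime.get(tenant, 0.0)
        finish = start + cost / weight
        self.vtime[tenant] = finish
        heapq.heappush(self.heap, (finish, self.seq, tenant, task))
        self.seq += 1

    def next_task(self):
        finish, _, tenant, task = heapq.heappop(self.heap)
        return tenant, task

q = FairQueue()
q.submit("agent-a", "a1", cost=10)   # agent-a floods three tasks up front
q.submit("agent-a", "a2", cost=10)
q.submit("agent-a", "a3", cost=10)
q.submit("agent-b", "b1", cost=12)   # agent-b arrives later with one task
order = [q.next_task()[1] for _ in range(4)]
print(order)   # -> ['a1', 'b1', 'a2', 'a3']
```

Note how agent-b's single task jumps ahead of agent-a's backlog: fairness comes from comparing virtual finish times, not arrival order.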
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers conducted comprehensive benchmarks of LLM inference on AMD Instinct MI325X GPUs, testing models from 235B to 1 trillion parameters. The study reveals that architecture-aware optimization is critical, with different model types requiring specific configurations for optimal performance on AMD hardware.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠 Researchers have developed EasySteer, a unified framework for controlling large language model behavior at inference time that achieves a 10.8-22.3x speedup over existing frameworks. The system offers a modular architecture with pre-computed steering vectors for eight application domains and turns steering from a research technique into a production-ready capability.
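The core operation behind such frameworks is simple to sketch: a pre-computed steering vector (often a difference of mean activations between "desired" and "undesired" prompts) is added to a layer's hidden state at inference time. The vectors and numbers below are invented for illustration; real frameworks hook transformer layers in the model.

```python
# Minimal sketch of inference-time activation steering (illustrative;
# the vectors, dimensions, and strength here are made-up toy values).

def apply_steering(hidden, steering_vec, strength=1.0):
    """Shift a hidden-state vector along a pre-computed steering direction."""
    return [h + strength * s for h, s in zip(hidden, steering_vec)]

# A steering vector is typically pre-computed as the difference between
# mean activations on "desired" vs "undesired" prompts.
desired   = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
undesired = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
mean = lambda rows: [sum(col) / len(rows) for col in zip(*rows)]
steer = [d - u for d, u in zip(mean(desired), mean(undesired))]

hidden = [0.5, 0.5, 0.5]
print(apply_steering(hidden, steer, strength=0.5))  # -> [1.5, -0.5, 1.5]
```

Because the vector is computed once offline, the per-token inference cost is a single vector addition, which is why this style of control can be fast enough for production serving.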
AI · Bullish · Hugging Face Blog · Jun 3 · 6/10
🧠 The article discusses optimizing GPU efficiency by co-locating the vLLM inference engine with training in TRL (Transformer Reinforcement Learning). This approach aims to maximize GPU utilization and reduce computational waste in AI model training and deployment.
AI · Bullish · Hugging Face Blog · Jan 16 · 6/10
🧠 Text Generation Inference introduces multi-backend support for TRT-LLM and vLLM, expanding deployment options for AI text generation models. This development enhances flexibility and performance optimization capabilities for developers working with large language models.
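The multi-backend pattern the post describes boils down to one serving front end dispatching to interchangeable inference engines behind a common interface. A minimal sketch, with the caveat that TGI's real backends are native integrations and every class here is a stand-in:

```python
# Sketch of a pluggable-backend serving layer (illustrative; the classes are
# stand-ins, not the actual TGI, vLLM, or TensorRT-LLM APIs).
from typing import Protocol

class InferenceBackend(Protocol):
    name: str
    def generate(self, prompt: str, max_tokens: int) -> str: ...

class VLLMBackend:
    name = "vllm"
    def generate(self, prompt: str, max_tokens: int) -> str:
        return f"[vllm] {prompt!r} (+{max_tokens} tokens)"

class TRTLLMBackend:
    name = "trt-llm"
    def generate(self, prompt: str, max_tokens: int) -> str:
        return f"[trt-llm] {prompt!r} (+{max_tokens} tokens)"

class Server:
    def __init__(self, backend: InferenceBackend):
        self.backend = backend   # chosen once at deploy time
    def handle(self, prompt: str) -> str:
        return self.backend.generate(prompt, max_tokens=16)

print(Server(VLLMBackend()).handle("hello"))  # -> [vllm] 'hello' (+16 tokens)
```

Keeping the front end backend-agnostic is what lets operators swap engines per deployment without changing the request-handling code.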
AI · Neutral · Hugging Face Blog · Oct 3 · 1/10
🧠 The article title suggests a discussion of vLLM and evaluation methodologies, but the article body appears to be empty or was not provided.