27 articles tagged with #gpu-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AIBullisharXiv โ CS AI ยท Apr 77/10
๐ง Researchers have developed a new low-bit mixed-precision attention kernel called Diagonal-Tiled Mixed-Precision Attention (DMA) that significantly speeds up large language model inference on NVIDIA B200 GPUs while maintaining generation quality. The technique uses microscaling floating-point (MXFP) data format and kernel fusion to address the high computational costs of transformer-based models.
๐ข Nvidia
AIBullishMarkTechPost ยท Apr 67/10
๐ง RightNow AI has released AutoKernel, an open-source framework that uses autonomous LLM agents to optimize GPU kernels for PyTorch models. This tool aims to automate the complex process of writing efficient GPU code, addressing one of the most challenging aspects of machine learning engineering.
AIBullisharXiv โ CS AI ยท Apr 67/10
๐ง Researchers analyzed data movement patterns in large-scale Mixture of Experts (MoE) language models (200B-1000B parameters) to optimize inference performance. Their findings led to architectural modifications achieving 6.6x speedups on wafer-scale GPUs and up to 1.25x improvements on existing systems through better expert placement algorithms.
๐ข Hugging Face
AIBullisharXiv โ CS AI ยท Mar 177/10
๐ง Researchers introduce PCCL (Performant Collective Communication Library), a new optimization library for distributed deep learning that achieves up to 168x performance improvements over existing solutions like RCCL and NCCL on GPU supercomputers. The library uses hierarchical design and adaptive algorithms to scale efficiently to thousands of GPUs, delivering significant speedups in production deep learning workloads.
AIBullisharXiv โ CS AI ยท Mar 167/10
๐ง Researchers developed HeteroServe, a system that optimizes multimodal large language model inference by partitioning vision encoding and language generation across different GPU tiers. The approach reduces data transfer requirements and achieves 31-40% cost savings while improving throughput by up to 54% compared to existing systems.
AIBullisharXiv โ CS AI ยท Mar 127/10
๐ง Researchers developed KernelSkill, a multi-agent framework that optimizes GPU kernel performance using expert knowledge rather than trial-and-error approaches. The system achieved 100% success rates and significant speedups (1.92x to 5.44x) over existing methods, addressing a critical bottleneck in AI system efficiency.
AIBullisharXiv โ CS AI ยท Mar 117/10
๐ง Researchers introduce FCDM, a fully convolutional diffusion model based on ConvNeXt architecture that achieves competitive performance with DiT-XL/2 using only 50% of the computational resources. The model demonstrates exceptional training efficiency, requiring 7x fewer training steps and can be trained on just 4 GPUs, reviving convolutional networks as an efficient alternative to Transformer-based diffusion models.
AIBullisharXiv โ CS AI ยท Mar 47/102
๐ง Researchers propose SUN (Shared Use of Next-token Prediction), a novel approach for multi-LLM serving that enables cross-model sharing of decode execution by decomposing transformers into separate prefill and decode modules. The system achieves up to 2.0x throughput improvement per GPU while maintaining accuracy comparable to full fine-tuning, with a quantized version (QSUN) providing additional 45% speedup.
AIBullisharXiv โ CS AI ยท Mar 37/102
๐ง Researchers have developed FM Agent, a multi-agent AI framework that combines large language models with evolutionary search to autonomously solve complex research problems. The system achieved state-of-the-art results across multiple domains including operations research, machine learning, and GPU optimization without human intervention.
AIBullisharXiv โ CS AI ยท Mar 37/103
๐ง Researchers introduce FreeKV, a training-free optimization framework that dramatically improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.
$NEAR
AIBullisharXiv โ CS AI ยท Mar 37/104
๐ง Researchers have developed AReaL, a new asynchronous reinforcement learning system that dramatically improves the efficiency of training large language models for reasoning tasks. The system achieves up to 2.77x training speedup compared to traditional synchronous methods by decoupling generation from training processes.
AIBullisharXiv โ CS AI ยท Mar 37/103
๐ง Researchers introduce RACE Attention, a new linear-time alternative to traditional Softmax Attention that can process up to 75 million tokens in a single pass, compared to current GPU-optimized implementations that fail beyond 4 million tokens. The technology uses angular similarity and Gaussian random projections to achieve dramatic efficiency gains while maintaining performance across language modeling and classification tasks.
AIBullisharXiv โ CS AI ยท Feb 277/105
๐ง Researchers introduce K-Search, a new GPU kernel optimization framework that uses co-evolving world models with LLMs to significantly improve performance over existing methods. The system achieves up to 14.3x performance gains on complex kernels by decoupling high-level planning from low-level implementation, addressing limitations of current automated optimization approaches.
AIBullisharXiv โ CS AI ยท Apr 66/10
๐ง Researchers introduce InCoder-32B-Thinking, an AI model trained with Error-driven Chain-of-Thought (ECoT) framework and Industrial Code World Model (ICWM) for industrial software development. The model generates reasoning traces for hardware-constrained programming and achieves top-tier performance on 23 benchmarks, scoring 81.3% on LiveCodeBench v5 and 84.0% on CAD-Coder.
AIBullisharXiv โ CS AI ยท Mar 37/107
๐ง Researchers introduce Attn-QAT, the first systematic approach to 4-bit quantization-aware training for attention mechanisms in AI models. The method enables stable FP4 computation on emerging GPUs and delivers up to 1.5x speedup on RTX 5090 while maintaining model quality across diffusion and language models.
AIBullisharXiv โ CS AI ยท Mar 37/107
๐ง Researchers introduce Whisper-MLA, a modified version of OpenAI's Whisper speech recognition model that uses Multi-Head Latent Attention to reduce GPU memory consumption by up to 87.5% while maintaining accuracy. The innovation addresses a key scalability issue with transformer-based ASR models when processing long-form audio.
AIBullisharXiv โ CS AI ยท Mar 37/1010
๐ง TriMoE introduces a novel GPU-CPU-NDP architecture that optimizes large Mixture-of-Experts model inference by strategically mapping hot, warm, and cold experts to their optimal compute units. The system leverages AMX-enabled CPUs and includes bottleneck-aware scheduling, achieving up to 2.83x performance improvements over existing solutions.
AIBullisharXiv โ CS AI ยท Mar 36/103
๐ง TiledAttention is a new CUDA-based scaled dot-product attention kernel for PyTorch that enables easier modification of attention mechanisms for AI research. It provides a balance between performance and customizability, delivering significant speedups over standard attention implementations while remaining directly editable from Python.
$DOT
AIBullisharXiv โ CS AI ยท Mar 36/103
๐ง Researchers have introduced PiKV, an open-source KV cache management framework designed to optimize memory and communication costs for Mixture of Experts (MoE) language models across multi-GPU and multi-node inference. The system uses expert-sharded storage, intelligent routing, adaptive scheduling, and compression to improve efficiency in large-scale AI model deployment.
AIBullisharXiv โ CS AI ยท Mar 36/104
๐ง Researchers introduce DISCO, a new method for efficiently evaluating machine learning models by selecting samples that maximize disagreement between models rather than relying on complex clustering approaches. The technique achieves state-of-the-art results in performance prediction while reducing the computational cost of model evaluation.
AIBullisharXiv โ CS AI ยท Mar 26/1017
๐ง Researchers developed a data-driven pipeline to optimize GPU efficiency for distributed LLM adapter serving, achieving sub-5% throughput estimation error while running 90x faster than full benchmarking. The system uses a Digital Twin, machine learning models, and greedy placement algorithms to minimize GPU requirements while serving hundreds of adapters concurrently.
AIBullisharXiv โ CS AI ยท Mar 27/1013
๐ง Researchers developed CUDA Agent, a reinforcement learning system that significantly outperforms existing methods for GPU kernel optimization, achieving 100% faster performance than torch.compile on benchmark tests. The system uses large-scale agentic RL with automated verification and profiling to improve CUDA kernel generation, addressing a critical bottleneck in deep learning performance.
AIBullisharXiv โ CS AI ยท Mar 26/1015
๐ง Researchers propose OM2P, a new offline multi-agent reinforcement learning algorithm that achieves efficient one-step action sampling using mean-flow models. The approach delivers up to 3.8x reduction in GPU memory usage and 10.8x speed-up in training time compared to existing diffusion and flow-based models.
AIBullishHugging Face Blog ยท Jun 36/105
๐ง The article discusses optimizing GPU efficiency using co-located vLLM (virtual Large Language Model) infrastructure in TRL (Transformer Reinforcement Learning). This approach aims to maximize GPU utilization and reduce computational waste in AI model training and deployment.
AIBullishHugging Face Blog ยท Sep 136/104
๐ง The article discusses fine-tuning Meta's Llama 2 70B large language model using PyTorch's Fully Sharded Data Parallel (FSDP) technique. This approach enables efficient training of large AI models by distributing parameters across multiple GPUs, making advanced AI model customization more accessible.