#gpu-optimization News & Analysis

46 articles tagged with #gpu-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

🧠

Regression Language Models for Code

Researchers have developed Regression Language Models (RLMs) that use frozen LLM encoders to predict numeric code execution outcomes across multiple programming languages and domains. A 300M parameter model demonstrates strong performance predicting memory footprint, GPU latency, neural network accuracy, and hardware platform performance without domain-specific feature engineering.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

🧠

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

Researchers introduce SDPG, a visual reinforcement learning method that trains robotic control policies significantly faster and more efficiently on consumer GPUs. The approach reduces computational overhead through stochastic gradient estimation while maintaining superior performance, and includes new benchmarks for advancing visual robotics research.

🏢 Nvidia

AI · Neutral · arXiv – CS AI · May 12 · 5/10

🧠

Contextual Plackett-Luce: An Efficient Neural Model for Probabilistic Sequence Selection under Ambiguity

Researchers propose Contextual Plackett-Luce (CPL), a neural probabilistic model for sequence selection that balances computational efficiency with representational flexibility. The model addresses the challenge of predicting multi-modal outputs from single training examples by combining parallel scoring with lightweight autoregressive selection, demonstrating improvements on path prediction and subset selection tasks.
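For readers unfamiliar with the underlying distribution, the sketch below shows the classical Plackett-Luce model that CPL's name refers to: items are picked one at a time, each pick being a softmax over the items not yet chosen. It is not the paper's contextual variant; the `scores` values and function names are hypothetical and stand in for the output of a scoring network.

```python
import itertools
import math
import random


def pl_log_prob(scores, ordering):
    """Log-probability of `ordering` (item indices) under classical Plackett-Luce."""
    remaining = list(range(len(scores)))
    log_p = 0.0
    for item in ordering:
        # Softmax over the items that have not been chosen yet.
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_p += scores[item] - log_norm
        remaining.remove(item)
    return log_p


def pl_sample(scores, k, rng=random):
    """Draw k distinct items, one softmax-weighted choice at a time."""
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        weights = [math.exp(scores[j]) for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


scores = [2.0, 0.5, -1.0]  # hypothetical per-item scores
# Sanity check: probabilities of all full orderings sum to 1.
total = sum(
    math.exp(pl_log_prob(scores, perm))
    for perm in itertools.permutations(range(len(scores)))
)
```

The scores can be computed for all items in parallel, while the selection loop stays cheap and sequential, which is the general split the summary describes.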

AI · Bullish · arXiv – CS AI · May 12 · 6/10

🧠

Geometric 4D Stitching for Grounded 4D Generation

Researchers introduce Geometric 4D Stitching, a novel framework that improves 4D scene generation by explicitly identifying and filling geometric gaps with geometrically consistent components. The method achieves efficient 4D scene reconstruction in under 10 minutes on consumer hardware while supporting iterative scene expansion and editing capabilities.

🏢 Nvidia

AI · Bullish · Hugging Face Blog · May 11 · 6/10

🧠

Building Blocks for Foundation Model Training and Inference on AWS

AWS announced new building blocks and infrastructure optimizations for training and deploying foundation models, aimed at reducing computational costs and complexity for developers. The initiative addresses growing demand for accessible AI infrastructure as foundation model adoption accelerates across enterprises.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

🧠

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

SAVEMem is a training-free framework that improves real-time video understanding by incorporating semantic awareness into memory management rather than relying solely on visual similarity. The system achieves significant performance gains on streaming video benchmarks while reducing GPU memory consumption by 48%, demonstrating practical advances in efficient AI model inference.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

🧠

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

Coral is a new multi-LLM serving system that optimizes resource allocation across heterogeneous cloud GPUs to reduce inference costs by up to 2.79x. The system uses a two-stage decomposition algorithm that maintains optimal performance while reducing optimization time from hours to seconds, enabling dynamic adaptation to changing demand and resource availability.

AI · Bullish · arXiv – CS AI · Apr 6 · 6/10

🧠

InCoder-32B-Thinking: Industrial Code World Model for Thinking

Researchers introduce InCoder-32B-Thinking, an AI model for industrial software development trained with the Error-driven Chain-of-Thought (ECoT) framework and an Industrial Code World Model (ICWM). The model generates reasoning traces for hardware-constrained programming and achieves top-tier performance on 23 benchmarks, scoring 81.3% on LiveCodeBench v5 and 84.0% on CAD-Coder.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7

🧠

Attn-QAT: 4-Bit Attention With Quantization-Aware Training

Researchers introduce Attn-QAT, the first systematic approach to 4-bit quantization-aware training for attention mechanisms in AI models. The method enables stable FP4 computation on emerging GPUs and delivers up to 1.5x speedup on RTX 5090 while maintaining model quality across diffusion and language models.
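As a rough illustration of what 4-bit floating-point rounding does to a tensor, the sketch below "fake-quantizes" a list of values onto the FP4 E2M1 grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with a single per-tensor scale. This is only the forward-pass rounding step commonly used in quantization-aware training; the choice of E2M1, the per-tensor scale, and the function name are assumptions, and the summary does not say which FP4 variant or scaling scheme the paper uses.

```python
import math

# Representable magnitudes of the FP4 E2M1 format (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quant_fp4(values):
    """Round each value to the nearest scaled E2M1 point and return floats."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    # Assumed per-tensor scale: the largest magnitude maps to 6.0.
    scale = max_abs / E2M1_GRID[-1]
    quantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(target - g))
        quantized.append(math.copysign(nearest * scale, v))
    return quantized


example = fake_quant_fp4([6.0, -3.0, 0.4, 2.4])
```

With only eight magnitudes available, small and mid-range values move noticeably (here 0.4 becomes 0.5 and 2.4 becomes 2.0), which is why training with the rounding in the loop can matter for attention quality.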

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7

🧠

Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion

Researchers introduce Whisper-MLA, a modified version of OpenAI's Whisper speech recognition model that uses Multi-Head Latent Attention to reduce GPU memory consumption by up to 87.5% while maintaining accuracy. The innovation addresses a key scalability issue with transformer-based ASR models when processing long-form audio.
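The arithmetic behind a figure like 87.5% can be sketched as follows, under a simplified view in which standard multi-head attention caches full keys and values per token, while latent attention caches one compressed vector per token. The layer count, head count, head size, and latent width below are hypothetical and chosen only so that the latent cache is one eighth of the original; they are not taken from the paper.

```python
def kv_cache_bytes(n_layers, seq_len, per_token_dim, bytes_per_elem=2):
    """Cache size when each layer stores `per_token_dim` values per token."""
    return n_layers * seq_len * per_token_dim * bytes_per_elem


# Hypothetical decoder shape and sequence length.
n_layers, n_heads, head_dim, seq_len = 32, 20, 64, 4096

mha_dim = 2 * n_heads * head_dim   # keys + values for every head
latent_dim = mha_dim // 8          # assumed latent width (1/8 of K+V)

mha_bytes = kv_cache_bytes(n_layers, seq_len, mha_dim)
mla_bytes = kv_cache_bytes(n_layers, seq_len, latent_dim)
reduction = 1 - mla_bytes / mha_bytes  # fraction of cache memory saved
```

Because the cache grows linearly with sequence length, the saving matters most for the long-form audio case the summary mentions.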

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 10

🧠

TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading

TriMoE introduces a novel GPU-CPU-NDP architecture that optimizes large Mixture-of-Experts model inference by strategically mapping hot, warm, and cold experts to their optimal compute units. The system leverages AMX-enabled CPUs and includes bottleneck-aware scheduling, achieving up to 2.83x performance improvements over existing solutions.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

🧠

TiledAttention: a CUDA Tile SDPA Kernel for PyTorch

TiledAttention is a new CUDA-based scaled dot-product attention kernel for PyTorch that enables easier modification of attention mechanisms for AI research. It provides a balance between performance and customizability, delivering significant speedups over standard attention implementations while remaining directly editable from Python.
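To make the idea of tiling scaled dot-product attention concrete, here is a plain-Python sketch for a single query vector. It processes keys and values in tiles and keeps a running maximum and running denominator (the "online softmax" trick used by tiled attention kernels in general), then checks the result against a direct computation. It is a teaching sketch under those assumptions, not the TiledAttention kernel itself; function names and the tile size are hypothetical.

```python
import math


def attention_reference(q, keys, values):
    """Direct softmax(q.K^T / sqrt(d)) V for one query."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """Same result, visiting keys/values one tile at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(logits))
        # Rescale earlier partial sums to the new maximum (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for logit, v in zip(logits, v_tile):
            w = math.exp(logit - new_max)
            denom += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
```

The tiled version never materializes the full row of attention weights, which is what lets a GPU kernel keep each tile in fast on-chip memory.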

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

🧠

PiKV: KV Cache Management System for Mixture of Experts

Researchers have introduced PiKV, an open-source KV cache management framework designed to optimize memory and communication costs for Mixture of Experts (MoE) language models across multi-GPU and multi-node inference. The system uses expert-sharded storage, intelligent routing, adaptive scheduling, and compression to improve efficiency in large-scale AI model deployment.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4

🧠

DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

Researchers introduce DISCO, a new method for efficiently evaluating machine learning models by selecting samples that maximize disagreement between models rather than relying on complex clustering approaches. The technique achieves state-of-the-art results in performance prediction while reducing the computational cost of model evaluation.
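A minimal sketch of disagreement-based sample selection might look like the following. The disagreement score used here (one minus the share of models that agree with the majority prediction) and the function names are assumptions for illustration; the summary does not specify which disagreement measure DISCO uses.

```python
from collections import Counter


def disagreement(preds_for_sample):
    """1 minus the fraction of models that voted for the most common prediction."""
    counts = Counter(preds_for_sample)
    majority = counts.most_common(1)[0][1]
    return 1.0 - majority / len(preds_for_sample)


def select_top_k(predictions, k):
    """predictions[m][i] is model m's predicted label for sample i."""
    n_samples = len(predictions[0])
    scores = [
        disagreement([model_preds[i] for model_preds in predictions])
        for i in range(n_samples)
    ]
    # Stable sort: ties keep their original sample order.
    ranked = sorted(range(n_samples), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical predictions from three models on four samples.
predictions = [
    [0, 1, 1, 2],
    [0, 1, 0, 1],
    [0, 1, 2, 0],
]
chosen = select_top_k(predictions, k=2)
```

Samples on which every model agrees carry little information for telling models apart, so a small evaluation subset built from high-disagreement samples can be cheaper to run.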

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 17

🧠

Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving

Researchers developed a data-driven pipeline to optimize GPU efficiency for distributed LLM adapter serving, achieving sub-5% throughput estimation error while running 90x faster than full benchmarking. The system uses a Digital Twin, machine learning models, and greedy placement algorithms to minimize GPU requirements while serving hundreds of adapters concurrently.
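The greedy placement step can be pictured as a bin-packing problem: assign adapters to as few GPUs as possible without exceeding each GPU's capacity. The sketch below uses first-fit decreasing, a standard greedy heuristic; it is an assumption for illustration, not the paper's algorithm, and the adapter names, load units, and capacity are hypothetical.

```python
def greedy_place(adapter_loads, gpu_capacity):
    """First-fit decreasing: place the heaviest adapters first."""
    gpus = []  # each entry is [remaining_capacity, [adapter names]]
    ordered = sorted(adapter_loads.items(), key=lambda kv: kv[1], reverse=True)
    for name, load in ordered:
        if load > gpu_capacity:
            raise ValueError(f"adapter {name!r} does not fit on a single GPU")
        for gpu in gpus:
            if gpu[0] >= load:
                gpu[0] -= load
                gpu[1].append(name)
                break
        else:
            # No existing GPU has room, so allocate a new one.
            gpus.append([gpu_capacity - load, [name]])
    return [names for _, names in gpus]


# Hypothetical per-adapter load estimates in arbitrary throughput units.
loads = {"a": 60, "b": 50, "c": 40, "d": 30, "e": 20}
placement = greedy_place(loads, gpu_capacity=100)
```

In a pipeline like the one summarized, the load estimates would come from the learned throughput predictor rather than from full benchmarking runs.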

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 13

🧠

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Researchers developed CUDA Agent, a reinforcement learning system for GPU kernel optimization that significantly outperforms existing methods, reporting a 100% faster rate over torch.compile on benchmark tests (a faster rate counts the share of tasks on which the generated kernel beats the baseline, not a doubling of speed). The system uses large-scale agentic RL with automated verification and profiling to improve CUDA kernel generation, addressing a critical bottleneck in deep learning performance.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 15

🧠

OM2P: Offline Multi-Agent Mean-Flow Policy

Researchers propose OM2P, a new offline multi-agent reinforcement learning algorithm that achieves efficient one-step action sampling using mean-flow models. The approach delivers up to 3.8x reduction in GPU memory usage and 10.8x speed-up in training time compared to existing diffusion and flow-based models.

AI · Bullish · Hugging Face Blog · Jun 3 · 6/10 · 5

🧠

No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL

The article discusses optimizing GPU efficiency by running vLLM, an open-source LLM inference engine, co-located with training in TRL (Transformer Reinforcement Learning). This approach aims to maximize GPU utilization and reduce idle compute during AI model training.

AI · Bullish · Hugging Face Blog · Sep 13 · 6/10 · 4

🧠

Fine-tuning Llama 2 70B using PyTorch FSDP

The article discusses fine-tuning Meta's Llama 2 70B large language model using PyTorch's Fully Sharded Data Parallel (FSDP) technique. This approach enables efficient training of large AI models by distributing parameters across multiple GPUs, making advanced AI model customization more accessible.

AI · Bullish · arXiv – CS AI · Mar 3 · 4/10 · 4

🧠

Depth-Structured Music Recurrence: Budgeted Recurrent Attention for Full-Piece Symbolic Music Modeling

Researchers introduce Depth-Structured Music Recurrence (DSMR), a new AI training method for symbolic music generation that processes complete compositions efficiently. The technique uses stateful recurrent attention with distributed memory across layers, achieving similar performance to full-memory models while using 59% less GPU memory and delivering 36% higher throughput.

AI · Bullish · Hugging Face Blog · May 2 · 5/10 · 4

🧠

Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel

The article discusses PyTorch Fully Sharded Data Parallel (FSDP), a technique for accelerating large AI model training by sharding model parameters, gradients, and optimizer states across multiple GPUs. This approach enables training of larger models that wouldn't fit on a single device while improving training efficiency and speed.
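A back-of-the-envelope calculation shows why sharding these three kinds of state matters. The sketch below assumes mixed-precision training with Adam, where a common accounting is 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus two Adam moments). It ignores activations, buffers, and communication overhead, and the function name and model size are hypothetical.

```python
def per_gpu_state_gib(n_params, n_gpus, sharded):
    """Approximate per-GPU memory for weights, gradients, and optimizer state."""
    # Assumed accounting: fp16 params (2) + fp16 grads (2) + fp32 Adam state (12).
    bytes_per_param = 2 + 2 + 12
    total_bytes = n_params * bytes_per_param
    # Plain data parallelism replicates everything; full sharding splits it evenly.
    per_gpu_bytes = total_bytes / n_gpus if sharded else total_bytes
    return per_gpu_bytes / 2**30


n_params = 2**30  # hypothetical model with about 1.07 billion parameters
replicated = per_gpu_state_gib(n_params, n_gpus=8, sharded=False)
fully_sharded = per_gpu_state_gib(n_params, n_gpus=8, sharded=True)
```

Under these assumptions the per-GPU footprint for model state drops in proportion to the number of GPUs, which is what allows models larger than a single device's memory to be trained.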
