#gpu-optimization News & Analysis

64 articles tagged with #gpu-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

Researchers introduce StreamKL, a novel GPU optimization for computing KL divergence in attention distillation that reduces memory requirements from O(N_Q N_K) to O(1) and delivers up to 43x forward-pass speedups. This advancement enables efficient knowledge distillation and model compression for long-context language models on standard hardware.
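
The constant-memory idea can be illustrated with a minimal sketch. It is not the StreamKL kernel itself. It assumes the KL divergence is taken between two softmax distributions over the same keys for a single query, and the function name `streaming_kl` is hypothetical. The logits are consumed one key at a time with a running maximum and running sums, so no full probability row is ever stored.

```python
import math

def streaming_kl(teacher_logits, student_logits):
    """KL(softmax(teacher) || softmax(student)) in one pass with O(1) state.

    Assumes finite logits and two sequences of equal length.
    """
    m_t = m_s = -math.inf   # running maxima, for numerical stability
    z_t = z_s = 0.0         # running sums of exp(logit - running max)
    w = 0.0                 # running sum of exp(t - m_t) * (t - s)
    for t, s in zip(teacher_logits, student_logits):
        # Teacher side: rescale the accumulators when the max grows.
        new_m_t = max(m_t, t)
        scale = math.exp(m_t - new_m_t)   # exp(-inf) == 0.0 on the first step
        e = math.exp(t - new_m_t)
        z_t = z_t * scale + e
        w = w * scale + e * (t - s)
        m_t = new_m_t
        # Student side: only the log-normalizer is needed.
        new_m_s = max(m_s, s)
        z_s = z_s * math.exp(m_s - new_m_s) + math.exp(s - new_m_s)
        m_s = new_m_s
    # KL = E_p[t - s] - logZ_teacher + logZ_student
    return w / z_t - (m_t + math.log(z_t)) + (m_s + math.log(z_s))
```

A GPU version would presumably process keys in tiles rather than one at a time, but the rescaling step stays the same.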

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

ICA Lens: Interpreting Language Models Without Training Another Dictionary

Researchers introduce ICA Lens, a method for interpreting language model representations using independent component analysis (ICA) instead of expensive sparse autoencoders (SAEs). The approach recovers interpretable directions without training a large neural dictionary, achieving competitive performance on standard benchmarks while offering a faster, more accessible alternative for LLM analysis.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Resource-aware Computation-Communication Overlap for multi-GPU ML Workloads

Researchers have developed a method to improve multi-GPU machine learning training by enabling computation and communication to execute simultaneously using shared-memory allocation and scheduling priority adjustments. The technique demonstrates up to 25.5% execution time reduction across NVIDIA and AMD GPUs without requiring modifications to vendor libraries.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

Researchers introduce STAR-KV, an adaptive compression framework that reduces KV cache memory requirements in large language models by up to 75% through low-rank projections and intelligent rank selection. The technique achieves up to 20x compression when combined with quantization and delivers significant speedups in attention computation, addressing a critical bottleneck in LLM inference efficiency.
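
As a rough illustration of how soft thresholding can control rank, the sketch below applies the generic soft-thresholding operator to a list of singular values and counts how many survive. This is not necessarily how STAR-KV applies it. The function names, the threshold `tau`, and the head dimension and bit widths in the usage note are all hypothetical, and the storage ratio ignores the cost of the projection matrices.

```python
def soft_threshold(singular_values, tau):
    """Shrink each singular value by tau and clip the result at zero."""
    return [max(s - tau, 0.0) for s in singular_values]

def effective_rank(singular_values, tau):
    """Count the singular values that survive the threshold."""
    return sum(1 for s in soft_threshold(singular_values, tau) if s > 0.0)

def cache_compression_ratio(head_dim, rank, orig_bits=16, quant_bits=16):
    """Original over compressed per-token storage (cached entries only)."""
    return (head_dim * orig_bits) / (rank * quant_bits)
```

For example, keeping rank 32 out of a head dimension of 128 gives `cache_compression_ratio(128, 32) == 4.0`, which is the same thing as a 75% memory reduction. Storing those entries at a lower bit width multiplies the ratio further.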

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

Researchers introduce APEX4, a pure INT4 inference system that addresses the long-standing challenge of W4A4 quantization in large language models by adapting compute strategies based on GPU architecture. The system achieves up to 2.09× speedup on consumer GPUs while maintaining quality within 0.63 perplexity points of FP16 baselines, making efficient LLM inference more practical across diverse hardware platforms.
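
For readers unfamiliar with the term, W4A4 means both weights and activations are stored as 4-bit integers. The sketch below shows plain symmetric per-tensor INT4 quantization and an integer dot product. It is a generic illustration, not APEX4's scheme, and the function names are hypothetical.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization to integers in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round to the nearest integer, then clamp to the signed 4-bit range.
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def int4_dot(w_q, w_scale, a_q, a_scale):
    """Accumulate in integers, then rescale to float once at the end."""
    acc = sum(w * a for w, a in zip(w_q, a_q))
    return acc * w_scale * a_scale
```

Real W4A4 systems typically use finer-grained (per-group or per-channel) scales and pack two 4-bit values per byte. This sketch omits both.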

$ADA · 🏢 Perplexity

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory

Researchers introduce E2Former-V2, a more scalable architecture for Equivariant Graph Neural Networks that models 3D molecular systems. By combining algebraic sparsity with hardware-optimized execution, the model achieves 20× computational improvements while maintaining competitive accuracy on molecular datasets.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

Researchers introduce HASTE, a hardware-aware sparse training method for extreme multi-label classification that uses group-shared fixed fan-in sparsity to optimize GPU execution. The approach achieves up to 25x speedup in backward passes compared to standard sparse methods while maintaining competitive accuracy, addressing the memory-compute bottleneck in models with millions of output labels.
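
One plausible reading of "group-shared fixed fan-in sparsity" is sketched below. The reading is an assumption, not something the summary confirms: every output unit has the same number of nonzero inputs, and the outputs in a group reuse one index list. The function and parameter names are hypothetical. Because each row has the same length, the weights fit in a dense `[outputs, fan_in]` array, which maps to regular GPU memory access more readily than an irregular sparse format does.

```python
def fixed_fan_in_forward(x, group_indices, weights, group_size):
    """Sparse layer forward pass with a fixed fan-in per output.

    x             -- input vector
    group_indices -- one list of input indices per group of outputs
    weights       -- one row of fan_in weights per output
    group_size    -- number of consecutive outputs sharing an index list
    """
    out = []
    for o, w_row in enumerate(weights):
        idx = group_indices[o // group_size]   # indices shared by the group
        out.append(sum(w * x[i] for w, i in zip(w_row, idx)))
    return out
```

In dynamic sparse training the index lists would be updated over time. This sketch covers only the forward computation.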

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

Researchers introduce APB-V, a sequence-parallel framework that accelerates long-video inference in Large Multimodal Models by distributing approximate attention across multiple GPUs. The approach achieves 12.72x speedup over FlashAttn while processing longer videos without visual compression, addressing a critical bottleneck in AI video understanding.