#cuda-kernels News & Analysis

4 articles tagged with #cuda-kernels. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

Researchers introduce APEX4, a pure INT4 inference system that addresses the long-standing challenge of W4A4 quantization in large language models by adapting compute strategies based on GPU architecture. The system achieves up to 2.09× speedup on consumer GPUs while maintaining quality within 0.63 perplexity points of FP16 baselines, making efficient LLM inference more practical across diverse hardware platforms.

$ADA · 🏢 Perplexity
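As a rough illustration of what "W4A4" means in the summary above, the sketch below quantizes both weights and activations to signed 4-bit integers, takes the dot product in the integer domain, and rescales the result. It assumes symmetric per-tensor scaling, and the names `quantize_int4` and `int4_dot` are hypothetical. It is not APEX4's kernel design, which the summary does not describe.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization to the signed 4-bit range [-8, 7].

    Assumption: one scale per tensor, chosen so the largest magnitude maps to 7.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def int4_dot(weights, activations):
    """Dot product with both operands in INT4, accumulated as a Python int."""
    qw, scale_w = quantize_int4(weights)
    qa, scale_a = quantize_int4(activations)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer accumulation
    return acc * scale_w * scale_a            # rescale back to real units
```

On a GPU the integer accumulation would run on low-precision tensor-core instructions, and the scales are typically kept per group or per channel rather than per tensor.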

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

Researchers introduce HASTE, a hardware-aware sparse training method for extreme multi-label classification that uses group-shared fixed fan-in sparsity to optimize GPU execution. The approach achieves up to 25× speedup in backward passes compared to standard sparse methods while maintaining competitive accuracy, addressing the memory-compute bottleneck in models with millions of output labels.
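To make "group-shared fixed fan-in sparsity" concrete, here is a minimal sketch of one plausible reading: every output unit has exactly `fan_in` incoming connections, and all outputs in a group reuse the same input indices, so the index table is small and regular. The function names, the random index selection, and the layout are assumptions for illustration, not HASTE's actual implementation.

```python
import random


def make_group_shared_indices(n_in, n_out, fan_in, group_size, seed=0):
    """Pick `fan_in` input indices per group; every output in a group shares them."""
    rng = random.Random(seed)
    n_groups = (n_out + group_size - 1) // group_size  # ceiling division
    return [sorted(rng.sample(range(n_in), fan_in)) for _ in range(n_groups)]


def sparse_forward(x, weights, group_indices, group_size):
    """Compute y[o] = sum_j weights[o][j] * x[idx[j]], with idx shared per group.

    `weights` is a dense n_out x fan_in table: fixed fan-in means no ragged rows.
    """
    y = []
    for o, w_row in enumerate(weights):
        idx = group_indices[o // group_size]
        y.append(sum(w * x[i] for w, i in zip(w_row, idx)))
    return y
```

Because each row has the same length and each group gathers the same inputs, the gathered values can be loaded once per group and reused, which is the kind of regularity that tends to map well onto GPU hardware.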

AI · Bullish · arXiv – CS AI · May 7 · 7/10

FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

Researchers introduce FASQ, a calibration-free compression framework for large language models that uses product quantization to reach flexible compressed sizes between 27% and 49% of the original model. The method outperforms existing quantization approaches such as GPTQ and AWQ while enabling faster inference than FP16 on consumer GPUs through custom CUDA kernels.

🧠 Llama
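For readers unfamiliar with product quantization, the technique named in the FASQ summary, the sketch below shows the generic idea: a vector is split into equal-length subvectors, and each subvector is replaced by the index of its nearest codeword in a per-subspace codebook. The codebooks here are supplied by hand and the function names are hypothetical. How FASQ builds its codebooks and kernels is not covered by the summary.

```python
def pq_encode(vec, codebooks):
    """Replace each subvector with the index of its nearest codeword.

    Assumption: len(vec) is divisible by the number of codebooks (subspaces).
    """
    sub_dim = len(vec) // len(codebooks)
    codes = []
    for s, book in enumerate(codebooks):
        sub = vec[s * sub_dim:(s + 1) * sub_dim]
        # Nearest codeword by squared Euclidean distance.
        best = min(
            range(len(book)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, book[k])),
        )
        codes.append(best)
    return codes


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen codewords."""
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out
```

Storage drops because each subvector is kept as a small integer index instead of several floating-point values. Varying the subvector length and codebook size is what allows the compressed size to be tuned.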

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

AlphaLab: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLMs

AlphaLab is an autonomous research system that uses frontier LLMs to automate experimental cycles across computational domains. Without human intervention, it explores datasets, validates frameworks, and runs large-scale experiments while accumulating domain knowledge. Reported results include 4.4× speedups in CUDA optimization, 22% lower validation loss in LLM pretraining, and 23-25% improvements in traffic forecasting.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus