#cuda-optimization News & Analysis

4 articles tagged with #cuda-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 9 · 7/10


AgentCompile: An LLM-Guided Compiler for Direct CUDA Inference

AgentCompile is a CUDA inference compiler that uses large language models to optimize transformer execution on GPUs. The system achieves a 4× to 5.66× speedup over PyTorch across popular models such as Qwen and Llama through LLM-guided specialization decisions and empirical validation.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10


CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

CuTeGen is an AI-powered framework that automates GPU kernel generation and optimization using large language models and the CuTe abstraction layer. The system achieves 1.71× average speedup over PyTorch on standardized benchmarks by employing a generate-test-refine workflow with delayed performance profiling, significantly outperforming prior agentic approaches.

AI · Neutral · arXiv – CS AI · May 27 · 6/10


Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Researchers introduce CUDAnalyst, a new analysis framework that reveals how large language models make planning decisions when generating CUDA kernels by decomposing feedback signals. The study demonstrates that explicit planning helps only when feedback is well-aligned and that effective planning emerges from structured multi-feedback interactions, with findings showing robustness across different models and workloads.

AI · Bullish · arXiv – CS AI · Apr 20 · 6/10


cuNNQS-SCI: A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States

Researchers introduce cuNNQS-SCI, a fully GPU-accelerated framework that removes a critical scalability bottleneck in neural network quantum state methods for complex quantum systems. The system achieves a 2.32× speedup over previous CPU-GPU hybrid approaches while maintaining chemical accuracy, and it sustains over 90% parallel efficiency across 64 GPUs.

🏢 Nvidia