y0news

#cuda News & Analysis

11 articles tagged with #cuda. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

Model2Kernel: Model-Aware Symbolic Execution For Safe CUDA Kernels

Researchers developed Model2Kernel, a system that automatically detects memory safety bugs in CUDA kernels used for large language model inference. The system discovered 353 previously unknown bugs across popular platforms like vLLM and Hugging Face with only nine false positives.
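
The summary does not detail Model2Kernel's analysis, but the class of bug it hunts is easy to illustrate: a CUDA global index that overruns its buffer when the launch grid overshoots the data size. The following is a toy CPU sketch of that safety condition; all names are illustrative, not from the paper.

```python
# Toy sketch of the bug class Model2Kernel targets: a CUDA-style global
# index (blockIdx.x * blockDim.x + threadIdx.x) that can exceed the
# buffer length when the grid overshoots and the kernel has no guard.

def max_global_index(grid_dim: int, block_dim: int) -> int:
    """Largest value blockIdx.x * blockDim.x + threadIdx.x can take."""
    return (grid_dim - 1) * block_dim + (block_dim - 1)

def kernel_is_safe(grid_dim: int, block_dim: int, buf_len: int,
                   has_bounds_guard: bool) -> bool:
    """A guarded kernel (`if (i < n) ...`) is safe; an unguarded one is
    safe only if every launched thread's index stays in bounds."""
    if has_bounds_guard:
        return True
    return max_global_index(grid_dim, block_dim) < buf_len

# 1000 elements launched as 4 blocks of 256 threads: threads 1000..1023
# access out of bounds unless the kernel checks the index first.
assert not kernel_is_safe(4, 256, 1000, has_bounds_guard=False)
assert kernel_is_safe(4, 256, 1000, has_bounds_guard=True)
```

Grids are usually rounded up to a multiple of the block size, so the unguarded variant is a common real-world mistake.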

๐Ÿข Hugging Face
AI · Neutral · arXiv – CS AI · Mar 4 · 6/10

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

Researchers introduce CUDABench, a comprehensive benchmark for evaluating Large Language Models' ability to generate CUDA code from text descriptions. The benchmark reveals significant challenges: generated kernels often compile successfully but rarely pass functional correctness tests, models lack CUDA-specific domain knowledge, and the code they produce utilizes GPU hardware poorly.
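
The compile-versus-correctness gap can be made concrete with a small metric sketch; the result records and field names below are hypothetical, not CUDABench's actual schema.

```python
# Hypothetical evaluation records for four generated kernels: counting how
# many compile versus how many also pass functional tests exposes the gap
# the benchmark highlights.

results = [
    {"compiled": True,  "passed": True},
    {"compiled": True,  "passed": False},   # compiles, wrong output
    {"compiled": True,  "passed": False},   # compiles, bad memory access
    {"compiled": False, "passed": False},   # syntax error
]

compile_rate = sum(r["compiled"] for r in results) / len(results)
pass_rate = sum(r["passed"] for r in results) / len(results)

print(f"compile rate: {compile_rate:.0%}, functional pass rate: {pass_rate:.0%}")
```

A high compile rate paired with a much lower pass rate is exactly the pattern the summary describes.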

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10

GPUTOK: GPU Accelerated Byte Level BPE Tokenization

Researchers developed GPUTOK, a GPU-accelerated tokenizer for large language models that processes text significantly faster than existing CPU-based solutions. The optimized version shows 1.7x speed improvement over tiktoken and 7.6x over HuggingFace's GPT-2 tokenizer while maintaining output quality.
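
For context on what such a tokenizer accelerates, here is a minimal CPU reference of one byte-level BPE merge step; GPUTOK's actual GPU algorithm is not described in the summary, and this sketch is only the serial baseline.

```python
from collections import Counter

def bpe_merge_once(tokens: list[bytes]) -> list[bytes]:
    """One round of byte-level BPE: find the most frequent adjacent pair
    and merge every left-to-right occurrence of it."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)  # fuse the pair into one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from raw bytes, as byte-level tokenizers do.
tokens = [bytes([b]) for b in b"aaabdaaabac"]
tokens = bpe_merge_once(tokens)  # merges the most frequent pair (b"a", b"a")
```

Each merge round scans the whole sequence, which is the data-parallel workload a GPU implementation can exploit.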

AI · Bullish · OpenAI News · Jul 28 · 7/10

Introducing Triton: Open-source GPU programming for neural networks

OpenAI has released Triton 1.0, an open-source Python-like programming language that allows researchers without CUDA expertise to write highly efficient GPU code for neural networks. The tool aims to democratize GPU programming by making it accessible to those without specialized hardware programming knowledge while maintaining performance comparable to expert-level code.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys

Researchers propose a novel self-indexing KV cache system that unifies compression and retrieval for efficient sparse attention in large language models. The method uses 1-bit vector quantization and integrates with FlashAttention to reduce memory bottlenecks in long-context LLM inference.
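
A rough NumPy sketch of the idea (not the paper's actual algorithm): quantize cached keys to one bit per dimension by sign, score the query against the compressed keys cheaply, and retrieve full KV entries only for the top matches.

```python
import numpy as np

# Illustrative 1-bit key compression for sparse-attention retrieval.
# Shapes and the top-k choice are made up for the example.

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 64))    # 128 cached keys, head dim 64
query = rng.standard_normal(64)

signs = np.where(keys >= 0, 1.0, -1.0)   # 1 bit per dimension (sign only)
approx_scores = signs @ query            # cheap proxy for keys @ query
top8 = np.argsort(approx_scores)[-8:]    # indices of entries to retrieve

# Only the selected entries would be read at full precision.
exact_scores = keys[top8] @ query
```

The point of the design is that memory traffic scales with the number of retrieved entries rather than the full context length, which is where the long-context savings come from.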

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

TiledAttention: a CUDA Tile SDPA Kernel for PyTorch

TiledAttention is a new CUDA-based scaled dot-product attention kernel for PyTorch that enables easier modification of attention mechanisms for AI research. It provides a balance between performance and customizability, delivering significant speedups over standard attention implementations while remaining directly editable from Python.
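
For reference, the operation such a kernel implements is scaled dot-product attention; a NumPy version is below. A tiled CUDA kernel computes the same result block by block in on-chip memory instead of materializing the full score matrix, but the numerics match this reference.

```python
import numpy as np

def sdpa(q, k, v):
    """Scaled dot-product attention: softmax(q @ k.T / sqrt(d)) @ v."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (Lq, Lk)
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                     # (Lq, d_v)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))   # 4 queries, dim 8
k = rng.standard_normal((6, 8))   # 6 keys
v = rng.standard_normal((6, 8))
out = sdpa(q, k, v)               # (4, 8)
```

Keeping this math editable from Python is the customizability the summary refers to: researchers can change the score or masking logic without rewriting a fused CUDA kernel from scratch.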

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Researchers developed CUDA Agent, a reinforcement learning system that significantly outperforms existing methods for GPU kernel optimization, achieving 100% faster performance than torch.compile on benchmark tests. The system uses large-scale agentic RL with automated verification and profiling to improve CUDA kernel generation, addressing a critical bottleneck in deep learning performance.
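
The generate–verify–profile–reward loop described here can be sketched in skeleton form; `compile_and_check` and `measure_runtime` below are stand-in stubs for what would really be an nvcc invocation and a GPU profiler run, and the reward shape is an assumption, not the paper's.

```python
# Hypothetical skeleton of an agentic RL loop for kernel generation:
# candidates that fail verification get negative reward; passing ones are
# rewarded by measured speedup over a baseline.

def compile_and_check(kernel_src: str) -> bool:
    return "bug" not in kernel_src               # stub verifier

def measure_runtime(kernel_src: str) -> float:
    return 1.0 / (1 + kernel_src.count("opt"))   # stub profiler

def reward(kernel_src: str, baseline_ms: float) -> float:
    if not compile_and_check(kernel_src):
        return -1.0                              # failed verification
    return baseline_ms / measure_runtime(kernel_src)  # speedup as reward

candidates = ["kernel v1", "kernel v1 opt", "kernel bug"]
best = max(candidates, key=lambda k: reward(k, baseline_ms=1.0))
```

Automated verification is what makes the loop safe to scale: the agent can explore aggressive optimizations because incorrect kernels are filtered out before they earn reward.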

AI · Bullish · Hugging Face Blog · Jan 28 · 6/10

We Got Claude to Build CUDA Kernels and teach open models!

The article discusses using Claude AI to build CUDA kernels and teach open-source models, demonstrating AI's capability in low-level programming and knowledge transfer. This represents a significant advancement in AI-assisted development and model training techniques.

AI · Bullish · OpenAI News · Dec 6 · 6/10

Block-sparse GPU kernels

OpenAI has released highly optimized GPU kernels for block-sparse neural network architectures that can run orders of magnitude faster than existing solutions like cuBLAS or cuSPARSE. These kernels have been used to achieve state-of-the-art results in text sentiment analysis and generative modeling applications.
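
The source of the speedup is that zero blocks are never touched. A NumPy sketch of block-sparse matmul makes this concrete; the block size and layout mask here are invented for illustration.

```python
import numpy as np

B = 4                                    # block size (illustrative)
layout = np.array([[1, 0],               # which (row, col) blocks are nonzero
                   [0, 1]], dtype=bool)

rng = np.random.default_rng(0)
x = rng.standard_normal((2 * B, 2 * B))
w_sparse = rng.standard_normal((2 * B, 2 * B))
for i in range(2):                       # zero out the masked-off blocks
    for j in range(2):
        if not layout[i, j]:
            w_sparse[i*B:(i+1)*B, j*B:(j+1)*B] = 0

out = np.zeros((2 * B, 2 * B))
for i in range(2):
    for j in range(2):
        if layout[i, j]:                 # skip zero blocks entirely
            out[:, j*B:(j+1)*B] += (
                x[:, i*B:(i+1)*B] @ w_sparse[i*B:(i+1)*B, j*B:(j+1)*B]
            )

assert np.allclose(out, x @ w_sparse)    # matches the dense result
```

A dense cuBLAS-style kernel multiplies every block regardless of content; when most of the layout mask is zero, skipping those blocks is where the order-of-magnitude gains come from.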

AI · Neutral · MarkTechPost · Apr 6 · 4/10

An Implementation Guide to Running NVIDIA Transformer Engine with Mixed Precision, FP8 Checks, Benchmarking, and Fallback Execution

A technical tutorial demonstrates implementing NVIDIA's Transformer Engine with mixed-precision acceleration, covering GPU setup, CUDA compatibility verification, and fallback execution handling. The guide focuses on practical deep learning workflow optimization using FP8 precision and benchmarking techniques.
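
The fallback-execution pattern the guide covers reduces to: attempt the FP8 path, and degrade to a higher-precision path if the hardware or driver does not support it. The sketch below simulates that control flow with stand-in functions; `run_fp8` and `run_bf16` are not Transformer Engine APIs, just illustrative stubs.

```python
# Illustrative fallback execution: try FP8, fall back to BF16 on failure.
# In a real Transformer Engine workflow the try-block would wrap the
# FP8-enabled forward pass instead of these stubs.

def run_fp8(x):
    # Simulate an unsupported-hardware error from the FP8 path.
    raise RuntimeError("FP8 not supported on this GPU")

def run_bf16(x):
    return x * 2            # stand-in for the higher-precision compute

def run_with_fallback(x):
    try:
        return run_fp8(x)
    except RuntimeError:
        return run_bf16(x)  # degrade gracefully instead of crashing

result = run_with_fallback(3)
```

Probing once at startup and caching which path succeeded avoids paying the failed attempt on every step.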

๐Ÿข Nvidia