AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠PromptEmbedder introduces a dual-LLM framework that decouples text embedding from specific model architectures, achieving comparable performance to LoRA while reducing GPU memory by 40% and accelerating training 3.7x. The innovation enables efficient transfer across different LLM backbones by retraining only a lightweight alignment matrix rather than entire models.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers present a systematic study of Attention-FFN Disaggregation (AFD), a technique that separates attention and expert layers across different GPU groups to optimize inference serving for Mixture-of-Experts language models. The framework demonstrates that AFD enables 4k tokens/s throughput on DeepSeek-V3.2 under strict latency constraints where traditional disaggregation approaches fail, providing design principles for scaling LLM infrastructure.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce Qrita, an efficient algorithm for Top-k and Top-p sampling in large language models that uses pivot-based truncation instead of sorting. The method achieves 1.4x throughput improvements with 50% less memory usage while maintaining identical output to traditional sorting approaches, and has been adopted as the default sampler in vLLM.
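The idea of replacing a full sort with selection can be sketched in a few lines. This is not Qrita's algorithm or vLLM's sampler; it is a minimal illustration, assuming NumPy, in which `np.partition` (a selection routine) finds the k-th largest logit without ordering the rest. The function names `top_k_mask_sort` and `top_k_mask_select` are hypothetical.

```python
import numpy as np


def top_k_mask_sort(logits: np.ndarray, k: int) -> np.ndarray:
    """Baseline: fully sort, then keep every logit >= the k-th largest."""
    threshold = np.sort(logits)[-k]
    return logits >= threshold


def top_k_mask_select(logits: np.ndarray, k: int) -> np.ndarray:
    """Selection: np.partition puts the k-th largest value at index -k
    without fully sorting the array."""
    threshold = np.partition(logits, -k)[-k]
    return logits >= threshold


rng = np.random.default_rng(0)
logits = rng.normal(size=32_000)  # stand-in for one row of vocabulary logits

mask_sort = top_k_mask_sort(logits, 50)
mask_select = top_k_mask_select(logits, 50)

same = bool(np.array_equal(mask_sort, mask_select))  # both keep the same tokens
kept = int(mask_select.sum())
```

Both functions produce the same mask, which mirrors the summary's claim that truncation without sorting can leave the sampled distribution unchanged.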
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠FlashSVD v1.5 addresses a critical gap between theoretical and practical performance gains in SVD-compressed transformer inference, delivering up to 2.55x speedup through runtime optimization rather than algorithmic improvements alone. The work demonstrates that low-rank compression benefits require co-designed inference systems to translate parameter reduction into actual serving speed improvements.
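The parameter arithmetic behind SVD compression can be sketched as follows. This is not FlashSVD's runtime; it is a minimal NumPy example with illustrative sizes (`m`, `n`, `r` are assumptions) showing that a rank-`r` factorization stores `r * (m + n)` values instead of `m * n`, and that inference then runs two smaller matrix products in place of one.

```python
import numpy as np


def low_rank_factors(W: np.ndarray, r: int):
    """Truncated SVD: return A (m x r) and B (r x n) such that A @ B approximates W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]  # fold the singular values into the left factor
    B = Vt[:r, :]
    return A, B


m, n, r = 512, 512, 64  # illustrative sizes only
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))
A, B = low_rank_factors(W, r)

dense_params = m * n           # values stored by the original layer
factored_params = r * (m + n)  # values stored by the two factors

x = rng.normal(size=n)
y = A @ (B @ x)  # two skinny matmuls replace the single dense W @ x
```

Fewer parameters do not guarantee lower latency on their own: the two-step product adds kernel launches and an intermediate activation, which is the gap the summary says a co-designed inference system has to close.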
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers propose ESSAM, a novel training framework combining Evolution Strategies with Sharpness-Aware Maximization to fine-tune large language models for mathematical reasoning while dramatically reducing GPU memory requirements. The approach achieves comparable accuracy to reinforcement learning methods like PPO and GRPO while using 10-18× less memory, addressing a critical bottleneck in LLM development.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce MISA, an optimization technique that reduces computational costs in DeepSeek's sparse attention mechanism for large language models by treating indexer heads as a mixture-of-experts system. The method achieves 3.82x speedup on GPU inference while maintaining performance across benchmarks, addressing a key bottleneck in long-context LLM processing.
🏢 Nvidia
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduce a queueing-theoretic framework that models LLM inference stability by accounting for both computational and GPU memory constraints from KV caching. The framework derives conditions for service stability and enables operators to calculate optimal cluster sizes for efficient GPU provisioning, with experimental validation showing predictions within 10% accuracy.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers introduce RoundPipe, a novel pipeline scheduling algorithm that enables efficient fine-tuning of large language models on consumer-grade GPUs by eliminating the weight binding constraint that causes computational bottlenecks. The system achieves 1.48-2.16x speedups over existing approaches and enables fine-tuning of models with up to 235 billion parameters on standard hardware.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce Deep Optimizer States, a technique that reduces GPU memory constraints during large language model training by dynamically offloading optimizer state between host and GPU memory during computation cycles. The method achieves 2.5× faster iterations compared to existing approaches by better managing the memory fluctuations inherent in transformer training pipelines.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers identify dimensional misalignment as a critical bottleneck in compressed large language models, where parameter reduction fails to improve GPU performance due to hardware-incompatible tensor dimensions. They propose GAC (GPU-Aligned Compression), a new optimization method that achieves up to 1.5× speedup while maintaining model quality by ensuring hardware-friendly dimensions.
🧠 Llama
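What "hardware-friendly dimensions" can mean in practice is sketched below. This is not the GAC method itself; it is a minimal illustration, assuming that the target GPU prefers tensor dimensions that are multiples of some fixed value. The helper name `round_up_to_multiple` and the default multiple of 8 are assumptions; the right multiple depends on the hardware and data type.

```python
def round_up_to_multiple(dim: int, multiple: int = 8) -> int:
    """Round a tensor dimension up to the next multiple of `multiple`.

    The default of 8 is an illustrative assumption, not a value taken
    from the GAC paper.
    """
    if dim <= 0 or multiple <= 0:
        raise ValueError("dim and multiple must be positive")
    return ((dim + multiple - 1) // multiple) * multiple


# A compressed layer that ends up with an awkward width of 107 ...
aligned = round_up_to_multiple(107)          # ... is padded to 112
unchanged = round_up_to_multiple(128)        # already aligned, stays 128
coarse = round_up_to_multiple(107, 64)       # a coarser alignment gives 128
```

Rounding up trades a few extra parameters for dimensions the hardware handles efficiently, which is the kind of trade-off the summary describes.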
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠TensorHub introduces Reference-Oriented Storage (ROS), a novel weight transfer system that enables efficient reinforcement learning training across distributed GPU clusters without physically copying model weights. The production-deployed system achieves significant performance improvements, reducing GPU stall time by up to 6.7x for rollout operations and improving cross-datacenter transfers by 19x.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers have developed a new low-bit mixed-precision attention kernel called Diagonal-Tiled Mixed-Precision Attention (DMA) that significantly speeds up large language model inference on NVIDIA B200 GPUs while maintaining generation quality. The technique uses microscaling floating-point (MXFP) data format and kernel fusion to address the high computational costs of transformer-based models.
🏢 Nvidia
AI · Bullish · MarkTechPost · Apr 6 · 7/10
🧠RightNow AI has released AutoKernel, an open-source framework that uses autonomous LLM agents to optimize GPU kernels for PyTorch models. This tool aims to automate the complex process of writing efficient GPU code, addressing one of the most challenging aspects of machine learning engineering.
AI · Bullish · arXiv – CS AI · Apr 6 · 7/10
🧠Researchers analyzed data movement patterns in large-scale Mixture of Experts (MoE) language models (200B-1000B parameters) to optimize inference performance. Their findings led to architectural modifications achieving 6.6x speedups on wafer-scale GPUs and up to 1.25x improvements on existing systems through better expert placement algorithms.
🏢 Hugging Face
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce PCCL (Performant Collective Communication Library), a new optimization library for distributed deep learning that achieves up to 168x performance improvements over existing solutions like RCCL and NCCL on GPU supercomputers. The library uses hierarchical design and adaptive algorithms to scale efficiently to thousands of GPUs, delivering significant speedups in production deep learning workloads.
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers developed HeteroServe, a system that optimizes multimodal large language model inference by partitioning vision encoding and language generation across different GPU tiers. The approach reduces data transfer requirements and achieves 31-40% cost savings while improving throughput by up to 54% compared to existing systems.
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers developed KernelSkill, a multi-agent framework that optimizes GPU kernel performance using expert knowledge rather than trial-and-error approaches. The system achieved 100% success rates and significant speedups (1.92x to 5.44x) over existing methods, addressing a critical bottleneck in AI system efficiency.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce FCDM, a fully convolutional diffusion model based on the ConvNeXt architecture that achieves performance competitive with DiT-XL/2 using only 50% of the computational resources. The model demonstrates exceptional training efficiency: it requires 7x fewer training steps and can be trained on just 4 GPUs, reviving convolutional networks as an efficient alternative to Transformer-based diffusion models.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2
🧠Researchers propose SUN (Shared Use of Next-token Prediction), a novel approach for multi-LLM serving that enables cross-model sharing of decode execution by decomposing transformers into separate prefill and decode modules. The system achieves up to 2.0x throughput improvement per GPU while maintaining accuracy comparable to full fine-tuning, with a quantized version (QSUN) providing additional 45% speedup.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers have developed AReaL, a new asynchronous reinforcement learning system that dramatically improves the efficiency of training large language models for reasoning tasks. The system achieves up to 2.77x training speedup compared to traditional synchronous methods by decoupling generation from training processes.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers introduce FreeKV, a training-free optimization framework that dramatically improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 2
🧠Researchers have developed FM Agent, a multi-agent AI framework that combines large language models with evolutionary search to autonomously solve complex research problems. The system achieved state-of-the-art results across multiple domains including operations research, machine learning, and GPU optimization without human intervention.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers introduce RACE Attention, a new linear-time alternative to traditional Softmax Attention that can process up to 75 million tokens in a single pass, compared to current GPU-optimized implementations that fail beyond 4 million tokens. The technology uses angular similarity and Gaussian random projections to achieve dramatic efficiency gains while maintaining performance across language modeling and classification tasks.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 5
🧠Researchers introduce K-Search, a new GPU kernel optimization framework that uses co-evolving world models with LLMs to significantly improve performance over existing methods. The system achieves up to 14.3x performance gains on complex kernels by decoupling high-level planning from low-level implementation, addressing limitations of current automated optimization approaches.
AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠Researchers have developed Regression Language Models (RLMs) that use frozen LLM encoders to predict numeric code execution outcomes across multiple programming languages and domains. A 300M parameter model demonstrates strong performance predicting memory footprint, GPU latency, neural network accuracy, and hardware platform performance without domain-specific feature engineering.