AI · Bullish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers introduce SHAPE, a novel expert pruning framework for Sparse Mixture-of-Experts (MoE) language models that reduces memory requirements by up to 40% without retraining. Unlike traditional pruning methods that evaluate experts independently, SHAPE models expert cooperation using game theory, identifying which expert combinations matter most for model performance.
AI · Bullish · arXiv – CS AI · Jun 2 · 7/10
🧠BudgetDraft is a new training method for sparse-KV speculative decoding that enables faster language model inference under memory constraints. By training drafters to handle multiple KV cache budgets simultaneously, the technique achieves up to 6.55x speedup on mid-to-long context inference while maintaining acceptance rates and reducing GPU memory usage.
AI · Bullish · arXiv – CS AI · Jun 2 · 7/10
🧠Researchers propose LU-KV, a novel framework for optimizing KV cache eviction in large language models by formulating budget allocation as a combinatorial optimization problem. The approach reduces KV cache size by 80% while maintaining performance, significantly lowering inference latency and GPU memory requirements.
AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠Researchers propose LightKV, a technique that reduces Key-Value cache memory overhead in Large Vision-Language Models by compressing vision tokens using cross-modality message passing guided by text prompts. The method achieves a 50% reduction in KV cache size while using only 55% of the original vision tokens, and reduces computation by up to 40% while maintaining performance across eight benchmark datasets.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce AtlasKV, a parametric knowledge integration method that enables large language models to leverage billion-scale knowledge graphs while consuming less than 20GB of VRAM. Unlike traditional retrieval-augmented generation (RAG) approaches, AtlasKV integrates knowledge directly into LLM parameters without requiring external retrievers or extended context windows, reducing inference latency and computational overhead.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers propose ARKV, a new framework for managing memory in large language models that reduces KV cache memory usage by 4x while preserving 97% of baseline accuracy. The adaptive system dynamically allocates precision levels to cached tokens based on attention patterns, enabling more efficient long-context inference without requiring model retraining.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠Nightjar is a new adaptive speculative decoding framework for large language models that dynamically adjusts to system load conditions. It achieves 27.29% higher throughput and up to 20.18% lower latency by intelligently enabling or disabling speculation based on workload demands.
AI · Bullish · Blockonomi · May 29 · 6/10
🧠Samsung shipped its first HBM4E memory chips, which deliver 20% speed improvements over previous generations, and its stock rose 5.8% on the news. The shipment signals progress in high-bandwidth memory technology critical for AI infrastructure, with the HBM market projected to reach $76 billion by 2025.
AI · Bullish · arXiv – CS AI · May 27 · 6/10
🧠Researchers present a new quantization method for large video diffusion models that achieves 59.3% memory reduction while maintaining near-baseline quality. The technique addresses challenges in compressing Wan2.2-I2V's mixture-of-experts architecture by using timestep-aware and expert-specific calibration strategies.
AI · Neutral · Hugging Face Blog · Dec 24 · 4/10 · 6
🧠A technical guide to visualizing and understanding GPU memory usage in PyTorch, a popular machine learning framework. Guides of this kind help developers optimize model training and deployment by managing memory resources more effectively.
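
Several items in this feed quote KV cache savings (an 80% reduction, a 50% reduction, a 4x reduction). As a rough aid for reading those figures, the sketch below estimates the size of a standard KV cache and applies each kind of saving. The formula is the usual back-of-the-envelope estimate (keys plus values, per layer, per KV head, per token); the model shape used here is hypothetical and is not taken from any of the papers listed, and the function names are illustrative only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def after_percent_reduction(total_bytes, percent_removed):
    """Size remaining after removing a percentage of the cache (e.g. 80)."""
    return total_bytes * (100 - percent_removed) / 100


def after_factor_reduction(total_bytes, factor):
    """Size remaining after shrinking the cache by a factor (e.g. 4 for '4x')."""
    return total_bytes / factor


GIB = 2 ** 30

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8,
                          head_dim=128, seq_len=32768)

reduced_80 = after_percent_reduction(baseline, 80)   # "80% smaller"
reduced_50 = after_percent_reduction(baseline, 50)   # "50% smaller"
reduced_4x = after_factor_reduction(baseline, 4)     # "4x smaller"

print(f"baseline:      {baseline / GIB:.2f} GiB")
print(f"80% reduction: {reduced_80 / GIB:.2f} GiB")
print(f"50% reduction: {reduced_50 / GIB:.2f} GiB")
print(f"4x reduction:  {reduced_4x / GIB:.2f} GiB")
```

With this assumed shape the baseline works out to 4 GiB for a single 32,768-token sequence, so a "4x" reduction corresponds to removing 75% of the cache, slightly less than the 80% figure quoted elsewhere in the feed.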