y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#memory-optimization News & Analysis

25 articles tagged with #memory-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

25 articles
AIBullisharXiv – CS AI · 2d ago7/10
🧠

IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.

AIBullisharXiv – CS AI · Mar 267/10
🧠

Moonwalk: Inverse-Forward Differentiation

Researchers introduce Moonwalk, a new algorithm that solves backpropagation's memory limitations by eliminating the need to store intermediate activations during neural network training. The method uses vector-inverse-Jacobian products and submersive networks to reconstruct gradients in a forward sweep, enabling training of networks more than twice as deep under the same memory constraints.

AIBullishDecrypt · Mar 257/10
🧠

Google Shrinks AI Memory With No Accuracy Loss—But There's a Catch

Google has developed a technique that significantly reduces memory requirements for running large language models as context windows expand, without compromising accuracy. This breakthrough addresses a major constraint in AI deployment, though the article suggests there are limitations to the approach.

Google Shrinks AI Memory With No Accuracy Loss—But There's a Catch
AIBullisharXiv – CS AI · Mar 167/10
🧠

LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing

Researchers introduce LightMoE, a new framework that compresses Mixture-of-Experts language models by replacing redundant expert modules with parameter-efficient alternatives. The method achieves 30-50% compression rates while maintaining or improving performance, addressing the substantial memory demands that limit MoE model deployment.

AIBullisharXiv – CS AI · Mar 117/10
🧠

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Researchers propose ARKV, a new framework for managing memory in large language models that reduces KV cache memory usage by 4x while preserving 97% of baseline accuracy. The adaptive system dynamically allocates precision levels to cached tokens based on attention patterns, enabling more efficient long-context inference without requiring model retraining.

AIBullisharXiv – CS AI · Mar 67/10
🧠

Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Researchers developed a memory management system for multi-agent AI systems on edge devices that reduces memory requirements by 4x through 4-bit quantization and eliminates redundant computation by persisting KV caches to disk. The solution reduces time-to-first-token by up to 136x while maintaining minimal impact on model quality across three major language model architectures.

🏢 Perplexity🧠 Llama
AIBullisharXiv – CS AI · Mar 46/103
🧠

AlphaFree: Recommendation Free from Users, IDs, and GNNs

Researchers propose AlphaFree, a novel recommender system that eliminates traditional dependencies on user embeddings, raw IDs, and graph neural networks. The system achieves up to 40% performance improvements while reducing GPU memory usage by up to 69% through language representations and contrastive learning.

AIBullisharXiv – CS AI · Mar 37/104
🧠

ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM

Researchers propose ROMA, a new hardware accelerator for running large language models on edge devices using QLoRA. The system uses ROM storage for quantized base models and SRAM for LoRA weights, achieving over 20,000 tokens/s generation speed without external memory.

AINeutralarXiv – CS AI · Mar 37/104
🧠

Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

Researchers analyzed 20 Mixture-of-Experts (MoE) language models to study local routing consistency, finding a trade-off between routing consistency and local load balance. The study introduces new metrics to measure how well expert offloading strategies can optimize memory usage on resource-constrained devices while maintaining inference speed.

AIBullisharXiv – CS AI · Mar 37/105
🧠

Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling

Researchers provide mathematical proof that implicit models can achieve greater expressive power through increased test-time computation, explaining how these memory-efficient architectures can match larger explicit networks. The study validates this scaling property across image reconstruction, scientific computing, operations research, and LLM reasoning domains.

AIBullisharXiv – CS AI · Mar 37/103
🧠

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Researchers propose TRIM-KV, a novel approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing the quadratic cost of self-attention and growing key-value cache issues. The method outperforms existing eviction baselines across multiple benchmarks and provides insights into LLM interpretability through learned retention scores.

AIBullisharXiv – CS AI · Mar 37/102
🧠

ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits

ButterflyMoE introduces a breakthrough approach to reduce memory requirements for AI expert models by 150× through geometric parameterization instead of storing independent weight matrices. The method uses shared ternary prototypes with learned rotations to achieve sub-linear memory scaling, enabling deployment of multiple experts on edge devices.

AIBullisharXiv – CS AI · Feb 277/108
🧠

FlashOptim: Optimizers for Memory Efficient Training

FlashOptim introduces memory optimization techniques that reduce AI training memory requirements by over 50% per parameter while maintaining model quality. The suite reduces AdamW memory usage from 16 bytes to 7 bytes per parameter through improved master weight splitting and 8-bit optimizer state quantization.

AIBullisharXiv – CS AI · 1d ago6/10
🧠

Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents

Aethon is a new systems primitive that enables stateful AI agents to be instantiated in near-constant time by using reference-based replication instead of full materialization. This architectural innovation addresses latency and memory overhead constraints in existing AI runtime systems, making it possible to spawn, specialize, and govern agents at production scale.

AINeutralarXiv – CS AI · 3d ago6/10
🧠

Artifacts as Memory Beyond the Agent Boundary

Researchers formalize how agents can use environmental artifacts as external memory to reduce computational requirements in reinforcement learning tasks. The study demonstrates that spatial observations can implicitly serve as memory substitutes, allowing agents to learn effective policies with less internal memory capacity than previously thought necessary.

AIBullisharXiv – CS AI · Apr 76/10
🧠

REAM: Merging Improves Pruning of Experts in LLMs

Researchers propose REAM (Router-weighted Expert Activation Merging), a new method for compressing large language models that groups and merges expert weights instead of pruning them. The technique preserves model performance better than existing pruning methods while reducing memory requirements for deployment.

AIBullisharXiv – CS AI · Mar 176/10
🧠

Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys

Researchers propose a novel self-indexing KV cache system that unifies compression and retrieval for efficient sparse attention in large language models. The method uses 1-bit vector quantization and integrates with FlashAttention to reduce memory bottlenecks in long-context LLM inference.

AIBullisharXiv – CS AI · Mar 126/10
🧠

LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

Researchers have developed LookaheadKV, a new framework that significantly improves memory efficiency in large language models by intelligently evicting less important cached data. The method achieves superior accuracy while reducing computational costs by up to 14.5x compared to existing approaches, making long-context AI tasks more practical.

AIBullisharXiv – CS AI · Mar 37/108
🧠

MemPO: Self-Memory Policy Optimization for Long-Horizon Agents

Researchers propose MemPO (Self-Memory Policy Optimization), a new algorithm that enables AI agents to autonomously manage their memory during long-horizon tasks. The method achieves significant performance improvements with 25.98% F1 score gains over base models while reducing token usage by 67.58%.

AIBullisharXiv – CS AI · Mar 36/103
🧠

FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

FluxMem is a new training-free framework for streaming video understanding that uses hierarchical memory compression to reduce computational costs. The system achieves state-of-the-art performance on video benchmarks while reducing latency by 69.9% and GPU memory usage by 34.5%.

AIBullisharXiv – CS AI · Mar 27/1012
🧠

MEGS$^{2}$: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning

Researchers introduce MEGS², a new memory-efficient framework for 3D Gaussian Splatting that reduces memory consumption by 50% for static rendering and 40% for real-time rendering. The breakthrough enables 3D rendering on edge devices by replacing memory-intensive spherical harmonics with lightweight spherical Gaussian lobes and implementing unified pruning optimization.

AIBullishHugging Face Blog · Jul 306/105
🧠

Memory-efficient Diffusion Transformers with Quanto and Diffusers

The article discusses memory-efficient implementation of Diffusion Transformers using Quanto quantization library integrated with Diffusers. This technical advancement enables running large-scale AI image generation models with reduced memory requirements, making them more accessible for deployment.

AINeutralarXiv – CS AI · Feb 274/108
🧠

Generalized Rapid Action Value Estimation in Memory-Constrained Environments

Researchers introduce GRAVE2, GRAVER and GRAVER2 algorithms that extend Generalized Rapid Action Value Estimation (GRAVE) for game playing AI. These new variants dramatically reduce memory requirements while maintaining the same playing strength as the original GRAVE algorithm.