#memory-efficiency News & Analysis

60 articles tagged with #memory-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Latent Personal Memory: Represent personal memory as dynamic soft prompts

Researchers introduce Latent Personal Memory (LPM), a framework that personalizes large language models by encoding user-specific behavioral patterns as compact, interpretable latent slots converted into dynamic soft prompts. The approach achieves significant efficiency gains—outperforming LoRA and Prompt Tuning by up to 54.4% on benchmarks while reducing memory usage by 64x—making personalized LLMs more practical for deployment.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Keyless Attention: Value-Space Routing and Value-Only Caching for Efficient Transformers

Researchers propose Keyless Attention, a transformer mechanism that eliminates key projections to reduce KV cache memory by 50% while maintaining or improving performance across multiple model architectures. The approach introduces a value-space routing matrix that replaces the traditional key projection, demonstrating competitive results on perplexity and downstream benchmarks.
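The 50% figure follows from simple cache arithmetic: if keys and values have the same shape, storing only values halves the cache. A minimal sketch of that calculation, assuming a hypothetical model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit elements); `kv_cache_bytes` is an illustrative helper, not code from the paper:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2, tensors_per_token=2):
    """Back-of-envelope cache size in bytes.

    tensors_per_token=2 models a conventional cache (keys and values);
    tensors_per_token=1 models a value-only cache.
    """
    per_token = layers * kv_heads * head_dim * bytes_per_elem * tensors_per_token
    return per_token * seq_len * batch


# Hypothetical configuration, chosen only for illustration.
config = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

full_cache = kv_cache_bytes(**config)                       # keys + values
value_only = kv_cache_bytes(**config, tensors_per_token=1)  # values only

print(f"keys + values: {full_cache / 2**20:.0f} MiB")
print(f"values only:   {value_only / 2**20:.0f} MiB")
```

The saving scales linearly with sequence length and batch size, so the absolute benefit grows with long contexts and high serving concurrency.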

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Memory Is No Longer a Bottleneck: Memory-Efficient Graph Filtering for Scalable Collaborative Filtering

Researchers have developed Mem-GF, a memory-efficient graph filtering method for collaborative filtering that eliminates the need to store full item similarity graphs. The approach uses Krylov subspaces to approximate polynomial graph filters, achieving 5.74× lower memory usage and 4.38× faster runtime while maintaining or exceeding recommendation accuracy of existing methods.

AI · Neutral · arXiv – CS AI · Jun 10 · 7/10

Unifying Data, Memory, and Compute Efficiency in LLM training: A Survey

A comprehensive survey examines how data efficiency, memory constraints, and compute budgets interact as coupled bottlenecks in LLM training. The research reveals that optimal training strategies are resource-dependent rather than universal, with GPU memory often being the primary limiting factor rather than raw computational power.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

Researchers introduce Sigma-Branch, a neural network restructuring framework that reduces per-inference active parameters by 58-60% while maintaining full model capacity in memory. The approach uses hierarchical routing and binary tree architecture to enable efficient edge deployment without permanent model compression trade-offs.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

Researchers introduce EPIC, a novel approach to on-device Retrieval-Augmented Generation (RAG) that prioritizes user preferences as compact personal context while operating under strict memory constraints. The method achieves dramatic efficiency gains—reducing memory usage by 2,404x and latency by 32x—while improving preference-following accuracy by 18.79 percentage points across multiple benchmarks.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

Researchers introduce IntentKV, a learned KV cache pruning technique that optimizes memory usage for multi-turn LLM agents without modifying the base model. The method achieves 23-30% reductions in peak request tokens and up to 92.6% fewer KV reads under tight memory budgets, addressing a critical bottleneck in long-horizon agent inference.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

Researchers present a Mathematics of Arrays framework that optimizes transformer attention kernels to approach the theoretical minimum memory requirement, reducing data movement from O(n²) to O(n). The work provides formal proofs of memory optimality and projects 2-100x speedups, addressing a critical computational bottleneck in AI systems.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Channel-Wise Mixed-Precision Quantization for Large Language Models

Researchers introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel technique that reduces Large Language Model memory requirements by assigning different precision levels to different weight channels based on activation patterns. The method enables fractional-bit quantization between 2-4 bits while preserving critical information through outlier extraction, addressing deployment constraints on edge devices.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

RedKnot is a new KV cache management system for large language models that optimizes memory efficiency by treating cache differently across attention heads rather than as a uniform block. This head-aware approach enables better resource utilization, higher serving concurrency, and improved scalability without requiring model retraining.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

ABBEL: Learning Natural-Language Belief States for Memory-Efficient Interaction

ABBEL is a new recursive summarization framework that enables AI agents to maintain memory-efficient interaction histories by storing information as natural-language belief states rather than full context. The approach uses reinforcement learning techniques to improve belief generation quality, achieving 40% better performance than prior memory-constrained agents while using 67% less memory.

AI × Crypto · Bullish · Crypto Briefing · Jun 1 · 7/10

Tether AI open-sources TurboQuant, reducing LLM KV cache memory use by 5x

Tether AI has open-sourced TurboQuant, a technology that reduces large language model KV cache memory consumption by 5x. The release aims to democratize AI development by enabling efficient local deployment and reducing dependence on centralized cloud infrastructure.

AI × Crypto · Bullish · Crypto Briefing · Jun 1 · 7/10

Tether releases open source version of Google’s TurboQuant to cut AI memory use

Tether has released an open-source version of Google's TurboQuant, a technology designed to reduce AI memory consumption. This move aims to decentralize AI development by enabling local devices to run sophisticated AI models without relying on centralized cloud infrastructure.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

On Efficient Scaling of GNNs via IO-Aware Layers Implementations

Researchers develop GPU kernel optimizations for Graph Neural Networks that reduce memory traffic and improve computational efficiency across three major layer types. The work achieves significant speedups (up to 8.5x for GATv2, 10x for aggregation layers) while dramatically reducing memory consumption, with implementations released as drop-in replacements for existing frameworks.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Researchers introduce Moment-KV, a momentum-based compression technique that optimizes Key-Value cache usage during LLM decoding phases. The method improves long-generation task performance by 2.3-3.2% while maintaining latency by dynamically tracking token importance through temporal attention patterns rather than static heuristics.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Researchers present a framework for converting Mixture-of-Experts (MoE) language models into standard dense architectures through expert selection, grouping, and knowledge distillation. The method achieves superior performance compared to traditional dense-to-dense pruning while enabling deployment on memory-constrained systems.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

FD-RAG: Federated Dual-System Retrieval-Augmented Generation

FD-RAG introduces a federated framework for retrieval-augmented generation that enables decentralized LLM deployment across edge devices without centralizing sensitive data. The system achieves 7.8% accuracy improvements and 8.4x latency reductions by splitting lightweight memory access from expensive LLM reasoning, while aggregating anonymized knowledge across fragmented device networks.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

Researchers introduce ReMoE, a router fine-tuning framework that optimizes Mixture-of-Experts language models for memory-constrained inference by increasing expert reuse and reducing storage I/O overhead. The approach improves expert reuse by 26% while maintaining performance, delivering up to 1.99× decode speedup on edge devices.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

SWIFT is a new training-free framework for generating long videos with multiple prompt changes, addressing the challenge of maintaining visual coherence while rapidly adapting to semantic shifts. The system achieves 22.6 FPS on a single H100 GPU by using adaptive memory management and selective attention updates rather than rebuilding cached memory at each prompt boundary.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

Researchers introduce Memory-Efficient Looped Transformer (MELT), an architecture that decouples reasoning depth from memory consumption in recurrent language models. MELT replaces the standard approach of maintaining separate Key-Value caches per reasoning loop with a single shared cache per layer, updated via learnable gating, achieving constant-memory iterative reasoning comparable to standard LLMs while outperforming them on benchmarks.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

Researchers propose ESSAM, a novel training framework combining Evolution Strategies with Sharpness-Aware Maximization to fine-tune large language models for mathematical reasoning while dramatically reducing GPU memory requirements. The approach achieves accuracy comparable to reinforcement learning methods like PPO and GRPO while using 10× to 18× less memory, addressing a critical bottleneck in LLM development.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

Researchers propose sparse prefix caching, a novel optimization technique for hybrid and recurrent LLM serving that stores exact states at checkpoint positions rather than caching entire token histories. The method uses dynamic programming to determine optimal checkpoint placement and demonstrates superior performance on real-world datasets while using fewer checkpoints than existing dense caching approaches.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments

Researchers introduce AdaMeZO, a new zeroth-order optimizer that combines the memory efficiency of MeZO with Adam-style moment estimation for fine-tuning large language models. The method achieves faster convergence than MeZO while reducing GPU memory requirements and requiring up to 70% fewer forward passes.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

Make Your LVLM KV Cache More Lightweight

Researchers propose LightKV, a technique that reduces Key-Value cache memory overhead in Large Vision-Language Models by compressing vision tokens using cross-modality message passing guided by text prompts. The method achieves a 50% reduction in KV cache size while using only 55% of the original vision tokens and reducing computation by up to 40%, maintaining performance across eight benchmark datasets.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

Researchers propose an expert-wise mixed-precision quantization strategy for Mixture-of-Experts models that assigns bit-widths based on router gradient changes and neuron variance. The method achieves higher accuracy than existing approaches while reducing inference memory overhead on large-scale models like Switch Transformer and Mixtral with minimal computational overhead.
