#memory-management News & Analysis

38 articles tagged with #memory-management. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice

Researchers propose Geometry-Aware Online Scheduling, introducing the Smallest Volume First (SVF) algorithm, which optimizes LLM inference by accounting for the dynamic memory footprint of Key-Value caches. The approach improves upon traditional time-centric scheduling heuristics, achieving significant latency reductions and throughput gains when integrated into vLLM.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

StackPlanner: A Centralized Hierarchical Multi-Agent System with Task-Experience Memory Management

StackPlanner introduces a hierarchical multi-agent system that improves coordination among large language model-based agents through explicit memory management and reusable experience learning. The framework addresses critical limitations in centralized multi-agent architectures by decoupling high-level coordination from task execution and enabling agents to retain and leverage past coordination strategies, demonstrating improved performance on complex benchmarks.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory

Researchers introduce MemToolAgent, a framework that enhances LLM agents' ability to use tools effectively by implementing memory management systems that store and retrieve past experiences. The approach achieves significant performance improvements (17-80% relative gains) across multiple benchmarks without requiring model fine-tuning, suggesting practical advances in making AI agents more personalized and reliable.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents

Researchers introduce MAGE, a novel memory management system for LLM-based agents that organizes task histories as hierarchical state trees rather than semantic similarity clusters. The approach achieves 7.8-20.4 percentage point improvements in task success rates while reducing token consumption by 55.1% on long-horizon tasks with interdependent decisions.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

VikingMem: A Memory Base Management System for Stateful LLM-based Applications

Researchers introduce VikingMem, a memory management system for long-term LLM interactions that addresses context window limitations through selective memory extraction, stateful evolution, and temporal weighting. The system demonstrates a 30% improvement in memory retrieval effectiveness while maintaining low latency, offering a generalizable solution for applications beyond traditional chatbots.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Researchers introduce Deep Optimizer States, a technique that reduces GPU memory constraints during large language model training by dynamically offloading optimizer state between host and GPU memory during computation cycles. The method achieves 2.5× faster iterations compared to existing approaches by better managing the memory fluctuations inherent in transformer training pipelines.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

LightThinker++: From Reasoning Compression to Memory Management

Researchers developed LightThinker++, a new framework that enables large language models to compress intermediate reasoning thoughts and manage memory more efficiently. The system reduces peak token usage by up to 70% while improving accuracy by 2.42% and maintaining performance over extended reasoning tasks.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

Researchers developed ODMA, a new memory allocation strategy that improves Large Language Model serving performance on memory-constrained accelerators by up to 27%. The technique addresses bandwidth limitations in LPDDR systems through adaptive bucket partitioning and dynamic generation-length prediction.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

StatePlane: A Cognitive State Plane for Long-Horizon AI Systems Under Bounded Context

Researchers introduce StatePlane, a model-agnostic cognitive state management system that enables AI systems to maintain coherent reasoning over long interaction horizons without expanding context windows or retraining models. The system uses episodic, semantic, and procedural memory mechanisms inspired by cognitive psychology to overcome current limitations in large language models.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Justitia: Fair and Efficient Scheduling of Task-parallel LLM Agents with Selective Pampering

Justitia is a new scheduling system for task-parallel LLM agents that optimizes GPU server performance through selective resource allocation based on completion order prediction. The system uses memory-centric cost quantification and virtual-time fair queuing to achieve both efficiency and fairness in LLM serving environments.

🏢 Meta

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios

Researchers introduce MoE-SpAc, a new framework for efficient Mixture-of-Experts model inference on edge devices that achieves a 42% improvement over existing baselines. The system uses speculative decoding as a memory management tool and demonstrates a 4.04x average speedup across benchmarks.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

The Missing Memory Hierarchy: Demand Paging for LLM Context Windows

Researchers developed Pichay, a demand paging system that treats LLM context windows like computer memory with hierarchical caching. The system reduces context consumption by up to 93% in production by evicting stale content and managing memory more efficiently, addressing fundamental scalability issues in AI systems.

AI · Bullish · arXiv – CS AI · Mar 6 · 7/10

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Researchers introduce AMV-L, a new memory management framework for long-running LLM systems that uses utility-based lifecycle management instead of traditional time-based retention. The system improves throughput by 3.1x and reduces latency by up to 4.7x while maintaining retrieval quality by controlling memory working-set size rather than just retention time.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10

Neural Paging: Learning Context Management Policies for Turing-Complete Agents

Researchers introduce Neural Paging, a new architecture that addresses the computational bottleneck of finite context windows in Large Language Models by implementing a hierarchical system that decouples reasoning from memory management. The approach reduces computational complexity from O(N²) to O(N·K²) for long-horizon reasoning tasks, potentially enabling more efficient AI agents.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Researchers introduce FreeKV, a training-free optimization framework that dramatically improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.

$NEAR

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

Contextual Memory Virtualisation: DAG-Based State Management and Structurally Lossless Trimming for LLM Agents

Researchers introduce Contextual Memory Virtualisation (CMV), a system that preserves LLM understanding across extended sessions by treating context as version-controlled state using DAG-based management. The system includes a trimming algorithm that reduces token counts by 20-86% while preserving all user interactions, demonstrating particular efficiency in tool-use sessions.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Learning What Not to Forget: Long-Horizon Agent Memory from a Few Kilobytes of Learning

Researchers present LRE (Learned Relevance Eviction), a lightweight memory management system for long-running language model agents that intelligently decides which historical information to retain when context windows fill up. The approach uses a small, CPU-based scorer to identify critical details like access tokens and task-relevant information, achieving comparable accuracy to keeping full history while reducing peak context size by up to 52% and requiring significantly fewer computational calls.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

Researchers introduce CICL, a decision-aware context layer that improves how language model agents select and compress relevant information for tool use. By scoring evidence based on action criticality and packing high-utility data as typed memory cards, the system achieves significant performance gains on code retrieval benchmarks, raising hit rates from 58% to 78% on SWE-bench tasks.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

Researchers propose MemGate, a security-focused plugin that addresses critical vulnerabilities in personal AI agent memory systems. While semantic similarity-based memory retrieval improves personalization, it can inadvertently enable cross-domain data leakage, jailbreaks, and erratic behavior—risks that MemGate mitigates through task-conditioned memory filtering without requiring LLM modifications.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

Researchers introduce SAGE, a memory management system for agentic LLMs that uses novelty detection to efficiently control when new facts are added, merged, or ignored. The approach reduces API costs and latency by 3.4× and 2.5× respectively while maintaining quality, addressing a critical gap in write-side memory control for long-context AI agents.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Context Distillation as Latent Memory Management

Researchers propose a novel approach to context distillation that treats compressed contextual information as a latent memory management problem, using modular LoRA adapters with intelligent retrieval and self-gating mechanisms to improve efficiency and robustness in machine learning systems.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

Researchers present KV-RM, a runtime optimization that manages KV-cache memory movement in static-graph LLM decoders, achieving better throughput and reduced latency variability without sacrificing the predictability benefits of static graph execution. The approach decouples logical KV histories from physical storage through a block pager and merge-staged transport mechanism, demonstrating practical improvements on multi-GPU systems.

🏢 Nvidia

AI · Neutral · arXiv – CS AI · May 11 · 6/10

MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

Researchers introduce MemoRepair, a system that addresses cascade failures in agentic memory by preventing stale or invalidated information from corrupting downstream AI agent decisions. Using a barrier-first approach and graph-based optimization, the system reduces invalid memory exposure from 69-94% to 0% while maintaining 91-94% of valid successor states with significantly lower repair costs.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

SAVEMem is a training-free framework that improves real-time video understanding by incorporating semantic awareness into memory management rather than relying solely on visual similarity. The system achieves significant performance gains on streaming video benchmarks while reducing GPU memory consumption by 48%, demonstrating practical advances in efficient AI model inference.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents

MemRouter is a new memory management system for conversational AI agents that uses lightweight embedding-based routing instead of expensive LLM generation to decide which conversation turns to store. The approach achieves an F1 score of 52.0 versus 45.6 for LLM-based alternatives while reducing latency from 970 ms to 58 ms, suggesting that memory admission can be learned effectively through supervised classification rather than generative models.

Page 1 of 2