AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce Deep Optimizer States, a technique that relieves GPU memory pressure during large language model training by dynamically offloading optimizer state between host and GPU memory during computation cycles. The method achieves 2.5× faster iterations than existing approaches by better managing the memory fluctuations inherent in transformer training pipelines.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers developed LightThinker++, a new framework that enables large language models to compress intermediate reasoning thoughts and manage memory more efficiently. The system reduces peak token usage by up to 70% while improving accuracy by 2.42% and maintaining performance over extended reasoning tasks.
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers developed ODMA, a new memory allocation strategy that improves Large Language Model serving performance on memory-constrained accelerators by up to 27%. The technique addresses bandwidth limitations in LPDDR systems through adaptive bucket partitioning and dynamic generation-length prediction.
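A minimal sketch of length-aware bucket allocation in general, assuming fixed power-of-two size classes and a generation-length prediction that is simply passed in as a number. ODMA's actual bucket boundaries, adaptive repartitioning, and predictor are not described in the summary, so `BUCKETS` and `pick_bucket` are hypothetical.

```python
import bisect

# Hypothetical KV-cache slot size classes, in tokens (not ODMA's real values).
BUCKETS = [512, 1024, 2048, 4096, 8192]

def pick_bucket(prompt_len: int, predicted_gen_len: int, buckets=BUCKETS) -> int:
    """Return the smallest bucket that fits the prompt plus the predicted output."""
    need = prompt_len + predicted_gen_len
    i = bisect.bisect_left(buckets, need)  # first bucket >= need
    if i == len(buckets):
        raise ValueError(f"request needs {need} tokens, more than the largest bucket")
    return buckets[i]
```

Sizing a slot from the predicted total length, rather than the maximum context length, is what lets a small memory pool hold more concurrent requests; a misprediction would need a fallback such as migrating the request to a larger bucket, which this sketch leaves out.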
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce StatePlane, a model-agnostic cognitive state management system that enables AI systems to maintain coherent reasoning over long interaction horizons without expanding context windows or retraining models. The system uses episodic, semantic, and procedural memory mechanisms inspired by cognitive psychology to overcome current limitations in large language models.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Justitia is a new scheduling system for task-parallel LLM agents that optimizes GPU server performance through selective resource allocation based on completion order prediction. The system uses memory-centric cost quantification and virtual-time fair queuing to achieve both efficiency and fairness in LLM serving environments.
🏢 Meta
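Virtual-time fair queuing can be illustrated with self-clocked fair queuing (SCFQ), a standard scheduler swapped in here for illustration; it is not necessarily Justitia's exact algorithm. The `cost` argument stands in for a memory-centric cost estimate, and `FairQueue`, `submit`, and `pop` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    finish_tag: float                   # virtual finish time; primary sort key
    seq: int                            # submission order; breaks ties
    agent: str = field(compare=False)
    name: str = field(compare=False)

class FairQueue:
    """Self-clocked fair queuing: serve jobs in order of virtual finish time."""

    def __init__(self):
        self.vtime = 0.0        # global virtual clock
        self.last_finish = {}   # agent -> finish tag of its most recent job
        self.heap = []
        self.seq = 0

    def submit(self, agent, name, cost, weight=1.0):
        # A job starts no earlier than the virtual clock or its agent's previous job.
        start = max(self.vtime, self.last_finish.get(agent, 0.0))
        finish = start + cost / weight
        self.last_finish[agent] = finish
        heapq.heappush(self.heap, Job(finish, self.seq, agent, name))
        self.seq += 1

    def pop(self):
        job = heapq.heappop(self.heap)
        self.vtime = job.finish_tag  # SCFQ: clock = tag of the job now in service
        return job.agent, job.name
```

If agent A queues three jobs and agent B then queues one of equal cost, B's job receives the same finish tag as A's first job, so B is served second instead of waiting behind all of A's work.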
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers introduce MoE-SpAc, a new framework for efficient Mixture-of-Experts model inference on edge devices that achieves a 42% improvement over existing baselines. The system uses speculative decoding as a memory management tool and demonstrates a 4.04× average speedup across benchmarks.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers developed Pichay, a demand paging system that treats LLM context windows like computer memory with hierarchical caching. The system reduces context consumption by up to 93% in production by evicting stale content and managing memory more efficiently, addressing fundamental scalability issues in AI systems.
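Demand paging over context can be sketched as an LRU cache of context segments under a token budget: a miss "faults" the segment in from a backing store and evicts the least recently used segments until it fits. This is an illustration under stated assumptions, not Pichay's design; its hierarchy, eviction policy, and tokenizer are not described here, whitespace word counts stand in for tokens, and `ContextPager` is hypothetical.

```python
from collections import OrderedDict

class ContextPager:
    """Context segments paged in on demand under a token budget, with LRU eviction."""

    def __init__(self, budget_tokens, backing_store):
        self.budget = budget_tokens
        self.store = backing_store      # segment_id -> text (the slower tier)
        self.resident = OrderedDict()   # segment_id -> size, least recently used first
        self.used = 0
        self.faults = 0

    def access(self, seg_id):
        if seg_id in self.resident:
            self.resident.move_to_end(seg_id)   # hit: now the most recently used
            return self.store[seg_id]
        self.faults += 1                        # miss: "page fault", load on demand
        size = len(self.store[seg_id].split())  # whitespace words stand in for tokens
        # Evict least recently used segments until the new one fits.
        # (A single segment larger than the whole budget is still admitted alone.)
        while self.resident and self.used + size > self.budget:
            _, freed = self.resident.popitem(last=False)
            self.used -= freed
        self.resident[seg_id] = size
        self.used += size
        return self.store[seg_id]

    def window(self):
        """Segment ids currently in context, least recently used first."""
        return list(self.resident)
```

With a 7-token budget, touching `a` (3 words) and `b` (2), re-reading `a`, then loading `c` (4) evicts only `b`, the stalest segment.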
AI · Bullish · arXiv – CS AI · Mar 6 · 7/10
🧠Researchers introduce AMV-L, a new memory management framework for long-running LLM systems that uses utility-based lifecycle management instead of traditional time-based retention. The system improves throughput by 3.1x and reduces latency by up to 4.7x while maintaining retrieval quality by controlling memory working-set size rather than just retention time.
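The contrast between utility-based and time-based retention can be sketched as a fixed-capacity store that evicts the item with the lowest utility score, where scores rise on use and decay every step. The scoring rule (`hit` adds a reward, `tick` multiplies by `decay`) and the `UtilityMemory` class are hypothetical stand-ins; the summary does not say how AMV-L computes utility.

```python
class UtilityMemory:
    """Fixed-capacity memory that evicts the lowest-utility item (hypothetical scoring)."""

    def __init__(self, capacity, decay=0.9):
        self.capacity = capacity   # assumed >= 1
        self.decay = decay
        self.utility = {}          # item -> current utility score

    def tick(self):
        # Decay every score once per step, so items that stop being used lose value.
        for item in self.utility:
            self.utility[item] *= self.decay

    def hit(self, item, reward=1.0):
        # A retrieval that proved useful raises the item's utility.
        if item in self.utility:
            self.utility[item] += reward

    def add(self, item, initial=1.0):
        self.utility[item] = initial
        if len(self.utility) > self.capacity:
            # Evict the least useful existing item; the newcomer gets a chance to prove itself.
            victim = min((k for k in self.utility if k != item), key=self.utility.get)
            del self.utility[victim]
```

The working-set size is bounded by `capacity` no matter how long the system runs, and an item's survival depends on its usefulness rather than its age.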
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠Researchers introduce Neural Paging, a new architecture that addresses the computational bottleneck of finite context windows in Large Language Models by implementing a hierarchical system that decouples reasoning from memory management. The approach reduces computational complexity from O(N²) to O(N·K²) for long-horizon reasoning tasks, potentially enabling more efficient AI agents.
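A back-of-the-envelope comparison of the two stated complexities, ignoring constant factors. The summary does not define K, so this assumes K is a fixed bound (for example, a per-step working-set size) that does not grow with N; N = 10^5 and K = 64 are arbitrary illustrative values.

```latex
\frac{N^2}{N K^2} = \frac{N}{K^2},
\qquad N = 10^5,\; K = 64:\quad
N^2 = 10^{10},\quad
N K^2 = 10^5 \cdot 4096 \approx 4.1 \times 10^{8},\quad
\frac{N}{K^2} \approx 24.4
```

The ratio N/K² shows the saving grows linearly with the horizon length N, and that the O(N·K²) form is only cheaper while K² < N.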
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce FreeKV, a training-free optimization framework that improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers introduce Contextual Memory Virtualisation (CMV), a system that preserves LLM understanding across extended sessions by treating context as version-controlled state using DAG-based management. The system includes a trimming algorithm that reduces token counts by 20-86% while preserving all user interactions, demonstrating particular efficiency in tool-use sessions.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers present KV-RM, a runtime optimization that manages KV-cache memory movement in static-graph LLM decoders, achieving better throughput and reduced latency variability without sacrificing the predictability benefits of static graph execution. The approach decouples logical KV histories from physical storage through a block pager and merge-staged transport mechanism, demonstrating practical improvements on multi-GPU systems.
🏢 Nvidia
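The logical-to-physical split can be sketched as a per-sequence block table, the same idea used by paged KV caches such as vLLM's PagedAttention: each sequence sees contiguous logical blocks, while the physical blocks it occupies come from a shared free list. `BlockPager` and its methods are hypothetical, and the sketch ignores the static-graph and merge-staged transport aspects of KV-RM.

```python
class BlockPager:
    """Maps each sequence's logical KV blocks to physical block slots."""

    def __init__(self, num_physical_blocks, block_size):
        self.block_size = block_size
        # Stack of free physical block ids; pop() hands out 0, 1, 2, ... first.
        self.free = list(range(num_physical_blocks - 1, -1, -1))
        self.tables = {}    # seq_id -> physical block ids, indexed by logical block
        self.lengths = {}   # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def locate(self, seq_id, pos):
        """Physical (block, offset) holding token `pos` of `seq_id`."""
        return self.tables[seq_id][pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        # Return the sequence's blocks to the pool for reuse by other sequences.
        self.free.extend(reversed(self.tables.pop(seq_id)))
        del self.lengths[seq_id]
```

The decoder can index a sequence's history by logical position while the physical placement changes underneath, which is what allows memory to be reused across sequences without changing the shape of the computation.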
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce MemoRepair, a system that addresses cascade failures in agentic memory by preventing stale or invalidated information from corrupting downstream AI agent decisions. Using a barrier-first approach and graph-based optimization, the system reduces invalid memory exposure from 69-94% to 0% while maintaining 91-94% of valid successor states with significantly lower repair costs.
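A barrier-first response to an invalidated memory can be sketched as a reachability pass over a derivation graph: everything transitively derived from the bad entry is hidden before any repair starts. The adjacency-dict representation and `quarantine` are hypothetical, and this version is deliberately conservative; keeping valid successors available, as MemoRepair reports doing for 91-94% of them, requires the finer-grained graph optimization this sketch omits.

```python
from collections import deque

def quarantine(derived_from, invalid):
    """Return every memory reachable from `invalid` via derivation edges.

    derived_from maps a memory id to the ids of memories derived from it.
    The caller would hide this whole set from agents (the "barrier")
    before deciding which entries to repair or restore.
    """
    blocked = set(invalid)
    queue = deque(invalid)
    while queue:  # breadth-first traversal of downstream dependents
        node = queue.popleft()
        for child in derived_from.get(node, ()):
            if child not in blocked:
                blocked.add(child)
                queue.append(child)
    return blocked
```

Blocking first guarantees that no agent reads a stale successor while repair is in progress; the cost is that valid successors are also hidden until they are re-validated.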
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠SAVEMem is a training-free framework that improves real-time video understanding by incorporating semantic awareness into memory management rather than relying solely on visual similarity. The system achieves significant performance gains on streaming video benchmarks while reducing GPU memory consumption by 48%, demonstrating practical advances in efficient AI model inference.
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠MemRouter is a new memory management system for conversational AI agents that uses lightweight embedding-based routing instead of expensive LLM generation to decide which conversation turns to store. The approach achieves an F1 score of 52.0, versus 45.6 for LLM-based alternatives, while reducing latency from 970 ms to 58 ms, suggesting that memory admission can be learned effectively through supervised classification rather than generative models.
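Treating memory admission as supervised classification can be sketched with logistic regression over turn embeddings: at decision time a cheap dot product replaces an LLM call. The tiny 2-D vectors stand in for real sentence embeddings, and `train_admission` and `admit` are hypothetical; MemRouter's model, features, and training data are not described in the summary.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_admission(X, y, lr=0.5, epochs=200):
    """Fit logistic-regression weights (w, b) with SGD on embeddings X and 0/1 labels y."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi  # gradient of the log-loss with respect to the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def admit(embedding, w, b, threshold=0.5):
    """True if the turn should be written to long-term memory."""
    return _sigmoid(sum(wj * xj for wj, xj in zip(w, embedding)) + b) >= threshold
```

Training happens offline on labeled turns; serving cost is one embedding lookup plus a dot product, which is the kind of saving behind the latency gap the summary reports.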
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce TiMem, a temporal-hierarchical memory framework that helps conversational AI agents manage long conversation histories beyond LLM context limits. The system organizes interactions through a Temporal Memory Tree, achieving state-of-the-art performance on memory recall benchmarks while reducing memory overhead by over 50%.
AI · Bullish · MarkTechPost · Mar 15 · 7/10
🧠OpenViking is an open-source context database from Volcengine that changes how AI agents manage context by organizing it through a filesystem paradigm rather than as flat text chunks. The system aims to make memory, resources, and skills manageable through a unified architecture for AI agent systems such as OpenClaw.
AI · Neutral · Microsoft Research Blog · Mar 10 · 6/10
🧠Microsoft Research highlights a counterintuitive problem where giving AI agents more memory actually reduces their effectiveness. As interaction logs accumulate, they become large, filled with irrelevant content, and difficult to search through, making it harder for agents to find relevant information for current tasks.
AI · Bullish · arXiv – CS AI · Mar 6 · 6/10
🧠Researchers propose Adaptive Memory Admission Control (A-MAC), a new framework for managing long-term memory in LLM-based agents. The system improves memory precision-recall by 31% while reducing latency through structured decision-making based on five interpretable factors rather than opaque LLM-driven policies.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers have developed Semantic XPath, a tree-structured memory system for conversational AI that improves performance by 176.7% over traditional methods while using only 9.1% of the tokens. The system addresses scalability issues in long-term AI conversations by efficiently accessing and updating structured memory instead of appending growing conversation history.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers propose ActMem, a novel memory framework for LLM agents that combines memory retrieval with active causal reasoning to handle complex decision-making scenarios. The framework transforms dialogue history into structured causal graphs and uses counterfactual reasoning to resolve conflicts between past states and current intentions, significantly outperforming existing baselines in memory-dependent tasks.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers introduce AMemGym, an interactive benchmarking environment for evaluating and optimizing memory management in long-horizon conversations with AI assistants. The framework addresses limitations in current memory evaluation methods by enabling on-policy testing with LLM-simulated users and revealing performance gaps in existing memory systems like RAG and long-context LLMs.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠OrbitFlow is a new KV cache management system for long-context LLM serving that uses adaptive memory allocation and fine-grained optimization to improve performance. The system achieves up to 66% better SLO attainment and 3.3x higher throughput by dynamically managing GPU memory usage during token generation.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers have introduced PiKV, an open-source KV cache management framework designed to optimize memory and communication costs for Mixture of Experts (MoE) language models across multi-GPU and multi-node inference. The system uses expert-sharded storage, intelligent routing, adaptive scheduling, and compression to improve efficiency in large-scale AI model deployment.
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10
🧠Researchers from PKU-SEC-Lab have developed KEEP, a new memory management system that significantly improves the efficiency of AI-powered embodied planning by optimizing KV cache usage. The system achieves 2.68x speedup compared to text-based memory methods while maintaining accuracy, addressing a key bottleneck in memory-augmented Large Language Models for complex planning tasks.