MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs
Researchers introduce MemQ, a novel framework that applies Q-learning eligibility traces to episodic memory in large language model agents, enabling credit assignment across memory dependencies recorded in provenance DAGs. The approach achieves superior performance across six diverse benchmarks, with gains up to 5.7 percentage points on multi-step tasks requiring deep memory chains.
MemQ addresses a fundamental limitation in current LLM agent architectures: the inability to properly evaluate which memories contribute to successful future outcomes. Traditional episodic memory systems treat each stored experience in isolation, missing the critical insight that memories form dependency chains where one memory enables the creation of another. By mapping these relationships through provenance directed acyclic graphs, MemQ applies temporal difference learning concepts to assign credit backward through memory hierarchies, with influence decaying based on structural distance rather than arbitrary temporal windows.
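The backward credit assignment described above can be sketched as follows. This is a minimal illustration, not MemQ's actual implementation: the function name, data structures, and default `gamma`/`lam` values are all assumptions, with credit decaying by a factor of `gamma * lam` per hop of structural distance in the provenance DAG.

```python
# Hedged sketch of DAG-based credit assignment; all names are illustrative.
from collections import defaultdict, deque

def propagate_credit(parents, scores, leaf, reward, gamma=0.9, lam=0.8):
    """Propagate a reward backward from `leaf` through ancestor memories.

    parents: dict mapping memory id -> list of parent memory ids
             (a provenance edge means the parent enabled the child).
    scores:  dict of running value estimates per memory, updated in place.
    Credit decays by (gamma * lam) per structural hop, mirroring
    eligibility traces but keyed on DAG distance rather than time.
    """
    # BFS over ancestors, keeping the shortest structural distance to each.
    frontier = deque([(leaf, 0)])
    seen = {}
    while frontier:
        node, depth = frontier.popleft()
        if node in seen and seen[node] <= depth:
            continue
        seen[node] = depth
        for p in parents.get(node, []):
            frontier.append((p, depth + 1))
    # Discounted credit: full reward at the leaf, decayed shares upstream.
    for node, depth in seen.items():
        scores[node] += (gamma * lam) ** depth * reward
    return scores

# Toy chain: m1 enabled m2, m2 enabled m3; reward observed after using m3.
parents = {"m3": ["m2"], "m2": ["m1"], "m1": []}
scores = defaultdict(float)
propagate_credit(parents, scores, leaf="m3", reward=1.0)
print(dict(scores))  # m3 gets full credit; m2 and m1 get decayed shares
```

Because the DAG may contain multiple paths to an ancestor, this sketch keeps the shortest structural distance, so a memory is credited by its closest dependency link.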
This work emerges from the broader effort to make LLM agents more capable of learning and improving from experience. As language models increasingly serve as autonomous agents in complex environments—from operating systems to code generation—their ability to effectively leverage accumulated knowledge becomes essential for performance gains. The Exogenous-Context MDP formalization elegantly separates task streams from internal memory dynamics, providing theoretical grounding for the empirical approach.
The experimental validation across six distinct domains demonstrates genuine generalization: operating system interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA. The performance gains scale meaningfully with task complexity—largest improvements appear in multi-step scenarios generating rich provenance chains, while single-step classification tasks see minimal gains. This pattern validates the core hypothesis that credit assignment through memory dependencies drives improvement.
The framework's parameterization through gamma and lambda decay factors opens pathways for continued optimization. Future research will likely explore dynamic parameter adjustment based on task structure, integration with retrieval-augmented generation systems, and scaling to agents with extensive memory archives. The promised code release may accelerate adoption across AI research communities developing more sophisticated autonomous agents.
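To see why the gamma and lambda settings matter, one can compare how far credit reaches under different decay products. The snippet below is an assumption-laden illustration (the exact decay law in MemQ may differ): it treats effective credit at structural depth d as (gamma * lam)^d, in the style of TD(lambda) eligibility decay.

```python
# Hedged sketch: how the gamma * lam product shapes credit reach.
# The (gamma * lam) ** d decay law is an assumption mirroring TD(lambda).

def credit_weights(gamma, lam, max_depth=4):
    """Effective credit weight at each structural depth up to max_depth."""
    return [round((gamma * lam) ** d, 4) for d in range(max_depth + 1)]

shallow = credit_weights(gamma=0.9, lam=0.5)   # fast decay: credit stays local
deep = credit_weights(gamma=0.95, lam=0.9)     # slow decay: credit reaches far ancestors
print(shallow)
print(deep)
```

Under this reading, tuning gamma and lambda amounts to choosing how deep into a provenance chain a downstream success should be felt, which is why multi-step tasks with long chains are the most sensitive to these parameters.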
- MemQ uses Q-learning eligibility traces to propagate credit through memory provenance DAGs, connecting memories that enable future memory creation
- Performance gains range from +0.77pp on single-step tasks to +5.7pp on multi-step tasks with deep memory chains, validating the dependency-based approach
- The Exogenous-Context MDP formalization separates exogenous task streams from endogenous memory stores, enabling principled credit assignment
- Evaluation across six benchmarks spanning OS interaction, code generation, and expert QA demonstrates broad applicability and generalization capability
- Parameter guidance for gamma and lambda decay factors provides actionable insights for practitioners implementing memory-augmented LLM agents