Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
Researchers introduce Engram, an open-source memory engine for LLM agents that reaches 83.6% accuracy on LongMemEval_S with about 9.6k tokens of retrieved context, versus roughly 79k tokens for a full-history baseline. The result suggests that selective retrieval can outperform exhaustive context replay while using about 8x fewer tokens.
Engram addresses a critical limitation in large language model agents: the inability to maintain accurate long-term memory across sessions without storing entire conversation histories. Traditional approaches either lose information when sessions end or replay the full history, which creates computational bottlenecks and can lower accuracy because irrelevant details act as distractors. The system's dual-process architecture separates the two concerns. Fast writes capture raw episodes without LLM involvement, while asynchronous processing extracts structured facts into a bi-temporal knowledge graph that preserves provenance chains and handles contradictions through invalidation rather than deletion.
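The idea of invalidation rather than deletion can be illustrated with a minimal bi-temporal fact store. This is a sketch under stated assumptions, not Engram's implementation: the `Fact` and `FactStore` names, the field names, and the rule that a predicate holds a single value per subject are all hypothetical choices made for illustration. Each fact carries a validity interval (when it was true in the world), a record time (when the system learned it), and the episode it came from.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Fact:
    # All names here are hypothetical; the source does not describe Engram's schema.
    subject: str
    predicate: str
    obj: str
    valid_from: datetime                       # when the fact became true in the world
    recorded_at: datetime                      # when the system learned the fact
    episode_id: str                            # provenance: the raw episode it came from
    valid_to: Optional[datetime] = None        # set when a later fact contradicts this one
    invalidated_at: Optional[datetime] = None  # when that contradiction was recorded

    def visible(self, valid_at: datetime, known_at: datetime) -> bool:
        """True if the system, as of known_at, believed this fact held at valid_at."""
        if self.recorded_at > known_at or self.valid_from > valid_at:
            return False
        # The end of validity only counts if it had already been recorded at known_at.
        closed = (
            self.valid_to is not None
            and self.invalidated_at is not None
            and self.invalidated_at <= known_at
        )
        return not (closed and valid_at >= self.valid_to)


class FactStore:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def assert_fact(self, subject, predicate, obj, valid_from, recorded_at, episode_id):
        # Assumption: one value per (subject, predicate). A contradicting fact
        # closes the older fact's validity interval instead of deleting it.
        for f in self.facts:
            if (f.subject, f.predicate) == (subject, predicate) \
                    and f.valid_to is None and f.obj != obj:
                f.valid_to = valid_from
                f.invalidated_at = recorded_at
        self.facts.append(Fact(subject, predicate, obj, valid_from, recorded_at, episode_id))

    def as_of(self, subject, predicate, valid_at, known_at) -> List[Fact]:
        # Point-in-time query over both time axes.
        return [
            f for f in self.facts
            if (f.subject, f.predicate) == (subject, predicate)
            and f.visible(valid_at, known_at)
        ]


store = FactStore()
store.assert_fact("alice", "lives_in", "Paris",
                  datetime(2024, 1, 1), datetime(2024, 1, 1), "ep-1")
store.assert_fact("alice", "lives_in", "Berlin",
                  datetime(2024, 6, 1), datetime(2024, 6, 1), "ep-7")
current = store.as_of("alice", "lives_in", datetime(2024, 7, 1), datetime(2024, 7, 1))
```

In this sketch the superseded fact stays in the store with its provenance intact, so a query can ask either what was true at a past date or what the system believed at a past date.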
Engram's performance gains stem from its hybrid read path, which fuses several signal types: dense embeddings, lexical matching, graph topology, and temporal recency. By applying point-in-time filtering, the system retrieves only contextually relevant information, which cuts token consumption and, counterintuitively, improves accuracy. The 10.4-point improvement over full-context baselines on LongMemEval_S is statistically significant under McNemar's test (p < 10^-6), which suggests a real effect rather than a measurement artifact.
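One common way to fuse ranked lists from heterogeneous retrievers is reciprocal rank fusion (RRF). The summary does not say which fusion method Engram uses, so the sketch below is an illustrative stand-in, not the authors' method. The function name, the signal names, the fact identifiers, and the constant `k = 60` (a conventional RRF default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: Dict[str, List[str]],
                           k: int = 60, top_n: int = 5) -> List[str]:
    """Fuse several ranked lists of item ids into a single ranking.

    Each signal contributes 1 / (k + rank) for every item it returns, so items
    ranked well by several signals rise to the top. Illustrative only: this is
    not claimed to be Engram's fusion rule.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))[:top_n]


# Hypothetical per-signal rankings over candidate fact ids.
rankings = {
    "dense":   ["f3", "f1", "f2"],
    "lexical": ["f1", "f3", "f4"],
    "graph":   ["f1", "f2"],
    "recency": ["f4", "f1", "f3"],
}
fused = reciprocal_rank_fusion(rankings, top_n=3)
print(fused)  # ['f1', 'f3', 'f4']
```

Truncating the fused list to `top_n` is what keeps the retrieved context lean: only the items that several signals agree on reach the prompt.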
For the AI infrastructure industry, this work signals a maturation in memory systems engineering. Rather than competing primarily on cost or latency, Engram demonstrates that intelligent retrieval beats naive concatenation. The contribution extends beyond Engram itself: the authors publish a neutral evaluation harness with the official judge included, raw per-question logs, and reproduction commands. This emphasis on measurement integrity directly challenges benchmark inflation in the field, where unreproducible configurations allow systems to report wildly inconsistent scores across sources.
- →Lean retrieved context (9.6k tokens) outperforms full-history baselines by 10.4 percentage points on the LongMemEval_S benchmark
- →Bi-temporal knowledge graph with provenance tracking eliminates the need for per-fact LLM calls while handling contradictions
- →Hybrid read path combining dense, lexical, graph, and temporal signals proves essential: relying on extracted facts alone loses recall
- →Open-source evaluation harness with reproducible commands addresses critical measurement integrity issues in memory benchmarks
- →8x reduction in token usage demonstrates practical deployment advantages for cost-sensitive production systems