Reasoning Graphs: Self-Improving, Deterministic RAG through Evidence-Centric Feedback
Researchers introduce reasoning graphs, a persistent knowledge structure that improves language model reasoning accuracy by storing and reusing chains of thought tied to evidence items. The system achieves 47% error reduction on multi-hop questions and maintains deterministic outputs without model retraining, using only context engineering.
Reasoning graphs represent a meaningful advancement in retrieval-augmented generation (RAG) systems by addressing a fundamental inefficiency: language models discard their reasoning process after each query, forcing the system to reason from scratch repeatedly. This new architecture persists structured chains of thought as graph edges connected to specific evidence items, creating an institutional memory that improves systematically as the system encounters repeated query patterns.
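The core data structure can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual API: names like `ReasoningEdge` and `ReasoningGraph` are invented here, and the real system presumably stores richer metadata. The key idea it shows is that each persisted chain of thought is an edge keyed by the specific evidence item it evaluated.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ReasoningEdge:
    """One persisted reasoning step, tied to the evidence it evaluated."""
    query_id: str
    evidence_id: str
    chain_of_thought: str   # the model's reasoning about this evidence item
    verdict: str            # e.g. "supports", "refutes", "irrelevant"

@dataclass
class ReasoningGraph:
    """Institutional memory: edges indexed by the evidence item they touch."""
    edges_by_evidence: dict[str, list[ReasoningEdge]] = field(default_factory=dict)

    def record(self, edge: ReasoningEdge) -> None:
        # Persist the chain of thought under its evidence item, so later
        # queries that retrieve the same item can reuse it.
        self.edges_by_evidence.setdefault(edge.evidence_id, []).append(edge)

    def history_for(self, evidence_id: str) -> list[ReasoningEdge]:
        """All prior evaluations of this specific evidence item, across runs."""
        return self.edges_by_evidence.get(evidence_id, [])
```

Indexing by `evidence_id` rather than by query is what turns discarded per-query reasoning into memory that accumulates as the same evidence recurs.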
The research builds on growing recognition that RAG systems suffer from inconsistency and poor generalization. While previous memory mechanisms retrieve distilled strategies by query similarity, reasoning graphs enable evidence-centric feedback—when new evidence arrives, the system traverses all prior evaluations of that specific item across historical runs. This shift from query-based to evidence-based retrieval fundamentally changes how systems learn from experience. Coupled with retrieval graphs that optimize the candidate pipeline, the system creates a self-improving loop that requires no model retraining, operating entirely through intelligent context engineering.
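The contrast between query-based and evidence-based retrieval can be made concrete with a toy memory. Everything below is illustrative (the entries, the naive term-overlap matcher, and both function names are assumptions, not the paper's implementation); the point is only that keying on the evidence item surfaces prior evaluations even from unrelated queries.

```python
# Toy memory of prior runs: (query, evidence_id, distilled_reasoning)
memory = [
    ("Who founded X?", "doc:17", "doc:17 names the founder directly"),
    ("When was X founded?", "doc:17", "doc:17 gives incorporation date, not founding"),
    ("Who founded Y?", "doc:42", "doc:42 is a press release; low reliability"),
]

def query_based(query: str) -> list[str]:
    """Prior approach: retrieve reasoning from queries sharing surface terms."""
    terms = set(query.lower().split())
    return [r for q, _, r in memory if terms & set(q.lower().split())]

def evidence_based(evidence_id: str) -> list[str]:
    """Evidence-centric feedback: every prior evaluation of this exact
    evidence item, regardless of which query produced it."""
    return [r for _, e, r in memory if e == evidence_id]

# A new query ("What year did X start?") shares few terms with past queries,
# so query similarity misses; but if retrieval returns doc:17, the
# evidence-centric lookup recovers both earlier evaluations of that item.
print(evidence_based("doc:17"))
```

In this sketch, `evidence_based("doc:17")` returns two reasoning traces, including the caveat learned under a differently worded query, which is exactly the cross-query transfer the paper attributes to evidence-centric feedback.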
The empirical results demonstrate substantial practical value. At 50%+ evidence coverage, the system reduces errors by 47% on repeated questions compared to vanilla RAG (p < 0.0001), with particularly strong gains on complex multi-hop reasoning (+11 percentage points on 4-hop tasks). In high-reuse deployment scenarios, the system is Pareto-dominant: highest accuracy alongside 47% lower computational cost and 46% lower latency. Verdict consistency improves by 7-8 percentage points, with all 11 hard probes reaching perfect consistency.
This work signals growing maturity in LLM engineering. Rather than chasing architectural breakthroughs, researchers increasingly focus on intelligent context management and graph-based memory systems that compound improvements over time. For production systems handling repeated query patterns, this architecture offers immediate efficiency gains alongside accuracy improvements.
- Reasoning graphs persist chains of thought as evidence-linked structures, enabling systematic accuracy improvements without model retraining.
- Evidence-centric feedback reduces errors by 47% on multi-hop questions and improves accuracy by 11 percentage points on 4-hop reasoning tasks.
- High-reuse deployment scenarios achieve 47% cost reduction and 46% latency improvement alongside highest accuracy, demonstrating practical production value.
- The system eliminates verdict-level variance through deterministic outputs while maintaining temperature flexibility, addressing consistency problems in current RAG systems.
- All gains come from context engineering through graph traversal, avoiding the computational overhead of model retraining or fine-tuning approaches.