TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories
TraceGraph is a new graph-based framework that analyzes multi-model agent trajectories to build shared decision landscapes, revealing how different AI models navigate the same tasks. The tool identifies failure regions and trap states, enabling targeted improvements that raised SWE-bench resolved rates by 3 to 4.8 percentage points and showing that aggregate benchmark scores mask critical performance divergences.
TraceGraph addresses a fundamental limitation in AI agent evaluation: the reduction of complex interaction trajectories to single metrics like pass rates or reward scores. The framework reconstructs the decision-making landscape across multiple model rollouts, creating a unified map on which different agents' paths can be compared. By identifying 'trap regions' (states where models consistently fail) and 'productive cores' (pathways that lead to success), the research reveals that benchmark splits incentivize different failure-recovery patterns. Some environments reward trap avoidance while others reward recovery capability, a distinction that is invisible in aggregate metrics.
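To make the idea of a shared landscape concrete, here is a minimal Python sketch. It is not TraceGraph's actual implementation: the state labels, the toy trajectories, the `classify_states` helper, and the 0.8 / 0.2 thresholds are all assumptions chosen for illustration. It pools rollouts from several models into one set of states, then labels a state as a trap when most rollouts passing through it fail and as part of a productive core when most succeed.

```python
from collections import defaultdict

# Hypothetical rollouts: (model name, sequence of abstract states, resolved?).
# The state labels are invented for illustration only.
TRAJECTORIES = [
    ("model_a", ["read_issue", "locate_file", "edit", "run_tests"], True),
    ("model_b", ["read_issue", "locate_file", "edit", "run_tests"], True),
    ("model_b", ["read_issue", "grep_loop", "grep_loop", "give_up"], False),
    ("model_c", ["read_issue", "locate_file", "grep_loop", "give_up"], False),
]


def classify_states(trajectories, min_visits=2, trap_rate=0.8, core_rate=0.2):
    """Pool all rollouts into one landscape and label its states.

    A state counts once per rollout that passes through it. States visited
    by fewer than `min_visits` rollouts are left unlabeled.
    """
    visits = defaultdict(int)    # rollouts that touched each state
    failures = defaultdict(int)  # of those, how many ended unresolved
    edges = defaultdict(int)     # transition counts in the shared graph

    for _model, states, resolved in trajectories:
        for state in set(states):
            visits[state] += 1
            if not resolved:
                failures[state] += 1
        for src, dst in zip(states, states[1:]):
            edges[(src, dst)] += 1

    traps, cores = [], []
    for state, count in visits.items():
        if count < min_visits:
            continue
        rate = failures[state] / count
        if rate >= trap_rate:
            traps.append(state)
        elif rate <= core_rate:
            cores.append(state)
    return sorted(traps), sorted(cores), dict(edges)


traps, cores, edges = classify_states(TRAJECTORIES)
print("trap states:", traps)
print("productive core:", cores)
```

In this toy data, `read_issue` and `locate_file` appear in both successful and failed rollouts, so they fall into neither category. That is the kind of shared, ambiguous region where comparing paths across models is more informative than a single pass rate.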
This work builds on growing recognition within AI research that understanding agent behavior requires process-level analysis rather than outcome-level summary. As language models and specialized agents increasingly tackle complex tasks like software engineering, the ability to diagnose where and why models fail becomes critical for improvement. TraceGraph's application to SWE-bench demonstrates practical value: a trap-aware recovery pipeline that leverages the identified failure regions improved resolution rates from 40.4% to 43.5% on certain subsets, a gain of 3.1 percentage points. The provider-specific optimizations suggest that the landscape approach captures meaningful differences in model architecture and training.
For the AI development community, TraceGraph provides both diagnostic and predictive capabilities. Rather than treating agent failure as monolithic, the framework enables targeted interventions at failure points. The 3-4.8 percentage point improvement on SWE-bench, while modest, shows meaningful gains from understanding failure morphology. As AI agents expand into high-stakes domains, moving beyond aggregate scores to process-level understanding of failure and recovery patterns becomes essential for safe deployment.
- TraceGraph maps shared decision landscapes from multi-model trajectories, revealing failure patterns invisible in aggregate benchmark scores.
- The framework identifies trap regions and productive cores, showing that benchmark splits incentivize different failure-avoidance versus recovery strategies.
- A trap-aware recovery pipeline based on TraceGraph analysis improved SWE-bench resolution rates by 3.1-3.8 percentage points with provider-specific optimizations.
- Process-level trajectory analysis enables targeted agent improvements beyond what outcome-level metrics can diagnose.
- TraceGraph establishes a vocabulary for understanding where models diverge on shared tasks and how failure regions guide downstream development.