
TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis

arXiv – CS AI | Myeongsoo Kim, Dingmin Wang, Siwei Cui, Farima Farmahinifarahani, Shweta Garg, Baishakhi Ray, Terry Yue Zhuo, Rajdeep Mukherjee, Varun Kumar | March 27, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce TRAJEVAL, a diagnostic framework that breaks down AI code agent performance into three stages (search, read, edit) to pinpoint where an agent fails instead of reporting only a binary pass/fail outcome. The analysis covers 16,758 trajectories, and the authors report that real-time feedback based on trajectory signals improved state-of-the-art models by 2.2-4.6 percentage points while reducing costs by 20-31%.

Key Takeaways

→ TRAJEVAL decomposes AI agent trajectories into search, read, and edit stages for fine-grained performance diagnosis.
→ All analyzed agents examine approximately 22x more functions than necessary, indicating universal inefficiencies.
→ Different AI models show distinct failure patterns: GPT-5 locates code well but targets edits poorly, while Qwen-32B fails at file discovery.
→ Real-time trajectory feedback improved model performance by 2.2-4.6 percentage points while cutting costs by 20-31%.
→ The framework enables predictive analysis, achieving model-level Pass@1 prediction within 0.87-2.1% mean absolute error.
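
The stage-wise idea in the takeaways can be illustrated with a minimal Python sketch. It is not the paper's implementation: the per-stage precision and recall scoring, the `StageScore` type, and the `score_stage`, `diagnose`, and `first_weak_stage` helpers are all hypothetical names and metrics chosen for illustration, as is the toy trajectory. The sketch assumes each stage can be summarized as a set of items the agent touched (files searched, functions read, functions edited) and compared against a reference set derived from a known-good patch.

```python
from __future__ import annotations

from dataclasses import dataclass

STAGES = ("search", "read", "edit")  # stage names taken from the summary


@dataclass(frozen=True)
class StageScore:
    """Hypothetical per-stage score; not the paper's metric definitions."""
    precision: float  # share of touched items that were actually needed
    recall: float     # share of needed items that the agent touched


def score_stage(touched: set[str], needed: set[str]) -> StageScore:
    """Compare what the agent touched in one stage with what was needed."""
    hits = len(touched & needed)
    precision = hits / len(touched) if touched else 0.0
    recall = hits / len(needed) if needed else 1.0
    return StageScore(precision, recall)


def diagnose(trajectory: dict[str, set[str]],
             reference: dict[str, set[str]]) -> dict[str, StageScore]:
    """Score every stage instead of returning a single pass/fail bit."""
    return {
        stage: score_stage(trajectory.get(stage, set()), reference[stage])
        for stage in STAGES
    }


def first_weak_stage(scores: dict[str, StageScore],
                     min_recall: float = 1.0) -> str | None:
    """Return the earliest stage whose recall falls below the threshold."""
    for stage in STAGES:
        if scores[stage].recall < min_recall:
            return stage
    return None


# Toy data (assumed): the agent finds the right file and function,
# reads more than it needs, and then edits the wrong function.
reference = {
    "search": {"a.py"},
    "read": {"a.py::f"},
    "edit": {"a.py::f"},
}
trajectory = {
    "search": {"a.py", "b.py", "c.py"},
    "read": {"a.py::f", "a.py::g", "b.py::h", "c.py::k"},
    "edit": {"a.py::g"},
}

scores = diagnose(trajectory, reference)
weak = first_weak_stage(scores)  # "edit": located the code, mis-targeted the edit
```

In this toy case the read stage has full recall but a precision of 0.25, a small-scale analogue of the over-exploration described above, and the first failing stage is the edit. A binary pass/fail result would report only that the task failed.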
