AINeutralarXiv – CS AI · 14h ago6/10
🧠
BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents
Researchers introduce BenchTrace, a benchmark framework for evaluating how well large language model agents learn from failures through reflection and self-evolution. Testing on Qwen3-32B and GPT-4.1 reveals significant limitations: both models achieve below 30% accuracy on reflection tasks, struggle with diagnosis, and experience performance degradation as noise accumulates in their learning processes.
🧠 GPT-4