arXiv – CS AI · 10h ago
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
Researchers introduce OPT-BENCH, a benchmark that evaluates whether large language models can self-improve through iterative feedback in large search spaces. Testing 19 LLMs on machine-learning tasks and NP-hard problems shows that while stronger models adapt better to feedback, even the most advanced systems remain constrained by their base capabilities and fall short of human expert performance.
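The core loop being benchmarked, proposing a solution, receiving score feedback, and revising, can be illustrated with a minimal sketch. This is a generic iterative-refinement toy, not OPT-BENCH's actual protocol: the `propose`/`evaluate` names and the toy objective are assumptions, with a random local perturbation standing in for an LLM's feedback-conditioned revision.

```python
import random

def evaluate(candidate):
    """Toy objective standing in for a task's score (higher is better)."""
    return -(candidate - 3.0) ** 2

def optimize(steps=200, seed=0):
    """Iteratively refine a candidate using score feedback on past attempts."""
    rng = random.Random(seed)
    best, best_score = 0.0, evaluate(0.0)
    for _ in range(steps):
        # An LLM agent would propose a revision conditioned on its history of
        # attempts and their scores; here a local perturbation of the best
        # attempt so far mimics that revision step.
        candidate = best + rng.gauss(0, 0.5)
        score = evaluate(candidate)
        if score > best_score:  # keep the revision only if feedback improves
            best, best_score = candidate, score
    return best, best_score

best, score = optimize()
print(best, score)
```

The benchmark's finding maps onto this sketch: a stronger "proposer" converges faster, but the quality ceiling is set by what the proposer can generate in the first place.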