AI · Neutral · arXiv CS AI · 14h ago · 7/10
The Amazing Agent Race: Strong Tool Users, Weak Navigators
Researchers introduce The Amazing Agent Race (AAR), a new benchmark showing that LLM agents excel at tool use but struggle with navigation. Across three agent frameworks tested on 1,400 complex, graph-structured puzzles, even the best-performing framework reaches only 37.2% accuracy. Navigation errors account for 27–52% of failures, far outweighing tool-use errors (below 17%). This exposes a critical blind spot that existing benchmarks, which are built around linear tasks, fail to capture.