AI · Neutral · arXiv – CS AI · 7h ago · 6/10
LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?
Researchers introduce LLM-WikiRace, a benchmark that tests the planning and reasoning abilities of large language models by requiring them to navigate Wikipedia links from a source page to a target page. While frontier models such as Gemini-3 achieve superhuman performance on easy tasks, their success rates drop to 23% at the hard difficulty level, revealing significant limitations in long-horizon planning and in recovering from failures.
🧠 GPT-5 · 🧠 Claude · 🧠 Opus