AI · Bullish · arXiv – CS AI · 9h ago · 7/10
Benchmarks in Leipzig
Researchers at the Max Planck Institute compiled 100 research-level mathematics questions to benchmark the reasoning capabilities of large language models (LLMs). After three evaluation stages, only 2 of the 100 questions remained unsolved by advanced LLMs, indicating significant progress in AI mathematical reasoning.