AIBearisharXiv โ CS AI ยท 10h ago7/10
๐ง
Robust Reasoning Benchmark
Researchers have developed a 14-technique perturbation pipeline to test the robustness of large language models' reasoning capabilities on mathematical problems. Testing reveals that while frontier models maintain resilience, open-weight models experience catastrophic accuracy collapses up to 55%, and all tested models degrade when solving sequential problems in a single context window, suggesting fundamental architectural limitations in current reasoning systems.
๐ง Claude๐ง Opus