Human vs Machine Mathematical Difficulty on Project Euler: An Experimental Analysis
A new study analyzing 3,840 AI attempts across 50 mathematical problems from Project Euler finds that frontier AI systems scale more efficiently with problem difficulty than previously predicted, with machine effort following a power-law relationship where the exponent is less than 1 for most models tested. This suggests AI systems may actually improve relative to humans as problems become harder, contrary to earlier theoretical predictions.
This research challenges a long-standing assumption in AI capability assessment: that machines would face diminishing returns as problem difficulty increases. Analyzing data from MathArena's Project Euler benchmark, the researchers found that the scaling exponent b is less than 1 for 20 of 25 models, meaning token cost grows sublinearly with human solve time. This inversion of the expected difficulty scaling has significant implications for understanding AI trajectory and capability gains.
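To make the sublinear-scaling claim concrete, here is a minimal sketch of how an exponent b could be estimated. It assumes the power law takes the form tokens ≈ a · T^b, with T the human solve time, and fits it by ordinary least squares in log-log space. The prefactor `a`, the function name `fit_power_law`, the fitting method, and the data are all illustrative assumptions, not the study's actual pipeline or results.

```python
import math

def fit_power_law(human_minutes, tokens):
    """Fit tokens ~= a * human_minutes**b by least squares on log-log values."""
    xs = [math.log(t) for t in human_minutes]
    ys = [math.log(c) for c in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log space = prefactor
    return a, b

# Synthetic data generated from a known exponent (NOT data from the study).
human_minutes = [5, 15, 60, 240]
tokens = [2000 * t ** 0.7 for t in human_minutes]

a, b = fit_power_law(human_minutes, tokens)

# With b < 1, doubling the human solve time less than doubles the token cost.
cost_multiplier_per_doubling = 2 ** b
print(f"b = {b:.3f}, cost multiplier per doubling = {cost_multiplier_per_doubling:.3f}")
```

Under this form, an exponent below 1 means each doubling of human solve time multiplies the token cost by 2^b, which is less than 2.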
The study builds on Timothy Gowers' theoretical framework, which proposes a power-law relationship between machine effort and human difficulty. Rather than confirming that machine performance degrades faster than human performance on harder problems, the empirical evidence suggests that frontier models scale surprisingly efficiently. The research also validates an exponential decay model for success probability, with a median R² of 0.92 across top configurations, which gives it predictive power for estimating when AI systems will solve increasingly difficult problem classes.
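The exponential decay model can be sketched as follows. The parametrization p(T) ≈ exp(-λT) in human solve time T is an assumption, since the summary does not give the exact form. The names `fit_exponential_decay` and `half_horizon`, the through-the-origin fit, and the data are likewise hypothetical.

```python
import math

def fit_exponential_decay(human_minutes, success_rates):
    """Fit p(T) ~= exp(-lam * T) by least squares through the origin on ln p.

    All success rates must be strictly positive so the logarithm is defined.
    """
    log_p = [math.log(p) for p in success_rates]
    numerator = sum(t * y for t, y in zip(human_minutes, log_p))
    denominator = sum(t * t for t in human_minutes)
    return -numerator / denominator

def half_horizon(lam):
    """Human solve time at which the modeled success probability is 50%."""
    return math.log(2) / lam

# Synthetic data with a known 120-minute horizon (NOT data from the study).
true_lam = math.log(2) / 120
human_minutes = [10, 30, 60, 120, 240]
success_rates = [math.exp(-true_lam * t) for t in human_minutes]

lam = fit_exponential_decay(human_minutes, success_rates)
horizon = half_horizon(lam)
print(f"50% horizon = {horizon:.1f} minutes")
```

Solving exp(-λT) = 0.5 gives T = ln 2 / λ. This is how a fitted decay rate would translate into a 50% horizon under the assumed form.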
The practical implications are substantial for AI development roadmaps and capability forecasting. On current trends, the state-of-the-art 50% task-length horizon (the human solve time at which a model succeeds about half the time) is doubling roughly every 75 days, which represents rapid progress on mathematical reasoning. This metric suggests AI systems are closing gaps faster on complex problems than on simple ones, inverting typical human learning patterns. For researchers and capability analysts, these findings provide empirical grounding for predicting when frontier models will reach specific levels of mathematical competency, though the study's focus on computational mathematics may not generalize fully to other domains.
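As a rough illustration of what a 75-day doubling time implies, the sketch below extrapolates a horizon forward under pure exponential growth. The starting horizon of 1 hour and the 8-hour target are hypothetical values chosen for the arithmetic, and the helper names are illustrative. Only the 75-day figure comes from the study, and the extrapolation holds only if the trend persists.

```python
import math

DOUBLING_DAYS = 75  # doubling time reported in the study

def horizon_after(start_hours, days, doubling_days=DOUBLING_DAYS):
    """Horizon after `days`, assuming steady exponential growth."""
    return start_hours * 2 ** (days / doubling_days)

def days_to_reach(start_hours, target_hours, doubling_days=DOUBLING_DAYS):
    """Days needed to grow from start_hours to target_hours at that rate."""
    return doubling_days * math.log2(target_hours / start_hours)

# Hypothetical starting point of 1 hour (NOT a figure from the study).
print(horizon_after(1.0, 150))    # two doublings
print(days_to_reach(1.0, 8.0))    # three doublings
```

Each factor of two in the horizon costs one doubling period, so the time to reach a target grows only logarithmically with the target's size.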
- AI systems demonstrate sublinear scaling with problem difficulty (exponent b < 1), meaning they improve relative to humans on harder mathematical problems.
- Success probability follows predictable exponential decay patterns across problem difficulty levels, enabling better capability forecasting.
- State-of-the-art AI is doubling its mathematical task-length horizon approximately every 75 days based on current trajectory.
- Frontier models now solve Project Euler problems that would take humans 2.5-4.3 hours, indicating substantial progress in mathematical reasoning.
- The study contradicts earlier predictions that machines would scale worse than humans with increasing problem difficulty.