AI · Neutral · arXiv – CS AI · 10h ago · 7/10
MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs
Researchers introduced MathConstraint, an adaptive benchmark that tests the combinatorial reasoning of large language models using constraint satisfaction problems whose answers are verified automatically. The benchmark reveals significant performance gaps between frontier models: accuracy drops from 72–87% on easier instances to 18–66% on harder ones, while tool access via Python solvers roughly doubles performance.
🧠 GPT-5
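To make "constraint satisfaction problems with automated verification" concrete, here is a minimal sketch in Python. It is not MathConstraint's actual code: graph coloring is an assumed example problem, and `verify_coloring` and `brute_force_solve` are hypothetical helper names chosen for illustration. The point is only that a proposed answer can be checked mechanically against the constraints, and that a small solver of the kind a tool-using model might write can find one.

```python
from itertools import product


def verify_coloring(edges, coloring, num_colors):
    """Check a proposed answer: every vertex gets a color in
    0..num_colors-1 and no edge joins two same-colored vertices."""
    vertices = {v for edge in edges for v in edge}
    # A missing or out-of-range color makes the answer invalid.
    if any(coloring.get(v) not in range(num_colors) for v in vertices):
        return False
    return all(coloring[u] != coloring[v] for u, v in edges)


def brute_force_solve(edges, num_colors):
    """Exhaustively search for a valid coloring; return None if none exists.
    Exponential in the number of vertices, so only for tiny instances."""
    vertices = sorted({v for edge in edges for v in edge})
    for colors in product(range(num_colors), repeat=len(vertices)):
        candidate = dict(zip(vertices, colors))
        if verify_coloring(edges, candidate, num_colors):
            return candidate
    return None


# Hypothetical instance: a 5-cycle, which needs 3 colors.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
solution = brute_force_solve(cycle5, 3)
```

Because the verifier only checks constraints, it accepts any valid answer rather than comparing against a single reference solution, which is what makes automated grading of generated instances practical.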