ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
Researchers introduce ComBench, a new benchmark containing 100 Olympiad-level combinatorics problems designed to evaluate large language models' mathematical reasoning capabilities. The benchmark reveals that even frontier models struggle with combinatorial problems, with the best performance reaching only 65.4%, and identifies that rigorous proof reasoning and constructive problem-solving are distinct capabilities that models handle unevenly.
ComBench addresses a critical gap in AI evaluation by focusing on combinatorial reasoning, a domain that requires both creative insight and rigorous mathematical argumentation. The benchmark's dual structure—separating analysis-centric problems (requiring formal proofs) from construction-centric problems (requiring explicit constructions)—provides nuanced diagnostic capabilities that traditional benchmarks lack. This distinction matters because it reveals that models don't uniformly struggle across mathematical domains; different architectures excel at different types of reasoning.
The research reflects broader trends in AI evaluation where general benchmarks increasingly fail to capture domain-specific capabilities. As large language models advance on standardized tests, researchers must develop increasingly specialized assessments to identify remaining weaknesses. Olympiad-level problems represent a genuine upper bound of mathematical reasoning because they require not just computational ability but creative problem-solving and structural insight.
The findings have significant implications for AI development priorities. The divergence between models—with Kimi-K2.6 outperforming GPT-4.5 on construction problems despite trailing on proof grading—suggests that different training approaches yield different cognitive strengths. This informs where researchers should focus improvement efforts and how organizations should evaluate model capabilities for specific applications requiring mathematical reasoning.
The saturation analysis indicates substantial room for improvement, making ComBench valuable for tracking progress. The consistent difficulty of existence and construction problems across models points toward architectural or training limitations that future research should address. Organizations deploying LLMs for technical domains should monitor whether models show balanced mathematical capabilities or whether they remain specialized toward particular reasoning types.
- →ComBench reveals frontier models achieve only 65.4% maximum performance on Olympiad-level combinatorics, indicating substantial gaps in creative mathematical reasoning.
- →Rigorous proof reasoning and constructive problem-solving emerge as distinct capabilities that models handle unevenly, with some models excelling at proofs but struggling with constructions.
- →The benchmark combines rubric-guided proof grading with deterministic verification, exposing cases where proof quality diverges from construction validity.
- →Construction and existence problems remain consistently harder across all tested models, suggesting they represent fundamental reasoning challenges.
- →The divergence between models like Kimi-K2.6 and GPT-4.5 indicates different training approaches produce different mathematical reasoning strengths.