GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory
Researchers introduced GTBench, a curriculum-based benchmark of 63 graph theory problems designed to evaluate LLMs as mathematical research assistants. Tests of five frontier models revealed significant performance gaps: GPT-5 substantially outperformed its competitors on advanced proofs, yet every model lost accuracy on graduate-level reasoning, raising concerns about AI reliability in technical education and research.
GTBench addresses a gap in AI evaluation methodology by introducing a structured, difficulty-tiered framework for measuring LLM performance on mathematical reasoning. Unlike generic benchmarks, it follows the way humans progress through a mathematical discipline, which makes it easier to pinpoint where current models fail. Because the curriculum runs from basic definitions through graduate proofs, the results bear directly on educational deployment decisions.
The performance hierarchy documented in this research has significant implications for AI-assisted learning environments. GPT-5's near-ceiling performance on foundational concepts (95.8%), followed by a drop to 82% on graduate proofs, suggests that current models handle pattern recognition and routine problem-solving well but are less reliable at original mathematical reasoning. Llama 3.3 70B scored 0% on graduate proofs under human evaluation, against GPT-5's 82%. That gap is consistent with model scale and architecture influencing mathematical reasoning capacity, although a comparison of five models cannot isolate either factor.
For educational institutions and research organizations, these findings suggest that careful governance frameworks are needed before LLMs are deployed as primary research assistants. The hybrid evaluation protocol, which combines human expert assessment with LLM-as-judge scoring, showed uneven agreement between the two (kappa ranging from 0.48 to 0.83), with disagreement concentrated on verbose or near-complete proofs. This discrepancy highlights the risk that automated assessment misses nuanced mathematical reasoning or accepts incorrect logic that is presented persuasively.
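To make the agreement figures concrete, here is a minimal sketch of how a kappa value can be computed. It assumes the reported statistic is Cohen's kappa for two raters, which the summary does not specify, and the verdict lists below are invented for illustration rather than taken from GTBench.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: the benchmark's "kappa" is Cohen's kappa; the source
    summary does not say which variant was used.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")

    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (1 = proof accepted, 0 = proof rejected).
human_verdicts = [1, 1, 0, 0, 1, 0, 1, 0]
judge_verdicts = [1, 1, 1, 0, 1, 0, 1, 1]
kappa = cohens_kappa(human_verdicts, judge_verdicts)
```

In this invented example the two raters agree on 6 of 8 proofs (0.75) while chance agreement is 0.5, giving a kappa of 0.5. The LLM judge accepts two proofs the human rejects, which is the kind of leniency the disagreement on verbose or near-complete proofs points to.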
The dominance of "correct algorithm, wrong execution" errors across groups 1-2 indicates that models often choose an appropriate method but make mistakes when carrying it out. Looking forward, researchers should track whether emerging models improve execution fidelity, and probe whether specialized fine-tuning on mathematical domains reduces these gaps or merely masks underlying reasoning limitations.
- GPT-5 shows substantially stronger mathematical reasoning than its competitors, scoring 95.8% on basic concepts and 82% on graduate-level proofs.
- All evaluated models show pronounced performance degradation as problem difficulty rises, suggesting fundamental limitations in advanced mathematical reasoning.
- Uneven agreement between human and automated evaluators (kappa 0.48 to 0.83) raises concerns about deploying LLMs as sole assessment tools in mathematics.
- "Correct algorithm, wrong execution" errors are the dominant failure mode, indicating that models struggle to translate high-level understanding into accurate implementation.
- GTBench provides the first curriculum-grounded evaluation framework for assessing LLM reliability in mathematical education and research contexts.