A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models
Researchers compared how large language models rate the interestingness of math problems against human judgments from college students and International Mathematical Olympiad competitors. While LLMs show broad agreement with humans, they fail to match the distribution of human preferences and do a poor job of explaining why problems are interesting, though they can generate novel, engaging problems once invalid ones are filtered out.
This research addresses a critical gap as AI systems become embedded in mathematical research and education. The study reveals that despite LLMs' general ability to identify interesting problems, their judgments diverge significantly from human mathematical intuition in ways that matter for deploying these tools responsibly. The misalignment appears twofold: LLMs don't replicate the specific distribution of what humans find interesting, and they struggle to articulate the reasoning behind interestingness judgments. Either gap could mislead students or researchers relying on AI guidance.
The research builds on growing recognition that LLMs, while powerful language processors, don't necessarily internalize human values or preferences even when performing well on aggregate metrics. This aligns with broader AI alignment challenges where systems trained on vast text corpora may mimic surface-level patterns without capturing deeper contextual understanding. The finding that LLMs can generate novel problems that survive validity filtering suggests their limitations lie in judgment calibration rather than creative capacity.
For the mathematics education and research communities, this means LLMs cannot yet serve as standalone advisors for problem selection or curriculum design. Instead, the authors advocate for collaborative human-AI systems where multiple models and human perspectives are integrated. This has implications for educational technology developers and researchers building AI-assisted mathematics platforms. The work underscores that deploying LLMs in high-stakes intellectual domains requires careful validation against domain-expert preferences, not just technical accuracy metrics.
- LLMs broadly identify interesting math problems but fail to match the distribution of human preferences across different expertise levels
- LLMs poorly explain their interestingness judgments, showing weak correlation with the rationales humans select for why problems are interesting
- LLMs can generate novel math problems that pass validity filtering, indicating creative capacity exists but judgment calibration needs improvement
- The authors advocate collaborative human-AI systems that integrate multiple models and human perspectives before language models are deployed as trustworthy partners in mathematical reasoning
- This research highlights the gap between aggregate LLM performance and fine-grained alignment with human domain expertise
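
The difference between broad agreement and matching the distribution of human preferences can be made concrete with a small sketch. The ratings below are invented for illustration and are not data from the study, and the two measures (Pearson correlation and a one-dimensional Wasserstein distance) are assumptions chosen for simplicity, not necessarily the metrics the authors used. The hypothetical model ranks problems exactly as the humans do, yet squeezes every score into a narrow band.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: how well two sets of ratings move together."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-sized samples.

    For equal-sized one-dimensional samples this is the mean absolute
    difference between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-sized samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical interestingness ratings for seven problems on a 1-7 scale.
human = [1, 2, 3, 4, 5, 6, 7]
# Hypothetical model ratings: same ordering, but compressed toward the middle.
model = [3.4, 3.6, 3.8, 4.0, 4.2, 4.4, 4.6]

print("correlation:", round(pearson(human, model), 3))
print("distribution distance:", round(wasserstein_1d(human, model), 3))
```

In this toy case the correlation is essentially perfect while the distance between the two rating distributions is large, which is the kind of gap an aggregate agreement score can hide.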