🧠 AI | ⚪ Neutral | Importance 6/10

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

arXiv – CS AI | Sherin Muckatira, Jesse Geneson, Slava Gerovitch, Pavel Etingof, Mikhail Gronas, Anna Rumshisky | June 8, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce CrowdMath, a dataset of 164 expert-annotated collaborative mathematical problem-solving discussions from MIT PRIMES and Art of Problem Solving (2016-2025). Frontier AI models reach 83-88% accuracy at predicting the next post but only 0.42 macro-F1 at classifying the functional role of each contribution, revealing a gap between solving isolated problems and comprehending collaborative research progress.

Analysis

CrowdMath addresses a fundamental limitation in how AI systems are evaluated on mathematical reasoning. Traditional benchmarks focus on well-defined problems with verifiable solutions, but real mathematical research involves iterative collaboration where participants propose partial ideas, critique flawed reasoning, and incrementally build toward proofs. This dataset captures that authentic process by annotating 164 discussion chains from a program that has produced peer-reviewed publications, providing a realistic testbed for evaluating how AI models understand the dynamic flow of mathematical discourse.

The benchmark results reveal a critical distinction in AI capabilities. Models perform well on local prediction tasks, that is, identifying what logically follows next in a discussion, where they achieve 83-88% accuracy. However, when asked to classify the functional role of individual posts they reach only 0.42 macro-F1, indicating they struggle to tell whether a contribution represents genuine progress, error correction, or a dead-end exploration. The two figures are different metrics and are not directly comparable, but taken together they suggest that while language models can follow immediate logical sequences, they lack deeper comprehension of how ideas integrate into broader problem-solving narratives.
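To illustrate why an accuracy figure and a macro-F1 figure should not be read on the same scale, here is a minimal Python sketch of both metrics. The label names (`progress`, `correction`, `dead_end`) and the toy predictions are hypothetical, chosen only to echo the roles mentioned above; they are not the CrowdMath annotation scheme or its results.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores, so rare classes count equally."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: an imbalanced label set and a model that
# always predicts the majority class.
gold = ["progress"] * 8 + ["correction", "dead_end"]
pred = ["progress"] * 10

print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"macro-F1 = {macro_f1(gold, pred):.2f}")
```

In this toy case the majority-class predictor gets most items right yet scores poorly on macro-F1, because the two rare classes each contribute an F1 of zero to the average. Macro-F1 therefore penalizes a model that misses infrequent roles, which is one reason it is a stricter measure for role classification.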

For the AI research community, CrowdMath exposes the limitations of current evaluation paradigms and provides infrastructure for developing models that understand collaborative reasoning. This has implications for research automation and AI-assisted problem-solving tools, which must do more than generate plausible next steps: they need to recognize when an idea represents meaningful progress. The dataset's publication creates opportunities to benchmark future models on a task that more closely mirrors human mathematical research, potentially driving development toward AI systems better suited for collaborative scientific work.

Key Takeaways

→ The CrowdMath dataset contains 164 annotated collaborative math problem-solving discussions from a real research program that produced peer-reviewed publications
→ Frontier AI models achieve 83-88% accuracy on next-post prediction but only 0.42 macro-F1 on classifying the functional role of contributions
→ Current LLMs can follow local logical flow in mathematical discussions but struggle to recognize the broader significance of individual contributions to problem-solving
→ The benchmark exposes a gap between solving well-specified math problems and comprehending collaborative mathematical reasoning as it develops
→ CrowdMath provides new infrastructure for evaluating and developing AI systems that understand research progress rather than isolated problem completion