AINeutralarXiv – CS AI · 7h ago6/10
🧠
Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage
Researchers present a framework for evaluating procedural reasoning datasets in AI-supported learning systems by comparing three question-generation strategies based on Task-Method-Knowledge (TMK) models. The study demonstrates that strict TMK generation produces the most grounded and usable datasets (96.5% grounded), while transcript-based approaches sacrifice representational alignment for naturalness, highlighting the trade-off between learner-like phrasing and formal grounding in evaluation dataset construction.