AI · Neutral · arXiv – CS AI · 7h ago · 6/10
From Test-taking to Cognitive Scaffolding: A Pedagogical Diagnostic Benchmark for LLMs on English Standardized Tests
Researchers introduce ESTBook, a pedagogical diagnostic benchmark containing 10,576 multimodal questions across five major English standardized tests. It is designed to evaluate whether large language models reason faithfully and identify student misconceptions, rather than only measuring binary answer accuracy. The framework moves beyond traditional test-taking benchmarks by enriching questions with cognitive reasoning trajectories and distractor rationales, enabling better assessment of LLMs as educational tutoring tools.