arXiv · CS AI · Feb 27
Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
Researchers introduced ReasoningMath-Plus, a new benchmark with 150 problems designed to evaluate structural mathematical reasoning in large language models. The study reveals that while leading LLMs achieve relatively high final-answer accuracy, they perform significantly worse on process-level evaluation metrics, indicating that answer-only assessments may overestimate actual reasoning capabilities.
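The gap the study reports can be illustrated with a minimal sketch. The metric names and the exact-match step comparison below are assumptions for illustration only, not the benchmark's actual process-level judging:

```python
# Hedged sketch (not ReasoningMath-Plus's actual metrics): contrast
# answer-only scoring with a simple process-level score that requires
# every reasoning step to be correct.

def answer_accuracy(predictions, references):
    """Fraction of problems whose final answer matches the reference."""
    return sum(p["answer"] == r["answer"]
               for p, r in zip(predictions, references)) / len(references)

def process_accuracy(predictions, references):
    """Fraction of problems whose *every* step matches the reference.
    Exact string equality stands in for the benchmark's (unspecified)
    step-level judging."""
    def all_steps_ok(p, r):
        return (len(p["steps"]) == len(r["steps"])
                and all(ps == rs for ps, rs in zip(p["steps"], r["steps"])))
    return sum(all_steps_ok(p, r)
               for p, r in zip(predictions, references)) / len(references)

# A model can reach the right answer through a derivation that deviates
# from the reference, so answer-only scoring overestimates it:
refs  = [{"steps": ["2x = 6", "x = 3"], "answer": "3"}]
preds = [{"steps": ["2x = 6", "x = 6 - 2 = 3 (wrong rule)"], "answer": "3"}]
print(answer_accuracy(preds, refs))   # 1.0
print(process_accuracy(preds, refs))  # 0.0
```

The example shows why the two numbers can diverge: answer accuracy credits the correct final value even when an intermediate step is invalid, which is exactly the overestimation the study describes.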