AINeutralarXiv – CS AI · 10h ago6/10
🧠
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
Researchers introduce a strategy-level evaluation framework for large language models on mathematical reasoning tasks, revealing a significant gap between high answer accuracy and actual reasoning flexibility. While frontier models achieve 95-100% accuracy on single-solution prompts, they recover substantially fewer problem-solving strategies than human references when asked to generate multiple approaches, with only 39-71% coverage depending on the model and iteration count.
🧠 Claude🧠 Gemini