#reasoning-assessment News & Analysis

3 articles tagged with #reasoning-assessment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

Researchers introduce TRACE, a novel metric for evaluating the reasoning quality of large language models' Chain-of-Thought outputs by analyzing argument structure rather than just final answers. The method combines Toulmin's argumentation theory with metacognitive frameworks and demonstrates strong correlation with benchmark accuracy while improving reinforcement learning performance.

AI · Bearish · arXiv – CS AI · Jun 10 · 6/10

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

Researchers introduce RealMath-Eval, a benchmark revealing that state-of-the-art LLM judges fail to accurately evaluate authentic student mathematical reasoning. The judges perform significantly worse on real exam responses (mean squared error, MSE, of about 2.96) than on synthetic LLM-generated solutions (MSE of about 1.17). The study attributes this "Evaluation Gap" to human errors occupying a more diverse semantic space than the predictable patterns found in synthetic errors.
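
For readers unfamiliar with the MSE figures quoted above, here is a minimal sketch of how a mean squared error between judge scores and human-assigned grades might be computed. The function name `mean_squared_error`, the score scale, and the sample values are all hypothetical and chosen for illustration; they are not taken from the RealMath-Eval paper.

```python
def mean_squared_error(predicted, reference):
    """Average of squared differences between two equal-length score lists."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("score lists must be non-empty and equal in length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


# Made-up illustrative scores (assumption), not data from the paper.
human_grades = [4, 2, 5, 3]   # grades assigned by human markers
judge_scores = [5, 4, 4, 1]   # grades assigned by a hypothetical LLM judge

print(mean_squared_error(judge_scores, human_grades))  # prints 2.5
```

A lower MSE means the judge's scores sit closer to the human grades, so the reported gap (about 2.96 versus about 1.17) indicates larger grading errors on real student work than on synthetic solutions.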

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks

Researchers introduce CodeRQ-Bench, the first benchmark for evaluating LLM reasoning quality across coding tasks including generation, summarization, and classification. They propose VERA, a two-stage evaluator combining evidence-grounded verification with ambiguity-aware score correction, achieving significant performance improvements over existing methods.