Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks
Researchers introduce CodeRQ-Bench, presented as the first benchmark for evaluating the quality of LLM reasoning, rather than only output correctness, across coding tasks such as code generation, summarization, and classification. They also propose VERA, a two-stage evaluator that combines evidence-grounded verification with ambiguity-aware score correction; the authors report that it substantially outperforms existing evaluation methods.
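The summary does not spell out VERA's internals, but the two named stages suggest a concrete shape: a verifier that checks each reasoning step against cited evidence, followed by a correction that discounts verdicts the verifier itself was unsure about. Below is a minimal Python sketch of what such a pipeline could look like; `StepVerdict`, the `judge` callable, the neutral 0.5 anchor, and the `ambiguity_threshold` default are all illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a two-stage "verify, then correct" evaluator in the
# spirit of VERA as described above. Every name and rule here is an assumption.

from dataclasses import dataclass
from typing import Callable


@dataclass
class StepVerdict:
    step: str            # one step of the model's reasoning chain
    supported: bool      # did the verifier find grounding evidence?
    evidence: str        # snippet of task context cited as support
    confidence: float    # verifier's self-reported confidence in [0, 1]


def verify_steps(
    reasoning_steps: list[str],
    task_context: str,
    judge: Callable[[str, str], StepVerdict],
) -> list[StepVerdict]:
    """Stage 1: evidence-grounded verification (assumed form).

    Each reasoning step is checked against the task context (problem
    statement, code, docs) by a judge that must cite the evidence it
    relied on. `judge` stands in for an LLM call in a real system.
    """
    return [judge(step, task_context) for step in reasoning_steps]


def corrected_score(
    verdicts: list[StepVerdict],
    ambiguity_threshold: float = 0.6,  # assumed cutoff, not from the paper
) -> float:
    """Stage 2: ambiguity-aware score correction (assumed rule).

    Start from the fraction of supported steps, then shrink the score
    toward a neutral 0.5 in proportion to how many verdicts were
    low-confidence, so ambiguous judgments cannot swing the result.
    """
    if not verdicts:
        return 0.0
    raw = sum(v.supported for v in verdicts) / len(verdicts)
    ambiguous = sum(v.confidence < ambiguity_threshold for v in verdicts) / len(verdicts)
    return (1 - ambiguous) * raw + ambiguous * 0.5


if __name__ == "__main__":
    # Toy judge that "supports" steps containing the first context word.
    def keyword_judge(step: str, context: str) -> StepVerdict:
        hit = context.split()[0] in step
        return StepVerdict(step, hit, context if hit else "", 0.9 if hit else 0.4)

    verdicts = verify_steps(
        ["sort uses quicksort", "output list is sorted"],
        "sorted output required by the spec",
        keyword_judge,
    )
    print(f"reasoning-quality score: {corrected_score(verdicts):.2f}")
```

One design note on this sketch: shrinking ambiguous verdicts toward a midpoint, rather than discarding them, keeps the score defined even when the verifier is unsure about every step; whether VERA makes the same choice is not stated in the summary.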