Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
arXiv · CS AI | Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
AI Summary
Researchers introduce CFE-Bench, a new multimodal benchmark for evaluating AI reasoning across 20+ STEM domains using authentic university exam problems. The best-performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning capabilities, particularly in maintaining correct intermediate states through multi-step solutions.
Key Takeaways
- CFE-Bench uses real university homework and exam problems, paired with instructor solutions, to test AI reasoning across STEM fields.
- The top model, Gemini-3.1-pro-preview, scored only 59.69% accuracy, leaving substantial room for improvement (a minimal sketch of this style of accuracy scoring follows this list).
- AI models struggle to maintain correct intermediate states throughout complex multi-step problem solving.
- Current models take more reasoning steps than human instructors, indicating lower efficiency and a higher risk of error.
- The benchmark shows frontier models can answer individual sub-questions correctly yet fail to sustain a coherent end-to-end reasoning flow.
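
For context, accuracy figures like the 59.69% above are typically the fraction of problems whose final answer matches the instructor's key. The sketch below illustrates that idea only; the `ExamProblem` structure, the `ask_model` stub, and exact-match grading are assumptions for demonstration, not CFE-Bench's actual harness, which grades richer, multi-step instructor solutions.

```python
# Illustrative sketch of final-answer accuracy scoring over exam items.
# Everything here is hypothetical: real benchmark harnesses often use
# rubric- or judge-based grading of intermediate reasoning steps.
from dataclasses import dataclass


@dataclass
class ExamProblem:
    question: str
    reference_answer: str  # instructor-provided answer key


def ask_model(question: str) -> str:
    """Stand-in for an LLM call; returns a canned answer here."""
    return "42"


def evaluate(problems: list[ExamProblem]) -> float:
    """Fraction of problems where the model's final answer exactly
    matches the reference (a deliberate simplification)."""
    correct = sum(
        ask_model(p.question).strip() == p.reference_answer.strip()
        for p in problems
    )
    return correct / len(problems)


if __name__ == "__main__":
    toy_set = [
        ExamProblem("What is 6 * 7?", "42"),
        ExamProblem("What is 2 + 2?", "4"),
    ]
    print(f"accuracy: {evaluate(toy_set):.2%}")  # accuracy: 50.00%
```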
#ai-benchmarking #llm-evaluation #reasoning-capabilities #stem-education #gemini #multimodal-ai #academic-testing #ai-limitations
Read Original via arXiv · CS AI