AI · Neutral · arXiv CS AI · 5h ago
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Researchers introduce CFE-Bench, a new multimodal benchmark for evaluating AI reasoning across 20+ STEM domains using authentic university exam problems. The best-performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning, particularly in maintaining correct intermediate states across multi-step solutions.