y0news
AI · Neutral · arXiv – CS AI · 5h ago

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Researchers introduce CFE-Bench, a new multimodal benchmark that evaluates AI reasoning across 20+ STEM domains using authentic university exam problems. The best-performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy. The results point to significant gaps in AI reasoning, particularly in keeping intermediate states correct across multi-step solutions.