Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
arXiv · CS AI | Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
AI Summary
Researchers introduce CFE-Bench, a new multimodal benchmark for evaluating AI reasoning across 20+ STEM domains using authentic university exam problems. The best-performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning capabilities, particularly in maintaining correct intermediate states through multi-step solutions.
Key Takeaways
- CFE-Bench uses real university homework and exam problems, paired with instructor solutions, to test AI reasoning across STEM fields.
- The top model, Gemini-3.1-pro-preview, scored only 59.69% accuracy, leaving substantial room for improvement (a minimal sketch of this style of accuracy scoring follows this list).
- AI models struggle to maintain correct intermediate states throughout complex multi-step problem solving.
- Current models take more reasoning steps than human instructors, indicating lower efficiency and a higher risk of error.
- The benchmark shows frontier models can answer individual sub-questions correctly yet fail to sustain a coherent end-to-end reasoning flow.
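
For context, accuracy figures like the 59.69% above are typically the fraction of problems whose final answer matches the instructor's key. The sketch below illustrates that idea only; the `ExamProblem` structure, the `ask_model` stub, and exact-match grading are assumptions for demonstration, not CFE-Bench's actual harness, which grades richer, multi-step instructor solutions.

```python
# Illustrative sketch of final-answer accuracy scoring over exam items.
# Everything here is hypothetical: real benchmark harnesses often use
# rubric- or judge-based grading of intermediate reasoning steps.
from dataclasses import dataclass


@dataclass
class ExamProblem:
    question: str
    reference_answer: str  # instructor-provided answer key


def ask_model(question: str) -> str:
    """Stand-in for an LLM call; returns a canned answer here."""
    return "42"


def evaluate(problems: list[ExamProblem]) -> float:
    """Fraction of problems where the model's final answer exactly
    matches the reference (a deliberate simplification)."""
    correct = sum(
        ask_model(p.question).strip() == p.reference_answer.strip()
        for p in problems
    )
    return correct / len(problems)


if __name__ == "__main__":
    toy_set = [
        ExamProblem("What is 6 * 7?", "42"),
        ExamProblem("What is 2 + 2?", "4"),
    ]
    print(f"accuracy: {evaluate(toy_set):.2%}")  # accuracy: 50.00%
```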
#ai-benchmarking #llm-evaluation #reasoning-capabilities #stem-education #gemini #multimodal-ai #academic-testing #ai-limitations
Read Original via arXiv · CS AI