
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

arXiv – CS AI | Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
🤖 AI Summary

Researchers introduce CFE-Bench, a new multimodal benchmark that evaluates AI reasoning across 20+ STEM domains using authentic university exam problems. The best-performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning, particularly in maintaining correct intermediate states through multi-step solutions.

Key Takeaways
  • CFE-Bench uses real university homework and exam problems with instructor solutions to test AI reasoning across STEM fields (a grading sketch follows these takeaways).
  • The top model, Gemini-3.1-pro-preview, scored only 59.69% accuracy, showing substantial room for improvement.
  • AI models struggle to maintain correct intermediate states throughout complex multi-step problem solving.
  • Current models use more reasoning steps than human instructors, indicating lower efficiency and higher error risk.
  • The benchmark reveals that frontier AI models can answer sub-questions correctly yet fail to sustain a complete reasoning flow.
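
The takeaways above describe grading models against instructor solutions, checking intermediate states, and comparing step counts. The sketch below is a hypothetical illustration of such an evaluation loop, not the paper's code; every field, function, and metric name in it is an assumption.

```python
# Minimal sketch (not the authors' code) of how an exam-style benchmark
# like CFE-Bench might be scored. The data fields, the model call, and
# the step-matching rule below are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class ExamProblem:
    question: str
    instructor_steps: list[str]   # reference solution, step by step
    final_answer: str

def ask_model(question: str) -> tuple[list[str], str]:
    """Placeholder for a real model call; returns (reasoning steps, final answer)."""
    return (["step 1", "step 2"], "42")

def step_matches(predicted: str, reference: str) -> bool:
    """Toy intermediate-state check; a real grader would use a rubric or an LLM judge."""
    return predicted.strip().lower() == reference.strip().lower()

def grade(problems: list[ExamProblem]) -> dict[str, float]:
    final_correct = 0
    step_correct = step_total = 0
    extra_steps = 0
    for p in problems:
        steps, answer = ask_model(p.question)
        final_correct += answer.strip() == p.final_answer.strip()
        # Compare intermediate states against the instructor's solution.
        for pred, ref in zip(steps, p.instructor_steps):
            step_correct += step_matches(pred, ref)
        step_total += len(p.instructor_steps)
        # Track how many more steps the model takes than the instructor.
        extra_steps += max(0, len(steps) - len(p.instructor_steps))
    n = len(problems)
    return {
        "final_accuracy": final_correct / n,
        "intermediate_accuracy": step_correct / max(step_total, 1),
        "avg_extra_steps": extra_steps / n,
    }
```

Separating final-answer accuracy from intermediate-state accuracy and step overhead mirrors the findings summarized above: a model can get sub-questions right while still losing the overall reasoning thread.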
Read original via arXiv – CS AI