AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 3
🧠Researchers introduce CFE-Bench, a new multimodal benchmark for evaluating AI reasoning across 20+ STEM domains using authentic university exam problems. The best-performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning capabilities, particularly in maintaining correct intermediate states through multi-step solutions.
General · Neutral · Fortune Crypto · 3d ago · 6/10
📰Girls Who Code's CEO reports that 70% of teen girls express interest in cybersecurity careers, yet the industry is failing to retain them, contributing to a 4.7-million-person workforce gap. The article highlights a critical untapped talent pipeline in an industry facing severe labor shortages, suggesting that targeted recruitment and retention of interested female youth could address a major structural skills deficit.
AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠Aryabhata 2 is a specialized language model designed for competitive STEM examinations that uses reinforcement learning to improve reasoning capabilities while reducing computational output by up to 64%. Trained on PhysicsWallah's question banks, it outperforms its base model on JEE and NEET exams, addressing the practical challenge of deploying AI at scale for educational applications.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠A longitudinal study analyzing PISA data from 2018-2022 reveals that students globally show increasing ICT career aspirations despite pandemic-related learning disruptions, with digital skills emerging as the strongest predictor of career readiness for the AI era. The research indicates that educational systems are unevenly preparing students for AI-driven labor markets, suggesting structural gaps in how different countries develop foundational competencies.
AI · Bullish · arXiv – CS AI · May 9 · 6/10
🧠Researchers at Oregon State University developed LaTA, an open-source autograder that runs locally on institutional hardware to grade STEM assignments while maintaining FERPA compliance and eliminating data exposure risks. Deployed in a mechanical engineering course serving ~200 students, LaTA achieved a 0.02-0.04% error rate and correlated with 8-11% higher exam performance compared to traditionally graded cohorts.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers evaluated three major LLMs (Claude, Gemini, ChatGPT) on multimodal physics problems and found a significant performance drop compared to text-only tasks, identifying visual processing as the primary failure mode. A structured dialogue intervention corrected 82% of errors overall and achieved 100% correction on visual processing errors, offering immediate solutions for educators without requiring model retraining.
🧠 ChatGPT · 🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers evaluated GPT-4o's ability to score physics exam responses using rubric-assisted scoring, finding that AI reliability matches human inter-rater consistency when rubrics are well-structured and granular. The study reveals that clear rubric design matters far more than LLM configuration choices, with performance declining on ambiguous mid-range responses.
🧠 GPT-4
AI · Bullish · OpenAI News · Mar 10 · 5/10
🧠ChatGPT has launched new interactive visual explanations for math and science subjects, allowing students to explore formulas, variables, and concepts through real-time visual interactions. This educational enhancement represents OpenAI's continued expansion of ChatGPT's capabilities beyond text-based responses.
🧠 ChatGPT
AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 5
🧠Researchers developed MMGrader, an AI system that uses concept graphs to assess student mental models from multimodal responses. In tests, 9 open AI models reached only 40% accuracy when measured against human evaluators, indicating current limitations in educational AI assessment tools.