AI · Bullish · arXiv – CS AI · 5d ago · 7/10
🧠Researchers propose Self-Trained Verification (STV), an approach that improves AI reasoning models by training verifiers to catch self-generated errors, using reference solutions as supervision. The method doubles accuracy on hard math problems and achieves a 14x improvement on scientific reasoning tasks. It also makes self-training more effective: verifier-in-the-loop training boosts performance by a further 33%.
AI · Neutral · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce REL, a benchmark framework that evaluates relational reasoning in large language models by measuring Relational Complexity (RC)—the number of entities that must be simultaneously bound to apply a relation. The study reveals that frontier LLMs consistently degrade in performance as RC increases, exposing a fundamental limitation in higher-arity reasoning that persists even with increased compute and in-context learning.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers developed a new AI training method using knowledge graphs as reward models to improve compositional reasoning in specialized domains. The approach enables smaller 14B parameter models to outperform much larger frontier systems like GPT-5.2 and Gemini 3 Pro on complex multi-hop reasoning tasks in medicine.
🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2
🧠Researchers have developed OrchMAS, a new multi-agent AI framework that uses specialized expert agents and dynamic orchestration to improve reasoning in scientific domains. The system addresses limitations of existing multi-agent frameworks by enabling flexible role allocation, prompt refinement, and heterogeneous model integration for complex scientific tasks.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce FEM-Bench, a scientific reasoning benchmark designed to evaluate large language models' ability to generate correct finite element method (FEM) code for computational mechanics problems. Despite the simplicity of introductory-level tasks, current state-of-the-art LLMs show inconsistent performance, with Gemini 3 Pro completing 30/33 tasks at least once and GPT-5 achieving 73.8% success on unit test writing.
🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers introduced SpatialBench-Long, a benchmark testing AI agents' ability to conduct end-to-end scientific reasoning on complex spatial biology data without prescribed methods. The benchmark spans 24 evaluations across multiple cancer and aging systems using diverse measurement technologies. Current leading models achieve only an 11.1% success rate, revealing significant limitations in AI's capacity for autonomous biological discovery.
🏢 OpenAI · 🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce PolyFusionAgent, a multimodal AI framework combining a foundation model (PolyFusion) with an autonomous design agent (PolyAgent) for polymer discovery. The system integrates multiple polymer representations into a shared latent space to predict properties and generate novel structures, while grounding predictions in scientific literature for actionable design decisions.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce DISSECT, a 12,000-question diagnostic benchmark that reveals a critical "perception-integration gap" in Vision-Language Models—where VLMs successfully extract visual information but fail to reason about it during downstream tasks. Testing 18 VLMs across Chemistry and Biology shows open-source models systematically struggle with integrating visual input into reasoning, while closed-source models demonstrate superior integration capabilities.
AI · Bearish · arXiv – CS AI · Mar 27 · 6/10
🧠Researchers introduce MolQuest, a new benchmark for evaluating AI models' ability to perform complex chemical structure elucidation through multi-step reasoning. Even state-of-the-art AI models achieve only 50% accuracy on this real-world scientific task, revealing significant limitations in current AI systems' strategic reasoning capabilities.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠Researchers introduce CHIMERA, a compact 9K-sample synthetic dataset that enables smaller AI models to achieve reasoning performance comparable to much larger models. The dataset addresses key challenges in training reasoning-capable LLMs through automated generation and cross-validation across 8 scientific disciplines.
AI · Neutral · arXiv – CS AI · Mar 5 · 4/10
🧠Researchers trained a compact 1.5B-parameter language model to solve beam physics problems using reinforcement learning with verifiable rewards, achieving a 66.7% improvement in accuracy. However, the model learned pattern-matching templates rather than true physics reasoning: it failed to generalize to topological changes despite mastering the same underlying equations.