y0news

#scientific-reasoning News & Analysis

11 articles tagged with #scientific-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

Self-Trained Verification for Training- and Test-Time Self-Improvement

Researchers propose Self-Trained Verification (STV), an approach that improves AI reasoning models by training verifiers to catch self-generated errors, using reference solutions as supervision. The method doubles accuracy on hard math problems and achieves a 14x improvement on scientific reasoning tasks. Placing the verifier in the loop during self-training boosts performance by a further 33%.

AI · Neutral · arXiv – CS AI · Apr 15 · 7/10

Evaluating Relational Reasoning in LLMs with REL

Researchers introduce REL, a benchmark framework that evaluates relational reasoning in large language models by measuring Relational Complexity (RC)—the number of entities that must be simultaneously bound to apply a relation. The study reveals that frontier LLMs consistently degrade in performance as RC increases, exposing a fundamental limitation in higher-arity reasoning that persists even with increased compute and in-context learning.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

OrchMAS: Orchestrated Reasoning with Multi Collaborative Heterogeneous Scientific Expert Structured Agents

Researchers have developed OrchMAS, a new multi-agent AI framework that uses specialized expert agents and dynamic orchestration to improve reasoning in scientific domains. The system addresses limitations of existing multi-agent frameworks by enabling flexible role allocation, prompt refinement, and heterogeneous model integration for complex scientific tasks.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

Researchers introduce FEM-Bench, a scientific reasoning benchmark designed to evaluate large language models' ability to generate correct finite element method (FEM) code for computational mechanics problems. Despite the simplicity of introductory-level tasks, current state-of-the-art LLMs show inconsistent performance, with Gemini 3 Pro completing 30/33 tasks at least once and GPT-5 achieving 73.8% success on unit test writing.

🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Verifiable Benchmarking of Long-Horizon Spatial Biology

Researchers introduced SpatialBench-Long, a comprehensive benchmark testing AI agents' ability to conduct end-to-end scientific reasoning on complex spatial biology data without prescribed methods. The benchmark spans 24 evaluations across multiple cancer and aging systems using diverse measurement technologies. Current leading models achieve only an 11.1% success rate, revealing significant limitations in AI's capacity for autonomous biological discovery.

🏢 OpenAI · 🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 27 · 6/10

PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

Researchers introduce PolyFusionAgent, a multimodal AI framework combining a foundation model (PolyFusion) with an autonomous design agent (PolyAgent) for polymer discovery. The system integrates multiple polymer representations into a shared latent space to predict properties and generate novel structures, while grounding predictions in scientific literature for actionable design decisions.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs

Researchers introduce DISSECT, a 12,000-question diagnostic benchmark that reveals a critical "perception-integration gap" in Vision-Language Models—where VLMs successfully extract visual information but fail to reason about it during downstream tasks. Testing 18 VLMs across Chemistry and Biology shows open-source models systematically struggle with integrating visual input into reasoning, while closed-source models demonstrate superior integration capabilities.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8

CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

Researchers introduce CHIMERA, a compact 9K-sample synthetic dataset that enables smaller AI models to achieve reasoning performance comparable to much larger models. The dataset addresses key challenges in training reasoning-capable LLMs through automated generation and cross-validation across 8 scientific disciplines.

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning

Researchers trained a compact 1.5B-parameter language model to solve beam physics problems using reinforcement learning with verifiable rewards, achieving a 66.7% improvement in accuracy. However, the model learned pattern-matching templates rather than true physics reasoning, failing to generalize to topological changes despite mastering the same underlying equations.