y0news

#scientific-ai News & Analysis

35 articles tagged with #scientific-ai. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

Researchers introduce MolLingo, a multi-agent AI system that automates molecular design by coordinating specialized agents through shared memory and domain-specific tools. The system uses BRICS-based Fragment Enumeration to represent molecules in chemically meaningful ways that LLMs can reason about effectively, achieving superior performance on drug design benchmarks compared to frontier models like GPT-5.

🧠 GPT-5
AI · Bullish · arXiv – CS AI · 3d ago · 7/10

AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

AIBuildAI-2 introduces a knowledge-enhanced AI agent that automatically builds machine learning models by combining large language models with an external, evolving knowledge system. The system achieves state-of-the-art performance, ranking first on MLE-Bench and placing in the top 6.6% of human teams in a predictive competition, democratizing AI model development for non-specialists.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

ScientistOne introduces Chain-of-Evidence, a verifiability framework that targets a critical failure mode of autonomous research systems: AI agents producing plausible-looking but unreliable outputs, including fabricated citations, unverified scores, and misaligned methods. The system achieves zero hallucinated references and perfect score verification across five research tasks, significantly outperforming existing baseline systems, which exhibit systematic failure rates of up to 80%.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations

AutoDFT is a closed-loop multi-agent framework that automates density functional theory (DFT) calculations by embedding LLM reasoning throughout the entire computational lifecycle, rather than just the planning phase. The system achieves 94.1% success on a 34-task benchmark and enables non-experts to obtain reliable computational chemistry results by dynamically adapting to failures and unexpected outcomes.

🧠 GPT-5
AI · Bullish · MIT Technology Review · May 22 · 7/10

Google I/O showed how the path for AI-driven science is shifting

At Google I/O, DeepMind CEO Demis Hassabis said we are approaching the "singularity," signaling his view that AI-driven scientific progress is accelerating rapidly. The keynote highlighted Google's positioning of AI as a transformative force for research and development across industries.

🏢 Google
AI · Bearish · arXiv – CS AI · May 12 · 7/10

MDGYM: Benchmarking AI Agents on Molecular Simulations

Researchers introduced MDGYM, a benchmark testing AI agents' ability to autonomously execute molecular dynamics simulations, finding that even the strongest systems solve only 21% of easy tasks. The poor performance reveals that advanced code generation does not translate to physical reasoning, exposing a critical gap between general software engineering competence and domain-specific scientific workflows.

🧠 Claude
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs

Researchers propose SciDC, a method that constrains large language model outputs using subject-specific scientific rules to reduce hallucinations and improve reliability. The approach demonstrates 12% average accuracy improvements across domain tasks including drug formulation, clinical diagnosis, and chemical synthesis planning.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Logos: An evolvable reasoning engine for rational molecular design

Researchers introduce Logos, a compact AI model that combines multi-step logical reasoning with chemical consistency for molecular design. The model achieves strong performance in structural accuracy and chemical validity while using fewer parameters than larger language models, and provides transparent reasoning that can be inspected by humans.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Phi-4-reasoning-vision-15B Technical Report

Researchers released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that combines vision and language capabilities with strong performance in scientific and mathematical reasoning. The model demonstrates that careful architecture design and high-quality data curation can enable smaller models to achieve competitive performance with fewer computational resources.

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery

Researchers have developed DBench-Bio, a dynamic benchmark system that automatically evaluates AI's ability to discover new biological knowledge using a three-stage pipeline of data acquisition, question-answer extraction, and quality filtering. The benchmark addresses the critical problem of data contamination in static datasets and provides monthly updates across 12 biomedical domains, revealing current limitations in state-of-the-art AI models' knowledge discovery capabilities.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

MMAI Gym for Science: Training Liquid Foundation Models for Drug Discovery

Researchers introduce MMAI Gym for Science, a training framework for molecular foundation models in drug discovery. Their Liquid Foundation Model (LFM) outperforms larger general-purpose models on drug discovery tasks while being more efficient and specialized for molecular applications.

AI · Bullish · Google DeepMind Blog · Oct 9 · 7/10 · 5

Demis Hassabis & John Jumper awarded Nobel Prize in Chemistry

Demis Hassabis and John Jumper have been awarded the Nobel Prize in Chemistry for developing AlphaFold, an AI system that predicts 3D protein structures from amino acid sequences. This recognition highlights the transformative impact of AI in scientific research and drug discovery.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials

Researchers introduced CrystalXRD-Bench, a 250-sample benchmark dataset for evaluating vision-language models on crystallographic peak indexing from X-ray diffraction patterns. Despite testing seven leading VLMs, the best model achieved only 37.6% exact-match accuracy, revealing significant gaps in how AI systems handle precise scientific figure interpretation and multi-step reasoning.

🧠 GPT-5
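
The exact-match figure quoted for CrystalXRD-Bench can be illustrated with a minimal sketch. The summary does not describe the benchmark's actual scoring rule or data layout, so everything here is an assumption: each sample is taken to be a list of (h, k, l) Miller indices, one per diffraction peak, and a sample counts as correct only if the whole list matches the reference. The function name and the toy data are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of samples whose predicted index list equals the reference exactly.

    Assumed layout (not taken from the benchmark): each sample is a list of
    (h, k, l) tuples, one tuple per indexed peak, in peak order.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(1 for p, r in zip(predictions, references) if list(p) == list(r))
    return hits / len(references)


# Toy data, invented for illustration only.
refs = [[(1, 1, 1), (2, 0, 0)], [(1, 1, 0)], [(2, 2, 0), (3, 1, 1)]]
preds = [[(1, 1, 1), (2, 0, 0)], [(1, 0, 0)], [(2, 2, 0), (3, 1, 0)]]

score = exact_match_accuracy(preds, refs)
print(f"exact-match accuracy: {score:.3f}")
```

Under this all-or-nothing rule a single wrong index in a multi-peak pattern zeroes out the sample, which is one reason exact-match scores can sit well below per-peak accuracy.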
AI · Neutral · arXiv – CS AI · 2d ago · 6/10

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Researchers introduce CausaLab, a benchmarking environment that tests whether LLM agents can both solve causal discovery problems and accurately recover the underlying causal mechanisms. Experiments reveal a significant gap between prediction accuracy (92%) and structural causal model recovery (0.471 F1 score), exposing limitations in current AI systems' ability to perform rigorous scientific reasoning.

🧠 GPT-5
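
To make the CausaLab distinction between predicting outcomes and recovering structure concrete, here is a minimal sketch of an F1 score computed over directed edges of a causal graph. This is one common way to score structure recovery, not necessarily the metric CausaLab uses; the function name and the toy graphs are hypothetical.

```python
def edge_f1(predicted_edges, true_edges):
    """F1 over directed edges; an edge is a (cause, effect) pair.

    A reversed edge counts as wrong, since direction is part of the structure.
    """
    predicted, true = set(predicted_edges), set(true_edges)
    true_positives = len(predicted & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


# Toy graphs, invented for illustration only.
true_graph = {("X", "Y"), ("Y", "Z"), ("X", "Z")}
predicted_graph = {("X", "Y"), ("Z", "Y"), ("X", "W")}

f1 = edge_f1(predicted_graph, true_graph)
print(f"edge F1: {f1:.3f}")
```

A model can predict the value of an outcome variable well from correlated inputs while still getting edge directions wrong, so a high prediction accuracy and a low structural F1 are not contradictory.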
AI · Neutral · arXiv – CS AI · 2d ago · 6/10

AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials

Researchers introduced AtomWorld, a benchmark for evaluating how well large language models can perform spatial reasoning tasks in materials science, specifically atomic structure manipulation. The study reveals that current LLMs like Claude Opus 4.6 struggle with complex spatial operations, achieving success rates below 12% for rotation tasks, suggesting they function better as collaborative tools than autonomous scientific agents.

🧠 Claude 🧠 Opus
AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Predicting Causal Effects from Natural Language Queries using Structured Representations

Researchers introduce Query2Effect, a 72,000-question benchmark for predicting causal effect sizes from natural language queries using LLMs. A two-step framework combining structured representation generation with supervised encoding reduces prediction error by 27-71% compared to standard LLMs, demonstrating that separating semantic interpretation from numerical estimation improves both in-domain performance and out-of-domain generalization.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields

Researchers introduced OmniMatBench, a comprehensive multimodal reasoning benchmark containing 3,171 expert-curated problems across 19 materials science subfields. Evaluation of 13 major language models revealed significant gaps in AI reasoning capabilities, with the best model achieving only 37.2% accuracy, highlighting the need for improved scientific AI systems.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

MetaboT: An LLM-based Multi-Agent Framework for Interactive Analysis of Mass Spectrometry Metabolomics Knowledge Graphs

MetaboT is an open-source LLM-based framework that translates natural-language questions into SPARQL queries for metabolomics knowledge graphs, significantly lowering technical barriers for researchers without programming expertise. The multi-agent architecture addresses hallucination and schema-compliance issues through specialized agents for validation, entity resolution, and query refinement, validated on the Experimental Natural Products Knowledge Graph.

AI · Bullish · Google DeepMind Blog · May 12 · 6/10

Co-Scientist: A multi-agent AI partner to accelerate research

Google has introduced Co-Scientist, a multi-agent AI system built on Gemini designed to assist researchers in accelerating scientific discovery. The tool represents a significant step in applying large language models to collaborative research workflows, potentially transforming how scientists approach complex problems.

🧠 Gemini
AI · Neutral · arXiv – CS AI · May 12 · 6/10

ASIA: an Autonomous System Identification Agent

ASIA is an autonomous AI agent framework that automates system identification tasks by delegating model selection, training algorithms, and hyperparameter tuning to a large language model. The framework eliminates manual trial-and-error processes in dynamical systems modeling, though empirical testing reveals concerns around test leakage and reproducibility.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Can LLMs Predict Polymer Physics Just by Reading Synthesis and Processing Prose?

Researchers introduced PolyLM, a 9-billion-parameter language model that predicts polymer physical and mechanical properties directly from scientific literature without requiring structural chemical data. The model achieved a median R² of 0.74 across 22 diverse properties by training on 185,000 papers and 276,400 polymer samples, demonstrating that natural language processing can effectively capture the experimental context that traditional structure-only models miss.
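
For readers unfamiliar with the headline metric, the sketch below shows how a median R² across several properties can be computed: one coefficient of determination per property, then the median of those values. The property names and numbers are invented for illustration and are not PolyLM results; the paper's exact evaluation protocol is not described in this summary.

```python
from statistics import mean, median


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1 - ss_res / ss_tot


# Hypothetical per-property (true values, predicted values) pairs.
per_property = {
    "property_a": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # perfect fit
    "property_b": ([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]),  # always predicts the mean
    "property_c": ([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]),  # one prediction off by 1
}

scores = {name: r_squared(t, p) for name, (t, p) in per_property.items()}
median_r2 = median(scores.values())
print(scores, median_r2)
```

Reporting the median rather than the mean keeps a few very poorly predicted properties from dominating the summary number.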

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

Researchers introduce ChemCost, a benchmark for evaluating LLM agents on chemical cost estimation from reaction descriptions. The study reveals that even frontier LLMs achieve only 50.6% accuracy on clean inputs and degrade significantly with realistic noise, exposing brittleness in parsing, evidence integration, and tool use despite access to domain-specific tools.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation

LithoBench introduces a comprehensive benchmark dataset for evaluating large multimodal models on remote-sensing lithology interpretation, containing 10,000 expert-annotated instances across cognitive levels from identification to reasoning. The research reveals significant gaps in current vision-language models' ability to handle knowledge-intensive geological tasks, highlighting the challenges of applying general-purpose AI to specialized domain expertise.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Researchers have successfully adapted Vision-Language Models (VLMs) based on LLaMA 3.2 to classify neutrino events in high-energy physics detector data, demonstrating that transformer-based architectures outperform traditional CNNs while offering superior interpretability. This work showcases the broader applicability of large multimodal AI models beyond natural language processing to specialized scientific domains.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

Researchers have released LABBench2, an upgraded benchmark with nearly 1,900 tasks designed to measure AI systems' real-world capabilities in biology research beyond theoretical knowledge. Current frontier models score 26-46% lower on it than on the original LAB-Bench, indicating that the new tasks are considerably harder and that substantial room for improvement remains.

$OP 🏢 Hugging Face