AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce MolLingo, a multi-agent AI system that automates molecular design by coordinating specialized agents through shared memory and domain-specific tools. The system uses BRICS-based Fragment Enumeration to represent molecules in chemically meaningful ways that LLMs can reason about effectively, achieving superior performance on drug design benchmarks compared to frontier models like GPT-5.
🧠 GPT-5
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠AIBuildAI-2 introduces a knowledge-enhanced AI agent that automatically builds machine learning models by combining large language models with an external, evolving knowledge system. The system achieves state-of-the-art performance, ranking first on MLE-Bench and placing in the top 6.6% of human teams in a predictive competition, democratizing AI model development for non-specialists.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠ScientistOne introduces Chain-of-Evidence, a verifiability framework addressing critical failures in autonomous research systems where AI agents produce plausible-looking but unreliable outputs including fabricated citations, unverified scores, and misaligned methods. The system achieves zero hallucinated references and perfect score verification across five research tasks, significantly outperforming existing baseline systems that exhibit systematic failure rates up to 80%.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠AutoDFT is a closed-loop multi-agent framework that automates density functional theory (DFT) calculations by embedding LLM reasoning throughout the entire computational lifecycle, rather than just the planning phase. The system achieves 94.1% success on a 34-task benchmark and enables non-experts to obtain reliable computational chemistry results by dynamically adapting to failures and unexpected outcomes.
🧠 GPT-5
AI · Bullish · MIT Technology Review · May 22 · 7/10
🧠During Google I/O, DeepMind CEO Demis Hassabis stated we are approaching the "singularity," signaling that AI-driven scientific advancement is accelerating rapidly. The keynote highlighted Google's positioning of AI as a transformative force for research and development across industries.
🏢 Google
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduced MDGYM, a benchmark testing AI agents' ability to autonomously execute molecular dynamics simulations, finding that even the strongest systems solve only 21% of easy tasks. The poor performance reveals that advanced code generation does not translate to physical reasoning, exposing a critical gap between general software engineering competence and domain-specific scientific workflows.
🧠 Claude
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers propose SciDC, a method that constrains large language model outputs using subject-specific scientific rules to reduce hallucinations and improve reliability. The approach demonstrates 12% average accuracy improvements across domain tasks including drug formulation, clinical diagnosis, and chemical synthesis planning.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce Logos, a compact AI model that combines multi-step logical reasoning with chemical consistency for molecular design. The model achieves strong performance in structural accuracy and chemical validity while using fewer parameters than larger language models, and provides transparent reasoning that can be inspected by humans.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that combines vision and language capabilities with strong performance in scientific and mathematical reasoning. The model demonstrates that careful architecture design and high-quality data curation can enable smaller models to achieve competitive performance with fewer computational resources.
AI · Neutral · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers have developed DBench-Bio, a dynamic benchmark system that automatically evaluates AI's ability to discover new biological knowledge using a three-stage pipeline of data acquisition, question-answer extraction, and quality filtering. The benchmark addresses the critical problem of data contamination in static datasets and provides monthly updates across 12 biomedical domains, revealing current limitations in state-of-the-art AI models' knowledge discovery capabilities.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers introduce MMAI Gym for Science, a training framework for molecular foundation models in drug discovery. Their Liquid Foundation Model (LFM) outperforms larger general-purpose models on drug discovery tasks while being more efficient and specialized for molecular applications.
AI · Bullish · Google DeepMind Blog · Oct 9 · 7/10 · 5
🧠Demis Hassabis and John Jumper have been awarded the Nobel Prize in Chemistry for developing AlphaFold, an AI system that predicts 3D protein structures from amino acid sequences. This recognition highlights the transformative impact of AI in scientific research and drug discovery.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduced CrystalXRD-Bench, a 250-sample benchmark dataset for evaluating vision-language models on crystallographic peak indexing from X-ray diffraction patterns. Despite testing seven leading VLMs, the best model achieved only 37.6% exact-match accuracy, revealing significant gaps in how AI systems handle precise scientific figure interpretation and multi-step reasoning.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce CausaLab, a benchmarking environment that tests whether LLM agents can both solve causal discovery problems and accurately recover the underlying causal mechanisms. Experiments reveal a significant gap between prediction accuracy (92%) and structural causal model recovery (0.471 F1 score), exposing limitations in current AI systems' ability to perform rigorous scientific reasoning.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduced AtomWorld, a benchmark for evaluating how well large language models can perform spatial reasoning tasks in materials science, specifically atomic structure manipulation. The study reveals that current LLMs like Claude Opus 4.6 struggle with complex spatial operations, achieving success rates below 12% for rotation tasks, suggesting they function better as collaborative tools than autonomous scientific agents.
🧠 Claude 🧠 Opus
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce Query2Effect, a 72,000-question benchmark for predicting causal effect sizes from natural language queries using LLMs. A two-step framework combining structured representation generation with supervised encoding reduces prediction error by 27-71% compared to standard LLMs, demonstrating that separating semantic interpretation from numerical estimation improves both in-domain performance and out-of-domain generalization.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduced OmniMatBench, a comprehensive multimodal reasoning benchmark containing 3,171 expert-curated problems across 19 materials science subfields. Evaluation of 13 major language models revealed significant gaps in AI reasoning capabilities, with the best model achieving only 37.2% accuracy, highlighting the need for improved scientific AI systems.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠MetaboT is an open-source LLM-based framework that translates natural-language questions into SPARQL queries for metabolomics knowledge graphs, significantly lowering technical barriers for researchers without programming expertise. The multi-agent architecture addresses hallucination and schema-compliance issues through specialized agents for validation, entity resolution, and query refinement, validated on the Experimental Natural Products Knowledge Graph.
AI · Bullish · Google DeepMind Blog · May 12 · 6/10
🧠Google has introduced Co-Scientist, a multi-agent AI system built on Gemini designed to assist researchers in accelerating scientific discovery. The tool represents a significant step in applying large language models to collaborative research workflows, potentially transforming how scientists approach complex problems.
🧠 Gemini
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠ASIA is an autonomous AI agent framework that automates system identification tasks by delegating model selection, training algorithms, and hyperparameter tuning to a large language model. The framework eliminates manual trial-and-error processes in dynamical systems modeling, though empirical testing reveals concerns around test leakage and reproducibility.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduced PolyLM, a 9-billion-parameter language model that predicts polymer physical and mechanical properties directly from scientific literature without requiring structural chemical data. The model achieved a median R² of 0.74 across 22 diverse properties by training on 185,000 papers and 276,400 polymer samples, demonstrating that natural language processing can effectively capture the experimental context that traditional structure-only models miss.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce ChemCost, a benchmark for evaluating LLM agents on chemical cost estimation from reaction descriptions. The study reveals that even frontier LLMs achieve only 50.6% accuracy on clean inputs and degrade significantly with realistic noise, exposing brittleness in parsing, evidence integration, and tool use despite access to domain-specific tools.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠LithoBench introduces a comprehensive benchmark dataset for evaluating large multimodal models on remote-sensing lithology interpretation, containing 10,000 expert-annotated instances across cognitive levels from identification to reasoning. The research reveals significant gaps in current vision-language models' ability to handle knowledge-intensive geological tasks, highlighting the challenges of applying general-purpose AI to specialized domain expertise.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers have successfully adapted Vision-Language Models (VLMs) based on LLaMA 3.2 to classify neutrino events in high-energy physics detector data, demonstrating that transformer-based architectures outperform traditional CNNs while offering superior interpretability. This work showcases the broader applicability of large multimodal AI models beyond natural language processing to specialized scientific domains.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers have released LABBench2, an upgraded benchmark with nearly 1,900 tasks designed to measure AI systems' real-world capabilities in biology research beyond theoretical knowledge. Current frontier models score 26-46% lower on the new benchmark than on the original LAB-Bench, indicating that it is considerably harder and that substantial room for improvement remains.
$OP 🏢 Hugging Face