AI · Bearish · arXiv – CS AI · Jun 9 · 7/10
🧠A major peer-reviewed study of 6,749 scientists evaluated AI-generated research ideas and found that large language models lack imagination in scientific discovery, struggle to propose null hypotheses, and show weak agreement with human expert judgment. The research reveals significant limitations in AI's ability to accelerate science despite widespread industry optimism.
AI · Bullish · arXiv – CS AI · Jun 2 · 7/10
🧠Researchers introduce PiEvo, a framework that enables AI scientific agents to autonomously evolve their underlying scientific principles rather than search within fixed hypothesis spaces. The system achieves a 29.7-31.1% improvement in solution quality and 83.3% faster convergence by treating scientific discovery as Bayesian optimization over an expanding principle space.
AI · Neutral · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce a scenario-grounded benchmark for evaluating large language models in scientific discovery, revealing significant performance gaps compared to general science benchmarks. The framework tests LLMs across biology, chemistry, materials, and physics through project-level tasks involving hypothesis generation and experimental design, showing that current models remain distant from achieving general scientific superintelligence despite demonstrating promise in specific applications.
AI · Neutral · arXiv – CS AI · Jun 10 · 6/10
🧠Researchers propose a new evolutionary framework for using large language models to generate diverse, high-quality scientific hypotheses by reformulating the search as a sampling problem inspired by parallel tempering. The approach addresses a critical limitation where traditional optimization-focused methods collapse into homogeneous solutions, enabling scientists to maintain multiple robust candidate hypotheses under fixed validation budgets across molecular, equation, and algorithm discovery domains.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers introduce DN-Hypo-Pipeline, an AI workflow leveraging large language models to automate scientific hypothesis generation from existing research literature. The system reconstructs novel explanations for observed phenomena and was validated in data science modeling, with two generated hypotheses producing algorithms that outperformed baseline models from the original papers.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers compared how human children and large language models approach inductive reasoning tasks under uncertainty, finding both similarities and critical differences in their information-seeking strategies. While LLMs replicate children's adaptive responses to environmental structure, they exhibit distinct biases toward over-observation and instruction compliance, suggesting fundamentally different underlying computational principles govern their decision-making.
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠HypoAgent is a new AI framework that uses multiple specialized agents to generate logical hypotheses from knowledge graphs through interactive dialogue. The system excels at understanding evolving user intent across multi-turn conversations and diagnosing why generated hypotheses fail, achieving state-of-the-art performance on both commonsense and biomedical knowledge graphs.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce ProjectionBench, a novel evaluation framework that tests large language models' scientific discovery capabilities by progressively revealing information about research problems. The benchmark assesses both innovative reasoning with minimal context and grounded hypothesis generation with full experimental details across 45 materials science papers, finding that GPT-5.4 and Gemini 3.1 Pro achieve strong alignment with ground-truth conclusions.
🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠MOOSE-Copilot introduces a unified framework for scientific hypothesis discovery that combines exploratory ideation with fine-grained refinement through structured human-AI interaction. The web-based system lets scientists guide LLM-powered discovery through initial blueprints, routing decisions, and feedback mechanisms. It outperforms autonomous baselines, and its intuitive visual interface lowers the barrier to entry.
🏢 Microsoft
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers evaluated how knowledge graphs (KGs) influence hypothesis generation in large language models across multiple models, finding that compact subgraphs often perform comparably to full graphs. The study reveals that KG utility is selective and model-dependent, with useful signal often recoverable from structured, compressed subsets rather than complete local graphs.
🧠 Gemini · 🧠 Llama
AI · Bullish · Ars Technica – AI · May 19 · 6/10
🧠Two AI-based science assistants have demonstrated success in drug-retargeting tasks, with both tools capable of generating hypotheses and one additionally analyzing relevant data. This advancement showcases AI's growing role in accelerating pharmaceutical research and drug discovery processes.
AI · Neutral · arXiv – CS AI · Mar 27 · 6/10
🧠Researchers evaluated whether large language models follow Occam's Razor when performing inductive and abductive reasoning, finding that while LLMs can handle simple scenarios, they struggle with complex world models and with producing high-quality, simple hypotheses. The study introduces a new framework for generating reasoning questions and an automated metric to assess hypothesis quality based on correctness and simplicity.