AI · Neutral · arXiv – CS AI · Jun 11 · 6/10
🧠Researchers introduce StatefulDiscovery, a framework that enables AI agents to conduct open-ended scientific discovery by maintaining explicit investigation state and coupling it with evidence-calibrated claim formation. The system guards against overinterpretation by coordinating its exploration trajectory with the available evidential support. Across 40 real-data tasks, it outperformed baseline approaches in producing well-supported, high-value claims.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers introduce PACE, a statistical testing framework that prevents self-evolving AI agents from committing false improvements to their own prompts and workflows. Unlike naive greedy acceptance rules that accumulate errors through repeated testing, PACE uses sequential hypothesis testing to distinguish genuine improvements from noise, reducing harmful modifications by 30-42% while maintaining accuracy at lower computational cost.
AI · Bullish · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers address a critical flaw in LLM confidence estimation for achieving human-AI agreement by developing a learned confidence estimator with theoretical generalization guarantees. This approach improves upon prior methods that assume confidence monotonically correlates with disagreement risk, offering practical benefits for aligning AI systems with human judgment.
AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠Researchers introduce FALSIFYBENCH, an evaluation framework that tests whether large language models can perform inductive reasoning through hypothesis-driven discovery tasks. Testing 12 LLMs reveals that reasoning models outperform instruction-tuned models, with success primarily driven by the ability to actively falsify hypotheses rather than confirm them.
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers introduce Auto-Discovery-Bench, a diagnostic benchmark that tests AI agents' ability to maintain and update structured beliefs through iterative hypothesis-intervention-feedback cycles. The benchmark reveals that performance degrades significantly as its complexity variables increase, and it identifies limited long-range integration of structured information as a key bottleneck for scientific discovery agents.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers demonstrate that standard statistical hypothesis tests fail when applied to generative surveying, where LLM-based personas provide market research feedback. The study proposes a valid permutation test that accounts for prompt sensitivity and provides guidance on optimal resource allocation for this emerging research methodology.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce an adaptive auditing framework for AI systems that maintains statistical rigor while evaluating generative AI failure modes with limited observations. Using Safe Anytime-Valid Inference, the method enables auditors to draw reliable conclusions from as few as 20 test cases through sequential hypothesis testing, addressing a critical bottleneck in AI safety evaluation.
AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 3
🧠Researchers introduce In-Context Pure Explorer (ICPE), a Transformer-based model that learns to actively collect data and identify correct hypotheses in sequential testing problems without parameter updates. The model demonstrates competitive performance across various benchmarks including multi-armed bandit problems and generalized search tasks.