y0news

#hypothesis-testing News & Analysis

8 articles tagged with #hypothesis-testing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery

Researchers introduce StatefulDiscovery, a framework that enables AI agents to conduct open-ended scientific discovery by maintaining explicit investigation state and coupling it with evidence-calibrated claim formation. The system addresses the challenge of avoiding overinterpretation by coordinating exploration trajectory with evidential support, demonstrated across 40 real-data tasks where it outperformed baseline approaches in producing well-supported, high-value claims.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

Researchers introduce PACE, a statistical testing framework that prevents self-evolving AI agents from committing false improvements to their own prompts and workflows. Unlike naive greedy acceptance rules that accumulate errors through repeated testing, PACE uses sequential hypothesis testing to distinguish genuine improvements from noise, reducing harmful modifications by 30-42% while maintaining accuracy at lower computational cost.
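
To illustrate why a sequential test can be checked after every trial without inflating the false-acceptance rate, here is a minimal sketch of a generic anytime-valid acceptance rule. It is not PACE's actual procedure, which the summary does not specify. The function name `sequential_accept`, the binary win/loss encoding, and the fixed alternative win rate `q` are all assumptions made for illustration. The rule multiplies a likelihood ratio against the null "the candidate wins at most half the time" and accepts once that running product reaches 1/alpha.

```python
from typing import Iterable, Optional


def sequential_accept(
    wins: Iterable[int], alpha: float = 0.05, q: float = 0.7
) -> Optional[int]:
    """Return the 1-based trial at which the candidate is accepted, else None.

    Hypothetical sketch: each element of `wins` is 1 if the candidate change
    beat the incumbent on one paired evaluation, 0 otherwise.
    Null hypothesis: P(win) <= 0.5. `q` is an assumed alternative win rate.
    """
    if not 0.5 < q < 1.0:
        raise ValueError("q must lie strictly between 0.5 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")

    wealth = 1.0              # running likelihood ratio (a test supermartingale)
    threshold = 1.0 / alpha   # crossing it has probability <= alpha under the null
    for step, win in enumerate(wins, start=1):
        # Likelihood ratio of Bernoulli(q) against Bernoulli(0.5) for this trial.
        wealth *= (q if win else 1.0 - q) / 0.5
        if wealth >= threshold:
            return step       # safe to stop here, however often we peeked
    return None               # evidence never sufficed: keep the incumbent


# Nine straight wins are needed at alpha = 0.05 and q = 0.7 (1.4**9 > 20).
print(sequential_accept([1] * 20))      # 9
# A coin-flip-like record never accumulates enough evidence.
print(sequential_accept([1, 0] * 50))   # None
```

A naive greedy rule would accept as soon as the candidate is ahead on some prefix of trials. The wealth threshold above instead bounds the chance of ever accepting a non-improvement by alpha (via Ville's inequality), whenever the check is made.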

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Margin-Adaptive Confidence Ranking for Reliable LLM Judgement

Researchers address a critical flaw in LLM confidence estimation for achieving human-AI agreement by developing a learned confidence estimator with theoretical generalization guarantees. This approach improves upon prior methods that assume confidence monotonically correlates with disagreement risk, offering practical benefits for aligning AI systems with human judgment.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

Researchers introduce FALSIFYBENCH, an evaluation framework that tests whether large language models can perform inductive reasoning through hypothesis-driven discovery tasks. Testing 12 LLMs reveals that reasoning models outperform instruction-tuned models, with success primarily driven by the ability to actively falsify hypotheses rather than confirm them.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Auto-Discovery-Bench: Diagnosing Structured State Tracking in Oracle-Guided Discovery

Researchers introduce Auto-Discovery-Bench, a diagnostic benchmark that tests AI agents' ability to maintain and update structured beliefs through iterative hypothesis-intervention-feedback cycles. The benchmark reveals that performance degrades significantly as complexity variables increase, and it identifies limited long-range integration of structured information as a key bottleneck for scientific discovery agents.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

When prompt perturbations break your A/B test: A valid statistical test for generative surveying

Researchers demonstrate that standard statistical hypothesis tests fail when applied to generative surveying, where LLM-based personas provide market research feedback. The study proposes a valid permutation test that accounts for prompt sensitivity and provides guidance on optimal resource allocation for this emerging research methodology.
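
For readers unfamiliar with permutation tests, the following is a minimal sketch of the textbook two-sample version. It is not the specific test proposed in the paper, whose construction the summary does not describe. The function name `permutation_p_value` and its parameters are illustrative assumptions. A permutation test is valid only if the shuffled units are exchangeable under the null. One way to respect prompt sensitivity (again an assumption) is to pass one averaged score per prompt variant rather than one score per individual response.

```python
import random
from typing import Sequence


def permutation_p_value(
    a: Sequence[float], b: Sequence[float], n_perm: int = 10_000, seed: int = 0
) -> float:
    """Two-sided Monte Carlo permutation p-value for a difference in means.

    Hypothetical sketch: `a` and `b` hold scores for arms A and B, where each
    entry is assumed to be an exchangeable unit under the null hypothesis.
    """
    if not a or not b:
        raise ValueError("both groups must be non-empty")
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    n_a = len(a)

    def abs_mean_diff(xs: Sequence[float]) -> float:
        # Treat the first n_a entries as arm A and the rest as arm B.
        return abs(sum(xs[:n_a]) / n_a - sum(xs[n_a:]) / (len(xs) - n_a))

    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign arm labels at random
        if abs_mean_diff(pooled) >= observed - 1e-12:  # tolerate float rounding
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value means that few random relabellings produce a gap as large as the observed one. Identical groups give a p-value of 1, since every relabelling matches or exceeds an observed gap of zero.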

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Adaptive auditing of AI systems with anytime-valid guarantees

Researchers introduce an adaptive auditing framework for AI systems that maintains statistical rigor while evaluating generative AI failure modes with limited observations. Using Safe Anytime-Valid Inference, the method enables auditors to draw reliable conclusions from as few as 20 test cases through sequential hypothesis testing, addressing a critical bottleneck in AI safety evaluation.

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 3

In-Context Learning for Pure Exploration

Researchers introduce In-Context Pure Explorer (ICPE), a Transformer-based model that learns to actively collect data and identify correct hypotheses in sequential testing problems without parameter updates. The model demonstrates competitive performance across various benchmarks including multi-armed bandit problems and generalized search tasks.