#statistical-testing News & Analysis

11 articles tagged with #statistical-testing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Sanity Checks for Agentic Data Science

Researchers propose lightweight sanity checks for agentic data science (ADS) systems to detect falsely optimistic conclusions that users struggle to identify. Built on the Predictability-Computability-Stability framework, the checks test whether AI agents such as OpenAI Codex reliably distinguish signal from noise. Across 11 real datasets, the agent reached unsupported affirmative conclusions on more than half, even though individual runs suggested otherwise.

🏢 OpenAI

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10

AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows

Researchers introduce AgentAssay, the first framework for regression testing AI agent workflows, achieving 78-100% cost reduction while maintaining statistical guarantees. The system uses behavioral fingerprinting and stochastic testing methods to detect regressions in autonomous AI agents across multiple models including GPT-5.2, Claude Sonnet 4.6, and others.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

SPADE: Structure-Prior Adaptive Decision Estimation

SPADE introduces a machine learning framework that adaptively decides whether to enforce physical-structure priors (conservation laws, Hamiltonian forms) based on data evidence, using statistical tests and shrinkage estimation. The method automatically calibrates prior enforcement strength and selects among competing structures, achieving oracle-level performance while reducing computational overhead compared to cross-validation approaches.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Where Is My Physics Wrong? Localized and Identifiable Discovery of Model Discrepancy

Researchers introduce LISDD, a framework for identifying where and why physics-based models fail by localizing errors to specific operating regimes and discovering sparse symbolic corrections. The method outperforms existing global-correction approaches by keeping parameter bias near zero while maintaining statistical rigor through finite-sample testing.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Counterfactual Explanations for Deep Two-Sample Testing

Researchers propose a counterfactual explanation framework for deep two-sample testing that generates interpretable edits to show which data features drive statistical differences between groups. The method combines diffusion autoencoders with deep learning models to produce plausible sample transformations that reduce distributional discrepancies, validated on synthetic data and MRI cohorts.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Conditional Coverage Diagnostics for Conformal Prediction

Researchers introduce Excess Risk of Target Coverage (ERT), a new metric framework for evaluating conditional coverage in conformal prediction systems. The approach reformulates coverage assessment as a classification problem, providing more statistically powerful diagnostics than existing methods while offering conservative estimates of miscoverage and enabling distinction between over- and under-coverage effects.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

Researchers propose a standardized measurement protocol for evaluating retrieval-augmented generation (RAG) systems using LLM judges, addressing inconsistencies in how semantic search quality is assessed. The standard fixes key variables like evidence budget and prompt while requiring cluster-aware statistical testing, revealing that previous comparisons may have overstated progress and that traditional BM25 retrieval outperforms pure semantic methods under controlled conditions.
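As a rough illustration of what "cluster-aware" testing can mean, the sketch below resamples whole clusters of related questions rather than individual questions when building a confidence interval for the score difference between two retrievers. It is a generic cluster bootstrap, not the protocol from the paper; the function name `cluster_bootstrap_ci`, the cluster labels, and the example scores are all hypothetical.

```python
import random

def cluster_bootstrap_ci(diffs_by_cluster, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean paired score difference.

    diffs_by_cluster maps a cluster id (hypothetical, e.g. a source document)
    to the per-question score differences (system A minus system B) in it.
    Whole clusters are resampled so within-cluster correlation is respected.
    """
    rng = random.Random(seed)
    clusters = [c for c in diffs_by_cluster.values() if c]
    if not clusters:
        raise ValueError("need at least one non-empty cluster")
    means = []
    for _ in range(n_boot):
        # Draw as many clusters as we observed, with replacement.
        sample = [rng.choice(clusters) for _ in clusters]
        flat = [d for cluster in sample for d in cluster]
        means.append(sum(flat) / len(flat))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Made-up judge-score differences, grouped by a hypothetical cluster key.
example = {
    "doc_a": [0.2, 0.1, 0.3],
    "doc_b": [-0.1, 0.0],
    "doc_c": [0.4, 0.2, 0.1, 0.3],
}
low, high = cluster_bootstrap_ci(example)
print(f"95% CI for mean difference: [{low:.3f}, {high:.3f}]")
```

If the interval comfortably excludes zero, the difference is unlikely to be an artifact of a few correlated questions; an interval built by resampling individual questions would typically be narrower and more optimistic.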

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

Researchers propose Calibrated Entropy Score (CES), a novel method for detecting hallucinations in large language models using entropy distribution patterns from a single forward pass. The technique achieves performance comparable to computationally expensive multi-sample methods while requiring only black-box access to token logits, with formal mathematical guarantees for detection accuracy.

🏢 Perplexity
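To make the raw ingredient concrete, the sketch below computes per-token Shannon entropy from logits obtained in a single forward pass. It is only the basic quantity such a method could start from, not the Calibrated Entropy Score itself, whose definition the summary does not give; the function names and the toy logits are hypothetical.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_profile(logits_per_token):
    """Entropy of every generated token, in generation order."""
    return [token_entropy(row) for row in logits_per_token]

# Toy logits for three generated tokens over a four-word vocabulary.
toy_logits = [
    [10.0, 0.0, 0.0, 0.0],  # confident prediction, low entropy
    [1.0, 1.0, 1.0, 1.0],   # uniform prediction, maximal entropy
    [2.0, 1.5, 0.0, 0.0],   # in between
]
print([round(h, 3) for h in entropy_profile(toy_logits)])
```

A detector would then compare the shape of this entropy profile against a reference, which is the part the paper's method supplies.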

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Probing Routing-Conditional Calibration in Attention-Residual Transformers

Researchers question whether routing traces in Attention-Residual transformers provide genuine evidence of improved post-hoc calibration beyond standard confidence metrics. Through rigorous statistical testing with matched controls, the study finds that routing-specific features offer minimal stable evidence of better calibration, suggesting previous claims of calibration improvements may reflect methodological artifacts rather than true model improvements.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

When LLMs get significantly worse: A statistical approach to detect model degradations

Researchers propose a statistical framework based on McNemar's test for deciding whether an optimization applied to a large language model causes real performance degradation or only noise. The method can detect even small accuracy drops (0.3%) while avoiding false alarms on theoretically lossless optimizations, and an implementation is provided for the LM Evaluation Harness.
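For readers unfamiliar with McNemar's test, here is a minimal sketch of the exact two-sided version applied to paired per-example correctness flags from a baseline and an optimized model. It illustrates the general test only and is not the paper's LM Evaluation Harness implementation; the function name `mcnemar_exact` and the example data are hypothetical.

```python
from math import comb

def mcnemar_exact(baseline_correct, optimized_correct):
    """Exact two-sided McNemar test on paired correctness flags.

    Returns (b, c, p_value) where b counts examples only the baseline got
    right and c counts examples only the optimized model got right.
    """
    if len(baseline_correct) != len(optimized_correct):
        raise ValueError("both models must be scored on the same examples")
    pairs = list(zip(baseline_correct, optimized_correct))
    # Only discordant pairs (the models disagree) carry information.
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows a Binomial(n, 0.5) tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Made-up example: the optimized model loses 8 of 100 examples and gains none.
baseline = [True] * 100
optimized = [False] * 8 + [True] * 92
b, c, p = mcnemar_exact(baseline, optimized)
print(f"baseline-only correct={b}, optimized-only correct={c}, p={p:.4f}")
```

Because the test looks only at examples where the two models disagree, it can flag a small but consistent drop that a comparison of two independent accuracy estimates would miss.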

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10

Efficient Ensemble Conditional Independence Test Framework for Causal Discovery

Researchers introduce E-CIT (Ensemble Conditional Independence Test), a new framework that significantly reduces computational costs in causal discovery by partitioning data into subsets and aggregating results. The method achieves linear computational complexity while maintaining competitive performance, particularly on real-world datasets.