#statistical-inference News & Analysis

9 articles tagged with #statistical-inference. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

Researchers introduced PRECISE, a method that combines human annotations with LLM judgments to produce statistically reliable ranking evaluation metrics. The approach reduces the computational complexity of hierarchical metrics such as Precision@K and showed a 21% error reduction on benchmarks; in production systems, real-world validation showed a sales lift of +407 basis points.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

Researchers introduce GLIDE, an open-source Python library that standardizes prediction-powered inference (PPI) methods for evaluating AI systems and language models. The library combines human annotation with LLM evaluations to produce unbiased estimates with valid confidence intervals, potentially reducing annotation costs while maintaining accuracy.
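
As a rough illustration of the idea behind prediction-powered inference, the sketch below estimates a mean (for example, a pass rate) from a small human-labeled set plus a large set scored only by an LLM judge. It uses the textbook PPI mean estimator with a normal-approximation interval and does not use GLIDE's API; the function name `ppi_mean_ci` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(human_labels, llm_on_labeled, llm_on_unlabeled, alpha=0.05):
    """Prediction-powered estimate and CI for a mean (illustrative sketch).

    human_labels     : human scores on the small labeled set
    llm_on_labeled   : LLM scores on the same labeled items
    llm_on_unlabeled : LLM scores on the large unlabeled set
    """
    human = [float(v) for v in human_labels]
    llm_lab = [float(v) for v in llm_on_labeled]
    llm_unl = [float(v) for v in llm_on_unlabeled]
    n, big_n = len(human), len(llm_unl)

    # Rectifier: how far the LLM judge is from the human on labeled items.
    rectifier = [y - f for y, f in zip(human, llm_lab)]

    # LLM-only mean, corrected by the average rectifier.
    estimate = fmean(llm_unl) + fmean(rectifier)

    # The two samples are independent, so their variances add.
    std_err = math.sqrt(variance(llm_unl) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)
```

When the LLM judge agrees closely with the human labels, the rectifier has low variance and the interval is driven mostly by the large unlabeled set, which is where the potential annotation savings come from.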

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Correcting Split Selection in Online Decision Trees via Anytime-Valid Inference

Researchers propose an anytime-valid inference method to correct split selection in decision trees used for streaming data, addressing a critical statistical gap where existing Hoeffding Trees lack valid guarantees despite empirical success. The approach provides false-split control across arbitrary data streams while producing smaller, more efficient trees than current methods.
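
For context, the classic Hoeffding Tree split rule that the paper critiques can be sketched as follows. This is the standard baseline heuristic, not the proposed anytime-valid correction; the function names and parameters are illustrative.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding radius: sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the gap between the two best attributes exceeds the bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

The bound is derived for a fixed sample size `n`, whereas a streaming learner re-checks it as data arrives and compares many candidate attributes, which is one way the nominal guarantee can fail in practice.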

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing

Researchers introduce Structure-Adaptive Conformal Inference (SCQ and P-TAMS), a statistical framework that improves out-of-distribution testing in machine learning by incorporating auxiliary structural information like spatiotemporal patterns. The approach provides finite-sample error-rate control and enhanced interpretability compared to traditional conformal methods, with applications in high-stakes prediction scenarios.
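
The "traditional conformal methods" used as the point of comparison typically rest on a conformal p-value like the one sketched below. This is plain split conformal testing, not SCQ or P-TAMS, and it ignores any auxiliary structure; the function name is hypothetical and higher scores are assumed to mean "more anomalous".

```python
def conformal_p_value(calibration_scores, test_score):
    """Split-conformal p-value for one test point.

    calibration_scores: nonconformity scores of held-out in-distribution data.
    A small p-value suggests the test point is out-of-distribution.
    """
    n = len(calibration_scores)
    at_least_as_extreme = sum(s >= test_score for s in calibration_scores)
    # The +1 terms count the test point itself and keep the p-value valid
    # in finite samples under exchangeability.
    return (1 + at_least_as_extreme) / (n + 1)
```

With four calibration scores, the smallest achievable p-value is 1/5, so fine-grained error-rate control at scale requires a reasonably large calibration set.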

AI · Neutral · arXiv – CS AI · May 12 · 5/10

Weighted Rules under the Stable Model Semantics

Researchers introduce weighted rules under stable model semantics, combining logic programming with probabilistic methods similar to Markov Logic Networks. This advancement enables answer set programs to handle inconsistencies, rank solutions, assign probabilities, and perform statistical inference—moving beyond the deterministic limitations of traditional logic-based systems.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Adaptive auditing of AI systems with anytime-valid guarantees

Researchers introduce an adaptive auditing framework for AI systems that maintains statistical rigor while evaluating generative AI failure modes with limited observations. Using Safe Anytime-Valid Inference, the method enables auditors to draw reliable conclusions from as few as 20 test cases through sequential hypothesis testing, addressing a critical bottleneck in AI safety evaluation.
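
To give a feel for sequential testing with anytime-valid guarantees, the sketch below runs a simple betting-style test of the null hypothesis "the failure rate is at most `p0`". It is a generic test supermartingale with a fixed bet, not the paper's auditing procedure; `sequential_audit`, `p0`, and `bet` are illustrative names and defaults.

```python
def sequential_audit(outcomes, p0=0.05, alpha=0.05, bet=2.0):
    """Test H0: failure rate <= p0, checking after every observation.

    outcomes: iterable of 0/1 values, where 1 marks a failed test case.
    Returns (stopping_time, wealth); stopping_time is None if H0 is never
    rejected. Under H0 the wealth is a nonnegative supermartingale, so by
    Ville's inequality it reaches 1/alpha with probability at most alpha,
    no matter when the auditor chooses to stop.
    """
    if not 0 <= bet <= 1 / p0:
        raise ValueError("bet must lie in [0, 1/p0] to keep wealth nonnegative")
    wealth = 1.0
    for t, x in enumerate(outcomes, start=1):
        wealth *= 1 + bet * (x - p0)  # grows on failures, shrinks on passes
        if wealth >= 1 / alpha:
            return t, wealth
    return None, wealth
```

With these defaults, three consecutive failures are enough to reject (wealth 2.9 × 2.9 × 2.9 ≈ 24.4 ≥ 20), while a run of passes only shrinks the wealth and never triggers a rejection.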

AI · Neutral · arXiv – CS AI · May 11 · 5/10

Statistical inference with belief functions: A survey

This academic survey examines statistical inference methods within the belief functions framework, a mathematical approach for characterizing uncertainty when insufficient data prevents traditional probability distribution learning. The work reviews key contributions to inferring belief measures from statistical data, offering theoretical foundations relevant to uncertainty quantification in data-sparse environments.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

Researchers propose CITE, an algorithm that enables reliable certification of Large Language Model outputs through multiple sampling while controlling error rates under data-dependent stopping conditions. The method addresses a critical challenge in LLM reliability by providing statistical guarantees without requiring advance knowledge of possible answer categories.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

Researchers propose SIREN, a new evaluation protocol that corrects for the 'winner's curse' bias in large language model benchmarking. This addresses a critical flaw where reusing benchmark items during model tuning inflates performance estimates, potentially leading to flawed deployment decisions based on unreliable comparisons.
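
The winner's curse itself is easy to reproduce in a toy simulation. The sketch below is not SIREN; it only shows the bias, using hypothetical settings in which every candidate model has the same true accuracy, and compares the winner's benchmark score with a score on fresh items.

```python
import random
from statistics import fmean


def simulate_winners_curse(n_models=20, n_items=200, true_acc=0.7,
                           n_trials=200, seed=0):
    """Return (mean score of the selected winner, mean score on fresh items)."""
    rng = random.Random(seed)

    def benchmark_score():
        # Observed accuracy of one model on n_items independent items.
        return sum(rng.random() < true_acc for _ in range(n_items)) / n_items

    selected, fresh = [], []
    for _ in range(n_trials):
        scores = [benchmark_score() for _ in range(n_models)]
        selected.append(max(scores))     # score reported for the "best" model
        fresh.append(benchmark_score())  # same model re-scored on new items
    return fmean(selected), fmean(fresh)
```

Although all models are equally good, the selected score sits noticeably above the true accuracy because the maximum of many noisy estimates is biased upward, while the fresh-item score stays close to it.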