AI · Bullish · arXiv – CS AI · Jun 5 · 7/10
🧠Researchers introduce PRECISE, a method that combines human annotations with LLM judgments to produce statistically reliable ranking evaluation metrics. The approach reduces computational complexity for hierarchical metrics like Precision@K and demonstrated a 21% error reduction on benchmarks, with real-world validation showing a +407 basis point sales lift in production systems.
AI · Bullish · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers introduce GLIDE, an open-source Python library that standardizes prediction-powered inference (PPI) methods for evaluating AI systems and language models. The library combines human annotation with LLM evaluations to produce unbiased estimates with valid confidence intervals, potentially reducing annotation costs while maintaining accuracy.
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers propose an anytime-valid inference method to correct split selection in decision trees used for streaming data, addressing a critical statistical gap where existing Hoeffding Trees lack valid guarantees despite empirical success. The approach provides false-split control across arbitrary data streams while producing smaller, more efficient trees than current methods.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce Structure-Adaptive Conformal Inference (SCQ and P-TAMS), a statistical framework that improves out-of-distribution testing in machine learning by incorporating auxiliary structural information like spatiotemporal patterns. The approach provides finite-sample error-rate control and enhanced interpretability compared to traditional conformal methods, with applications in high-stakes prediction scenarios.
AI · Neutral · arXiv – CS AI · May 12 · 5/10
🧠Researchers introduce weighted rules under stable model semantics, combining logic programming with probabilistic methods similar to Markov Logic Networks. This advancement enables answer set programs to handle inconsistencies, rank solutions, assign probabilities, and perform statistical inference—moving beyond the deterministic limitations of traditional logic-based systems.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce an adaptive auditing framework for AI systems that maintains statistical rigor while evaluating generative AI failure modes with limited observations. Using Safe Anytime-Valid Inference, the method enables auditors to draw reliable conclusions from as few as 20 test cases through sequential hypothesis testing, addressing a critical bottleneck in AI safety evaluation.
AI · Neutral · arXiv – CS AI · May 11 · 5/10
🧠This academic survey examines statistical inference methods within the belief functions framework, a mathematical approach for characterizing uncertainty when insufficient data prevents traditional probability distribution learning. The work reviews key contributions to inferring belief measures from statistical data, offering theoretical foundations relevant to uncertainty quantification in data-sparse environments.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose CITE, an algorithm that enables reliable certification of Large Language Model outputs through multiple sampling while controlling error rates under data-dependent stopping conditions. The method addresses a critical challenge in LLM reliability by providing statistical guarantees without requiring advance knowledge of possible answer categories.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose SIREN, a new evaluation protocol that corrects for the 'winner's curse' bias in large language model benchmarking. This addresses a critical flaw where reusing benchmark items during model tuning inflates performance estimates, potentially leading to flawed deployment decisions based on unreliable comparisons.