AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers identify a critical vulnerability in agentic memory systems where Large Language Models retrieve and amplify spurious correlations from stored information, leading to erroneous reasoning in downstream decisions. The study benchmarks this risk and proposes CAMEL, a lightweight calibration method that mitigates spurious pattern reliance while maintaining performance on clean data.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers propose a novel uncertainty quantification method for Prior-Data Fitted Networks (PFNs), emerging foundation models for tabular data prediction, using martingale posteriors to provide calibrated confidence estimates. The technique is tuning-free, computationally efficient, and mathematically proven to converge, addressing a significant limitation in PFNs' practical applicability.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers demonstrate that variational Bayesian methods significantly improve Vision Language Models' reliability for Visual Question Answering tasks by enabling selective prediction with reduced hallucinations and overconfidence. The proposed Variational VQA approach shows particular strength at low error tolerances and offers a practical path to making large multimodal models safer without proportional computational costs.
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers propose Evidential Transformation Network (ETN), a lightweight post-hoc module that converts pretrained models into evidential models for uncertainty estimation without retraining. ETN operates in logit space using sample-dependent affine transformations and Dirichlet distributions, demonstrating improved uncertainty quantification across vision and language benchmarks with minimal computational overhead.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers present a method to improve error prediction in Large Language Models by distinguishing between genuine model uncertainty and input ambiguity. Using uncertainty quantification metrics on question-answering tasks, they demonstrate that ambiguity information significantly enhances error prediction accuracy, yielding improvements exceeding 10 percentage points across multiple datasets and model families.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers propose Sequential Bayesian Belief Tracking (SBBT), a framework for estimating the reliability of long reasoning chains in large language models before final answers are known. The study finds that probability calibration and ranking performance respond differently to various evidence types: scalar scores improve calibration metrics, while structural observations are needed for ranking tasks.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers introduce EvaluatorDPT, a decision-control model that predicts YES, NO, or TBD (to-be-determined) for high-stakes AI applications where uncertainty exists. The system learns deferral as an explicit outcome rather than hiding uncertainty in forced predictions, achieving 82.6% accuracy with auditable, policy-governed decision routing that can be inspected and controlled at inference time.
AI · Neutral · arXiv – CS AI · May 28 · 5/10
🧠Researchers propose Under-Cali, a machine learning framework for forecasting irregular multivariate time series data in real-time online settings. The system uses uncertainty estimation and dual-expert calibration to maintain accuracy despite dynamic data distribution shifts, achieving improvements over existing methods with minimal computational overhead.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce the Metacognitive Probe, a diagnostic tool measuring five dimensions of LLM confidence behavior including calibration, epistemic vigilance, and reasoning validation. Testing on eight frontier models and 69 humans reveals significant within-model disparities—exemplified by Gemini 2.5 Flash scoring 88 on confidence calibration but only 41 on difficulty prediction—suggesting composite benchmarks mask pockets of overconfidence.
🧠 Gemini
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠NoisyCoconut is an inference-time method that improves LLM reliability by injecting controlled noise into internal representations to generate diverse reasoning paths, enabling models to abstain when uncertain without requiring retraining. The technique reduces error rates from 40-70% to below 15% on mathematical reasoning tasks through unanimous agreement among noise-perturbed paths, offering practical reliability improvements compatible with existing models.
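The unanimous-agreement abstention rule mentioned in this summary can be sketched as follows. This is a minimal illustration only: the helper name `unanimous_or_abstain` is hypothetical, the answers are assumed to have already been produced by several perturbed reasoning paths, and the noise injection itself is not shown.

```python
from typing import Optional, Sequence


def unanimous_or_abstain(answers: Sequence[str]) -> Optional[str]:
    """Return the shared answer if every path agrees, otherwise None (abstain).

    Assumption: answers are compared after trimming whitespace and
    lowercasing; the summary does not say how answers are matched.
    """
    if not answers:
        return None  # nothing to agree on, so abstain
    normalized = {a.strip().lower() for a in answers}
    if len(normalized) == 1:
        return answers[0].strip()
    return None  # at least two paths disagree, so abstain


if __name__ == "__main__":
    print(unanimous_or_abstain(["42", "42 ", "42"]))
    print(unanimous_or_abstain(["42", "41", "42"]))
```

Requiring full agreement trades coverage for accuracy: the model answers fewer questions, but the ones it does answer are backed by every sampled path.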
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose using multidimensional self-assessment based on cognitive appraisal theory to predict LLM failures more reliably than confidence alone. Testing across 12 models and 38 tasks, they find effort and ability dimensions consistently outperform confidence, with task type determining which dimension proves most predictive.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers propose Adaptive Conformal Semantic Entropy (ACSE), a novel method for quantifying uncertainty in large language model outputs by measuring semantic diversity rather than relying solely on lexical or probabilistic measures. The approach uses conformal calibration to provide statistical guarantees on error rates, demonstrating significant performance improvements over existing uncertainty quantification baselines.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce VOLTA, a simplified deep learning approach for uncertainty quantification that outperforms ten established baselines including ensemble methods and MC Dropout. The method achieves superior calibration with expected calibration error of 0.010 and competitive accuracy across multiple datasets, suggesting that complex auxiliary losses may be unnecessary for reliable uncertainty estimation in safety-critical applications.
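For readers unfamiliar with the metric, expected calibration error (ECE) is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below shows that standard equal-width-bin formulation; the bin count and function name are assumptions, and the summary does not say which variant the paper uses.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,  # assumed default; papers vary
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

An ECE of 0.010 on this scale would mean that, averaged over bins, stated confidence differs from observed accuracy by about one percentage point.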
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers demonstrate that Large Language Models used as judges suffer from score range bias, where evaluation outputs are highly sensitive to predefined scoring scales. Using contrastive decoding techniques, they achieve up to 11.7% improvement in alignment with human judgments across different score ranges.
AI · Neutral · arXiv – CS AI · Mar 27 · 6/10
🧠Researchers introduce a new framework to evaluate how well Large Language Models understand their own knowledge limitations, finding that traditional confidence metrics miss key differences between models. The study reveals that models with similar accuracy can differ widely in metacognitive ability, that is, in their capacity to know what they don't know.
🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 3 · 5/10 · 4
🧠Researchers developed a conformal prediction framework for Large Language Models used in medical entity extraction, testing on FDA drug labels and radiology reports. The study found that model calibration varies significantly across clinical domains, with models being underconfident on structured data but overconfident on free-text reports, achieving 90% target coverage with 9-13% rejection rates.
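As background on how a coverage target such as 90% is typically enforced, the sketch below shows the generic split conformal calibration step: pick a threshold from held-out nonconformity scores so that a new example's score falls at or below it with probability at least 1 - alpha. The function name and the choice of score are assumptions; the summary does not describe how the study defines its scores or its rejection rule.

```python
import math
from typing import Sequence


def conformal_threshold(cal_scores: Sequence[float], alpha: float = 0.1) -> float:
    """Split conformal quantile of held-out nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    If the calibration set is too small for the requested level, the
    threshold is infinite (every prediction is covered trivially).
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("calibration set must be non-empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")

    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


if __name__ == "__main__":
    # Hypothetical calibration scores, e.g. 1 - model confidence in the true label.
    scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.41, 0.50, 0.90]
    print(conformal_threshold(scores, alpha=0.2))
```

Under this scheme, a model that is overconfident on one domain produces larger calibration scores there, which pushes the threshold up and changes how many outputs must be flagged or rejected to keep coverage at the target.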