AI · Neutral · arXiv – CS AI · 2d ago · 7/10
🧠Researchers identify 'retrieval-state lock-in,' a failure mode in retrieval-augmented generation (RAG) systems where multiple sampled answers agree despite being wrong because they condition on the same defective retrieval state. The study proposes decomposing confidence scores into three components—answer surface, evidence, and retrieval state—achieving 91.9% precision by requiring all three to agree, though this certifies only 7.7% of answers as low-risk.
AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers demonstrate that multimodal large language models (MLLMs) struggle with confidence calibration in medical tasks, where their stated confidence often misaligns with actual accuracy. A new method combining Multi-Strategy Fusion-Based Interrogation with expert LLM assessment reduces calibration error by 40% across medical VQA datasets, addressing critical reliability concerns for AI-assisted diagnosis.
AI · Neutral · arXiv – CS AI · May 28 · 7/10
🧠Researchers demonstrate that Large Language Model (LLM) confidence calibration measurements are highly sensitive to methodological choices, including how answers are selected, token probabilities are calculated, and conditioning contexts are applied. The study reveals that verbalized confidence often reflects answer plausibility rather than actual correctness, challenging assumptions about LLM uncertainty quantification.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce CASPO, a framework that improves reasoning reliability in large language models by aligning token-level confidence with step-wise logical correctness through preference optimization. The method achieves better performance than tree-search approaches without requiring separate reward models, while introducing CaT inference that dynamically prunes uncertain reasoning branches with minimal computational overhead.
AI · Bearish · arXiv – CS AI · Mar 12 · 7/10
🧠A new study reveals that large language models exhibit patterns similar to the Dunning-Kruger effect, where poorly performing AI models show severe overconfidence in their abilities. The research tested four major models across 24,000 trials, finding that Kimi K2 displayed the worst calibration with 72.6% overconfidence despite only 23.3% accuracy, while Claude Haiku 4.5 achieved the best performance with proper confidence calibration.
🧠 Claude · 🧠 Haiku · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers propose a new framework for collective decision-making where AI agents can abstain from voting when uncertain, extending the Condorcet Jury Theorem to confidence-gated settings. The study shows this selective participation approach can improve group accuracy and potentially reduce hallucinations in large language model systems.
AI · Neutral · arXiv – CS AI · Jun 10 · 6/10
🧠Researchers analyze multi-agent debate systems in AI by examining whether internal confidence signals (log-probabilities) correlate with external reasoning quality assessments and task accuracy. The study reveals significant role asymmetry between debating agents: confidence metrics predict reasoning quality twice as strongly for constructive agents as for auditing agents, suggesting that debate systems may have inherent structural biases.
AI · Bullish · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers demonstrate that multi-agent debate (MAD) for large language models significantly improves when agents have diverse initial viewpoints and explicitly communicate calibrated confidence levels. The study shows that vanilla MAD often underperforms simple majority voting despite higher computational costs, but two lightweight interventions—diversity-aware initialization and confidence-modulated debate protocols—consistently outperform both baseline approaches across multiple reasoning benchmarks.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers evaluated metacognitive monitoring across 33 frontier LLMs using 47,151 MMLU benchmark items, finding significant domain-level variation that aggregate performance scores mask. Applied/Professional knowledge domains showed consistently strong self-monitoring (AUROC 0.742), while Formal Reasoning and Natural Science proved the most challenging, a result with implications for targeted model deployment.
🏢 OpenAI · 🏢 Anthropic · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers present Deliberative Searcher, a framework that enhances large language model reliability by combining certainty calibration with retrieval-based search for question answering. The system uses reinforcement learning with soft reliability constraints to improve alignment between model confidence and actual correctness, producing more trustworthy outputs.
AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠Researchers developed I-CALM, a prompt-based framework that reduces AI hallucinations by encouraging language models to abstain from answering when uncertain, rather than providing confident but incorrect responses. The method uses verbal confidence assessment and reward schemes to improve reliability without model retraining.
🧠 GPT-5