AI · Bearish · arXiv – CS AI · Jun 11 · 7/10
🧠Researchers discover that Chain-of-Thought reasoning in large language models can paradoxically increase overconfidence when reasoning budgets exceed task-specific thresholds, a phenomenon called Calibration Drift Under Reasoning (CDUR). The study shows that while extended reasoning initially improves accuracy, it eventually produces internally consistent but incorrect explanations that mislead models into false confidence, with implications for safe LLM deployment.
🧠 Llama
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce IDEA, a framework that converts Large Language Model decision-making into interpretable, editable parametric models with calibrated probabilities. The approach outperforms major LLMs like GPT-5.2 and DeepSeek R1 on benchmarks while enabling direct expert knowledge integration and precise human-AI collaboration.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers applied Signal Detection Theory to analyze three large language models across 168,000 trials, finding that temperature parameter changes both sensitivity and response bias simultaneously. The study reveals that traditional calibration metrics miss important diagnostic information that SDT's full parametric framework can provide.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed Variational Mixture-of-Experts Routing (VMoER), a Bayesian framework that enables uncertainty quantification in large-scale AI models while adding less than 1% computational overhead. The method improves routing stability by 38%, reduces calibration error by 94%, and increases out-of-distribution detection by 12%.
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers propose a three-stage pipeline to train Large Language Models to efficiently provide calibrated uncertainty estimates for their responses. The method uses entropy-based scoring, Platt scaling calibration, and reinforcement learning to enable models to reason about uncertainty without computationally expensive post-hoc methods.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers propose Supervised Calibration (SC), a new framework to improve In-Context Learning performance in Large Language Models by addressing systematic biases through optimal affine transformations in logit space. The method achieves state-of-the-art results across multiple LLMs including Mistral-7B, Llama-2-7B, and Qwen2-7B in few-shot learning scenarios.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠Researchers developed GLEAN, a new AI verification framework that improves reliability of LLM-powered agents in high-stakes decisions like clinical diagnosis. The system uses expert guidelines and Bayesian logistic regression to better verify AI agent decisions, showing 12% improvement in accuracy and 50% better calibration in medical diagnosis tests.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 5
🧠Researchers have identified flaws in existing test-time guidance methods for diffusion models that prevent proper Bayesian posterior sampling. They propose new estimators that enable calibrated inference, significantly outperforming previous methods on Bayesian tasks and matching state-of-the-art results in black hole image reconstruction.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers demonstrate that calibration—aligning model confidence with actual accuracy—behaves differently in mixture-of-experts (MoE) models depending on routing mechanisms. While expert-level calibration suffices for hard-routed models under distribution shift, soft-routed models require additional adversarial reweighting techniques to maintain both accuracy and calibration reliability.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers introduce ACUTE, a protocol that uses language model activations to improve confidence calibration and trustworthiness across multiple LLM tasks. The approach balances calibration accuracy with informativeness through a new EURO metric, addressing the persistent problem of overconfident AI systems.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠SafeECGMatch introduces a calibration-aware semi-supervised learning framework for ECG classification that addresses the critical challenge of handling out-of-distribution anomalies in unlabeled medical data. Using dual-branch time-frequency architecture with adaptive confidence calibration, the method achieves state-of-the-art accuracy while maintaining reliable OOD rejection, advancing trustworthy AI deployment in clinical diagnostics.
AI · Neutral · arXiv – CS AI · Jun 5 · 5/10
🧠Researchers propose FRAP (Fused Reference Alignment Prediction), a method that combines a foundation model with a domain-specific base model to improve performance estimation when AI models encounter distribution shifts. By aligning and fusing predictions from both models through calibration, FRAP provides more reliable performance indicators without ground-truth labels.
AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠Researchers introduce Adaptive Calibration (AC), a novel technique that improves facial recognition systems by mapping cosine similarity to well-calibrated probabilities while accounting for regional variations in embedding space. The method achieves better accuracy and fairness metrics without requiring demographic metadata, addressing a fundamental limitation where identical distances can represent different match probabilities across different regions.
🏢 Meta
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers propose a framework for incorporating Large Language Model (LLM) priors into multi-objective Bayesian optimization while maintaining robustness against miscalibrated advice. Using an objective-wise reputation mechanism and counterfactual gating, the approach dynamically adjusts trust in LLM suggestions based on observed performance rather than accepting them blindly, with empirical validation across molecular optimization tasks.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers discovered that large language models fail to refuse harmful requests in low-resource languages not because they lack the underlying safety representations, but because they cannot properly calibrate their safety decisions across languages. A recalibration approach using minimal target-language examples substantially improves refusal rates, suggesting safety alignment failures stem from decision calibration rather than representation gaps.
🧠 Llama
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers formalize calibration concepts for probabilistic label ranking, revealing that popular models often fail to align predicted probabilities with actual outcome frequencies. The framework uncovers a gap between sub-ranking and top-k calibration metrics, with implications for RLHF reward models used in AI systems.
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers propose Frequency-aware Gradient Rectification (FGR), a training framework that improves neural network calibration under distribution shifts without requiring access to target domains. The method uses low-pass filtering to reduce spurious patterns while maintaining in-distribution performance through geometric constraint projection.
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers introduce SCOPE, a framework that improves LLM-based pairwise evaluation by calibrating confidence thresholds to control error rates. Combined with a new uncertainty metric called Bidirectional Preference Entropy (BPE), the approach achieves reliable judgment quality while accepting significantly more evaluations than existing methods.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce CalArena, a large-scale benchmark for evaluating post-hoc calibration methods in machine learning, covering nearly 2000 experiments across diverse tasks and model types. The study reveals that smooth calibration functions significantly outperform binning-based approaches, and provides open-source implementations to standardize calibration research.
AI · Bullish · arXiv – CS AI · May 29 · 6/10
🧠Researchers evaluated the calibration properties of five recent time series foundation models and found they maintain better confidence alignment than traditional deep learning approaches. Unlike typical neural networks that exhibit overconfidence, these foundation models demonstrate reliable uncertainty quantification across various forecasting scenarios, which is critical for real-world deployment in financial and operational decision-making.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce Rulers, a three-stage framework that improves how large language models evaluate text against human rubrics by converting qualitative criteria into locked specifications, structured checklists with evidence grounding, and calibrated score interpretation. The approach addresses three key failure modes in LLM-based scoring and demonstrates stronger alignment with human scoring across multiple benchmarks in essay evaluation, summarization, and writing assessment.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers propose BT-sigma, a novel method for aggregating Large Language Model judgments in comparative evaluations that accounts for varying judge reliability without requiring human supervision. The approach significantly improves ranking accuracy compared to traditional averaging methods by modeling each LLM's discriminative capability as an unsupervised calibration mechanism.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers provide the first rigorous theoretical analysis of temperature scaling, a widely used technique for controlling uncertainty in machine learning models. The study reveals that while temperature scaling reliably increases entropy in classifiers, it does not necessarily increase diversity in large language models as commonly claimed, and establishes temperature scaling as the unique linear calibration method that preserves hard predictions.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers have introduced the concept of 'innovation' as a fundamental property that characterizes hallucination in large language models, showing it serves as an almost-complete mathematical characterization of when LLMs produce false information. The work extends prior research by Kalai and Vempala, establishing that innovation—the tendency to generate outputs outside training data—inevitably leads to hallucination with high probability, providing new theoretical bounds on hallucination rates.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce MiRD, a two-stage framework that improves reliable prediction for open-ended question answering by separately addressing sampling failures and selection errors. The approach maintains calibration-set integrity while controlling hallucinations in AI models, outperforming existing conformal prediction methods across multiple datasets and models.