#confidence-estimation News & Analysis

14 articles tagged with #confidence-estimation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

Researchers propose COSE, a self-evolution framework for large language models that uses confidence signals to filter noisy self-generated training feedback without external verifiers. The method demonstrates consistent improvements across 19 benchmarks and multiple model sizes (0.6B–4B parameters), achieving state-of-the-art performance in reasoning and mathematics tasks.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models

Researchers propose a comprehensive uncertainty quantification (UQ) framework for large language models, breaking down sources of error into input-level, parameter-level, token-level, and decoding-process components. Testing 21 UQ methods across Qwen3, Llama 3.2, and DeepSeek-V3 reveals that consensus-based approaches consistently outperform alternatives, while larger models exhibit lower uncertainty estimates according to an empirical scaling law.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Latent Confidence Alignment for LLM Self-Assessment

Researchers propose Latent Confidence Alignment Error (LCAE), a new framework for evaluating how well large language models assess their own reliability by accounting for item difficulty and model ability. Testing on 20 medical-domain models shows the approach improves self-assessment quality without degrading performance, revealing a correlation between model reliability and computational inference costs.

AI · Neutral · arXiv – CS AI · Jun 19 · 5/10

Confidence-Aware Automated Assessment of Student-Drawn Scientific Models

Researchers developed an automated Vision Transformer-based system to score student-drawn scientific models, addressing the costly manual assessment burden in science education. The confidence-aware framework selectively automates scoring of high-confidence submissions while deferring uncertain cases to human reviewers, demonstrating improved reliability across NGSS-aligned assessments.
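
The selective-automation idea in this summary can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Prediction` type, the `route` function, and the 0.9 threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    submission_id: str   # hypothetical identifier for a student drawing
    score: int           # score proposed by the automated model
    confidence: float    # model confidence in [0, 1]


def route(predictions, threshold=0.9):
    """Split predictions into auto-scored and human-review lists.

    The threshold is an assumed value; in practice it would be tuned
    on held-out data to meet a target error rate.
    """
    auto, deferred = [], []
    for p in predictions:
        if p.confidence >= threshold:
            auto.append(p)       # confident enough to score automatically
        else:
            deferred.append(p)   # uncertain: send to a human reviewer
    return auto, deferred


preds = [
    Prediction("s1", 3, 0.97),
    Prediction("s2", 1, 0.55),
    Prediction("s3", 2, 0.90),
]
auto, deferred = route(preds)
print([p.submission_id for p in auto])      # submissions scored automatically
print([p.submission_id for p in deferred])  # submissions deferred to humans
```

Raising the threshold defers more submissions to reviewers and should lower the error rate on the automated portion, at the cost of more manual work.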

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Margin-Adaptive Confidence Ranking for Reliable LLM Judgement

Researchers address a flaw in how LLM confidence estimates are used to predict human-AI agreement, developing a learned confidence estimator with theoretical generalization guarantees. The approach improves on prior methods that assume confidence varies monotonically with disagreement risk, offering practical benefits for aligning AI systems with human judgment.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Researchers introduce Stepwise Confidence Attribution (SCA), a framework for diagnosing where large language models fail in multi-step reasoning tasks without requiring access to the model's internal parameters. The method identifies problematic reasoning steps by measuring confidence alignment with consensus patterns across correct solutions, improving self-correction accuracy by up to 13.5%.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Uncertainty Estimation using Variance-Gated Distributions

Researchers propose a variance-gated framework for uncertainty quantification in neural networks that decomposes predictive uncertainty using signal-to-noise ratios rather than traditional additive methods. The approach scales predictions by confidence factors derived from ensembles and reveals potential diversity collapse in committee machines, advancing how machine learning models evaluate per-sample uncertainty for high-risk applications.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models

Researchers demonstrate that multilingual large language models encode shared confidence features that transfer across languages without retraining. A lightweight linear probe trained on English can predict answer correctness in unseen languages with zero-shot generalization, suggesting confidence estimation mechanisms are language-universal in LLMs.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

Researchers propose Sequential Bayesian Belief Tracking (SBBT), a framework for estimating the reliability of long reasoning chains in large language models before final answers are known. The study finds that probability calibration and ranking performance respond differently to various evidence types: scalar scores improve calibration metrics, while structural observations are needed for ranking tasks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs

Researchers introduce the Metacognitive Probe, a diagnostic tool measuring five dimensions of LLM confidence behavior including calibration, epistemic vigilance, and reasoning validation. Testing on eight frontier models and 69 humans reveals significant within-model disparities—exemplified by Gemini 2.5 Flash scoring 88 on confidence calibration but only 41 on difficulty prediction—suggesting composite benchmarks mask pockets of overconfidence.

🧠 Gemini

AI · Neutral · arXiv – CS AI · May 12 · 6/10

A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

Researchers introduce Sem-ECE, a new framework for evaluating how well large language models calibrate their confidence in open-ended question answering tasks. The method samples multiple answers from LLMs, groups them semantically, and uses answer frequency distributions as confidence measures, outperforming existing evaluation approaches across major commercial models.
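
As a rough illustration of the sample, group, and count idea described in this summary, the Python sketch below derives a confidence from answer frequencies and scores it with a standard binned expected calibration error. It is not the Sem-ECE implementation: the `normalize` function is a crude string-matching stand-in for semantic grouping, and all function names are hypothetical.

```python
from collections import Counter
import string


def normalize(answer: str) -> str:
    """Crude stand-in for semantic grouping: lowercase, drop punctuation."""
    cleaned = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def frequency_confidence(samples):
    """Return the most frequent answer group and its share of the samples."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    groups = Counter(normalize(s) for s in samples)
    top_answer, count = groups.most_common(1)[0]
    return top_answer, count / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; a confidence of exactly 0 goes in the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - conf)
    return total


answer, conf = frequency_confidence(["Paris", "paris.", "Lyon", " Paris "])
print(answer, conf)
```

A real pipeline would replace `normalize` with a semantic equivalence check, such as an entailment model or embedding clustering, since string matching misses paraphrases.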

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

Researchers propose a novel black-box confidence estimation method for chain-of-thought reasoning that measures trajectory convergence rather than relying on expensive sampling. Testing across multiple benchmarks and AI models shows significant improvements over self-consistency baselines while requiring only 4 samples instead of 8, with potential applications for safer API-based AI deployment.

🧠 GPT-5 🧠 Claude 🧠 Sonnet

AI · Bullish · arXiv – CS AI · May 7 · 6/10

CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

Researchers introduce CAR (Confidence-Aware Reranking), a training-free framework that improves document ranking in Retrieval-Augmented Generation systems by measuring how much each document increases the language model's confidence rather than just relevance. Testing across multiple datasets shows consistent improvements in ranking quality and downstream generation performance.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

Calibrating Verbalized Confidence with Self-Generated Distractors

Researchers introduce DINCO (Distractor-Normalized Coherence), a method to improve confidence calibration in large language models by using self-generated alternative claims to reduce overconfidence bias. The approach addresses LLM suggestibility issues that cause models to express high confidence on low-accuracy outputs, potentially improving AI safety and trustworthiness.