y0news

#confidence-estimation News & Analysis

9 articles tagged with #confidence-estimation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

Researchers propose COSE, a self-evolution framework for large language models that uses confidence signals to filter noisy self-generated training feedback without external verifiers. The method demonstrates consistent improvements across 19 benchmarks and multiple model sizes (0.6B–4B parameters), achieving state-of-the-art performance in reasoning and mathematics tasks.

🧠 Llama
AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Uncertainty Estimation using Variance-Gated Distributions

Researchers propose a variance-gated framework for uncertainty quantification in neural networks that decomposes predictive uncertainty using signal-to-noise ratios rather than traditional additive methods. The approach scales predictions by confidence factors derived from ensembles and reveals potential diversity collapse in committee machines, advancing how machine learning models evaluate per-sample uncertainty for high-risk applications.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models

Researchers demonstrate that multilingual large language models encode shared confidence features that transfer across languages without retraining. A lightweight linear probe trained on English can predict answer correctness in unseen languages with zero-shot generalization, suggesting confidence estimation mechanisms are language-universal in LLMs.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

Researchers propose Sequential Bayesian Belief Tracking (SBBT), a framework for estimating the reliability of long reasoning chains in large language models before final answers are known. The study finds that probability calibration and ranking performance respond differently to various evidence types: scalar scores improve calibration metrics, while structural observations are needed for ranking tasks.
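Step-by-step belief tracking over a reasoning chain can be illustrated with a plain sequential Bayes update. This is a minimal sketch and not the SBBT method itself: the function name `track_belief`, the prior, and the per-step likelihood ratios are hypothetical. Each belief depends only on the steps seen so far, which is the reading of "prefix-safe" assumed here.

```python
import math

def track_belief(likelihood_ratios, prior=0.5):
    """Return P(chain is correct) after each reasoning step.

    likelihood_ratios[i] is a hypothetical evidence score for step i:
    P(observation | chain correct) / P(observation | chain incorrect).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    beliefs = []
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)  # Bayes' rule in log-odds form
        beliefs.append(1.0 / (1.0 + math.exp(-log_odds)))
    return beliefs
```

With a prior of 0.5, a supportive step (ratio 2.0) followed by an equally strong contrary step (ratio 0.5) moves the belief to about 0.67 and then back to about 0.5.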

AI · Neutral · arXiv – CS AI · May 12 · 6/10

The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs

Researchers introduce the Metacognitive Probe, a diagnostic tool measuring five dimensions of LLM confidence behavior, including calibration, epistemic vigilance, and reasoning validation. Testing on eight frontier models and 69 humans reveals significant within-model disparities (for example, Gemini 2.5 Flash scores 88 on confidence calibration but only 41 on difficulty prediction), suggesting that composite benchmarks mask pockets of overconfidence.

🧠 Gemini
AI · Neutral · arXiv – CS AI · May 12 · 6/10

A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

Researchers introduce Sem-ECE, a new framework for evaluating how well large language models calibrate their confidence in open-ended question answering tasks. The method samples multiple answers from LLMs, groups them semantically, and uses answer frequency distributions as confidence measures, outperforming existing evaluation approaches across major commercial models.
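The sample, group, and count pipeline described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Sem-ECE implementation: `frequency_confidence` and its `canonicalize` argument are hypothetical names, and simple string normalization stands in for whatever semantic grouping the paper uses. The second function is the standard binned expected calibration error.

```python
from collections import Counter

def frequency_confidence(samples, canonicalize=lambda s: s.strip().lower()):
    """Return (largest answer group, share of samples in that group).

    canonicalize is a stand-in for semantic grouping (an assumption).
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    groups = Counter(canonicalize(s) for s in samples)
    answer, count = groups.most_common(1)[0]
    return answer, count / len(samples)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

For example, four sampled answers `["Paris", "paris ", "Lyon", "Paris"]` give the group `"paris"` a confidence of 0.75.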

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

Researchers propose a novel black-box confidence estimation method for chain-of-thought reasoning that measures trajectory convergence rather than relying on expensive sampling. Testing across multiple benchmarks and AI models shows significant improvements over self-consistency baselines while requiring only 4 samples instead of 8, with potential applications for safer API-based AI deployment.

🧠 GPT-5 🧠 Claude 🧠 Sonnet
AI · Bullish · arXiv – CS AI · May 7 · 6/10

CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

Researchers introduce CAR (Confidence-Aware Reranking), a training-free framework that improves document ranking in Retrieval-Augmented Generation systems by measuring how much each document increases the language model's confidence rather than just relevance. Testing across multiple datasets shows consistent improvements in ranking quality and downstream generation performance.
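Ranking documents by confidence gain rather than relevance could look like the sketch below. It is not the CAR implementation: `rerank_by_confidence_gain` and `confidence_fn` are hypothetical names, and how confidence is actually measured (for instance from answer token probabilities) is left to the caller as an assumption.

```python
def rerank_by_confidence_gain(query, documents, confidence_fn):
    """Order documents by how much each one raises answer confidence.

    confidence_fn(query, document_or_None) -> float is a hypothetical
    callable, e.g. one wrapping a model's answer probability.
    """
    baseline = confidence_fn(query, None)  # confidence with no context
    gains = [(confidence_fn(query, doc) - baseline, doc) for doc in documents]
    # sorted() is stable, so documents with equal gain keep their input order
    ranked = sorted(gains, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked]
```

A document that lowers confidence below the no-context baseline gets a negative gain and sinks to the bottom, which a pure relevance score would not capture.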

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

Calibrating Verbalized Confidence with Self-Generated Distractors

Researchers introduce DINCO (Distractor-Normalized Coherence), a method to improve confidence calibration in large language models by using self-generated alternative claims to reduce overconfidence bias. The approach addresses LLM suggestibility issues that cause models to express high confidence on low-accuracy outputs, potentially improving AI safety and trustworthiness.
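One simple way to picture normalizing a claim's confidence against self-generated distractors is shown below. This is a simplified sketch and not DINCO itself: the summary does not describe how the paper weights distractors, so plain normalization by the total is an assumption, as are the name `normalized_confidence` and the premise that the distractors are mutually exclusive with the claim.

```python
def normalized_confidence(claim_confidence, distractor_confidences):
    """Rescale a claim's verbalized confidence against its distractors.

    If the confidences stated for the claim and for its mutually exclusive
    alternatives add up to more than 1, they cannot all be right, so the
    claim's share of the total is used instead of its raw value.
    """
    values = [claim_confidence, *distractor_confidences]
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("confidences must lie in [0, 1]")
    total = sum(values)
    return claim_confidence / max(1.0, total)
```

A model that states 0.9 for its answer but also 0.8 and 0.3 for two incompatible alternatives would be scaled down to about 0.45, while a claim whose alternatives receive little confidence is left unchanged.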