AI · Neutral · Importance 7/10
Closing the Confidence-Faithfulness Gap in Large Language Models
AI Summary
Researchers have identified a fundamental issue in large language models: verbalized confidence scores fail to track actual accuracy because the model encodes the two signals along orthogonal internal directions. They also document a 'Reasoning Contamination Effect', in which reasoning through a problem while verbalizing confidence disrupts calibration, and they develop a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate to improve alignment.
Key Takeaways
- Large language models encode calibration and verbalized confidence signals linearly, but along directions orthogonal to each other.
- The 'Reasoning Contamination Effect' occurs when models reason through problems while simultaneously verbalizing confidence, worsening miscalibration.
- The findings were consistent across three open-weight models and four datasets.
- A two-stage adaptive steering pipeline can substantially improve calibration by reading the model's internal accuracy estimate (see the sketch after this list).
- The geometric relationship between confidence and accuracy in LLMs, previously poorly understood, now shows clear mechanistic patterns.
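
The setup described above is easy to sketch. Below is a minimal, self-contained Python/NumPy illustration of the idea, not the authors' implementation: it fits linear probes for an "internal accuracy" direction and a "verbalized confidence" direction on synthetic stand-in hidden states, checks how orthogonal the two directions are, and then applies a simple two-stage adaptive steering step that first reads the internal accuracy estimate and then shifts the state along the confidence direction to close the gap. All names (`fit_linear_probe`, `adaptive_steer`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic stand-ins for residual-stream activations ----------------
# h: hidden states for n answers, d-dimensional. In the paper's setting
# these would come from an open-weight LLM; here they are random.
n, d = 500, 64
h = rng.normal(size=(n, d))

# Hypothetical ground truth: whether each answer was correct, and the
# confidence the model verbalized. Both are assumed (as in the paper's
# finding) to be linearly decodable from h, along two distinct directions.
w_acc_true = rng.normal(size=d); w_acc_true /= np.linalg.norm(w_acc_true)
w_conf_true = rng.normal(size=d); w_conf_true /= np.linalg.norm(w_conf_true)
correct = (h @ w_acc_true + 0.1 * rng.normal(size=n)) > 0
verbalized_conf = 1 / (1 + np.exp(-(h @ w_conf_true)))

# --- Fit linear probes (plain least-squares, for simplicity) ------------
def fit_linear_probe(X, y):
    """Return a unit-norm direction whose projection best predicts y."""
    w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
    return w / np.linalg.norm(w)

w_acc = fit_linear_probe(h, correct)           # internal-accuracy direction
w_conf = fit_linear_probe(h, verbalized_conf)  # verbalized-confidence direction

# Orthogonality check: cosine similarity near 0 would mirror the reported
# finding that the two signals live along orthogonal directions.
print(f"cos(accuracy dir, confidence dir) = {float(w_acc @ w_conf):.3f}")

# --- Two-stage adaptive steering (illustrative) --------------------------
def adaptive_steer(h_vec, alpha=1.0):
    """Stage 1: read the internal accuracy estimate via the probe.
    Stage 2: shift the state along the confidence direction so the
    verbalized confidence moves toward that internal estimate."""
    p_correct = 1 / (1 + np.exp(-(h_vec @ w_acc)))   # internal estimate
    p_verbal  = 1 / (1 + np.exp(-(h_vec @ w_conf)))  # what would be said
    gap = p_correct - p_verbal
    return h_vec + alpha * gap * w_conf              # steer along conf dir

h_steered = np.array([adaptive_steer(x) for x in h])
before = np.abs(correct - 1 / (1 + np.exp(-(h @ w_conf)))).mean()
after  = np.abs(correct - 1 / (1 + np.exp(-(h_steered @ w_conf)))).mean()
print(f"mean |accuracy - verbalized confidence|: before={before:.3f}, after={after:.3f}")
```

In a real model the steering vector would be added to activations at a chosen layer during generation; the "adaptive" part is that the intervention strength is scaled per example by the gap between the probed internal estimate and the confidence the model is about to verbalize.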
#large-language-models #ai-calibration #mechanistic-interpretability #confidence-scoring #model-steering #ai-research #llm-accuracy
Read Original via arXiv (cs.AI)