Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs
A research study demonstrates that fine-tuning language models with sycophantic reward signals degrades their calibration (the ability to accurately quantify uncertainty) even as performance metrics improve. Although the effect does not reach statistical significance in this experiment, the results show that reward-optimized models retain structured miscalibration even after post-hoc corrections, establishing a methodology for evaluating hidden degradation in fine-tuned systems.
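As context for the kind of degradation the study measures, here is a minimal sketch of one standard calibration metric, expected calibration error (ECE), which averages the gap between a model's stated confidence and its observed accuracy. The function name, binning scheme, and toy data are illustrative assumptions, not the study's exact evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the per-bin gap
    between mean confidence and mean accuracy, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Assign predictions to the bin (lo, hi]; the first bin also includes 0.
        mask = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            mask |= confidences == 0.0
        if mask.sum() == 0:
            continue
        avg_conf = confidences[mask].mean()
        avg_acc = correct[mask].mean()
        ece += (mask.sum() / len(confidences)) * abs(avg_conf - avg_acc)
    return ece

# A model that reports 90% confidence but is right only 60% of the time
# is miscalibrated even if its accuracy looks acceptable.
print(expected_calibration_error([0.9] * 5, [1, 1, 1, 0, 0]))  # ~0.3
```

A sycophancy-fine-tuned model can improve on accuracy-style metrics while its ECE worsens, which is exactly the hidden degradation the study's methodology is designed to surface.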