arXiv – CS AI · 10h ago
🧠
A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
Researchers introduce Sem-ECE, a new framework for evaluating how well large language models calibrate their confidence on open-ended question answering tasks. The method samples multiple answers from an LLM, clusters them by semantic equivalence, and treats each answer cluster's relative frequency as a confidence estimate, outperforming existing calibration-evaluation approaches across major commercial models.
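The core idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes answers are already sampled, stands in a simple string normalization for the paper's semantic grouping (which presumably uses a learned equivalence model), and pairs frequency-derived confidences with correctness labels to compute a standard binned ECE.

```python
from collections import Counter

def normalize(ans: str) -> str:
    # Crude stand-in for semantic clustering: lowercase, strip punctuation.
    return "".join(c for c in ans.lower() if c.isalnum() or c.isspace()).strip()

def frequency_confidence(samples: list[str]) -> tuple[str, float]:
    """Group sampled answers and use the modal cluster's relative
    frequency as the model's confidence in its top answer."""
    clusters = Counter(normalize(s) for s in samples)
    top, count = clusters.most_common(1)[0]
    return top, count / len(samples)

def ece(pairs: list[tuple[float, int]], n_bins: int = 5) -> float:
    """Expected calibration error over (confidence, correct) pairs,
    using equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in pairs:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    total = len(pairs)
    return sum(
        len(b) / total
        * abs(sum(c for c, _ in b) / len(b) - sum(k for _, k in b) / len(b))
        for b in bins if b
    )

# Toy usage: five sampled answers to one open-ended question.
samples = ["Paris", "paris.", "Paris", "Lyon", "Paris"]
answer, conf = frequency_confidence(samples)  # ("paris", 0.8)
```

Across a benchmark, one would collect such (confidence, correctness) pairs per question and report their ECE; the paper's contribution is making the confidence estimate itself semantic rather than token-probability based.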