A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
Researchers introduce Sem-ECE, a new framework for evaluating how well large language models calibrate their confidence in open-ended question answering tasks. The method samples multiple answers from LLMs, groups them semantically, and uses answer frequency distributions as confidence measures, outperforming existing evaluation approaches across major commercial models.
The reliability of large language models in high-stakes applications depends critically on calibration: whether a model's stated confidence actually reflects its accuracy. Current evaluation methods suffer from significant limitations: logit-based approaches require restricted output formats and internal model access; verbalized confidence is self-reported and often unreliable; and existing sampling methods lack theoretical grounding. This gap matters because open-ended QA is the most common real-world deployment scenario for LLMs, yet existing calibration metrics fail to handle it adequately.
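For context, expected calibration error (ECE), the quantity in the framework's name, is conventionally defined by binning predictions by stated confidence and averaging the per-bin gap between confidence and accuracy:

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{n}\,\bigl|\mathrm{acc}(S_b) - \mathrm{conf}(S_b)\bigr|$$

where the $n$ predictions are partitioned into $B$ bins $S_b$ by confidence, $\mathrm{acc}(S_b)$ is the empirical accuracy within bin $b$, and $\mathrm{conf}(S_b)$ is the bin's mean confidence. This is the textbook definition rather than the paper's exact formulation; Sem-ECE evidently evaluates this quantity with semantic-frequency confidences standing in for logits or verbalized scores.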
Sem-ECE addresses this by sampling multiple answers per question, grouping semantically equivalent ones, and treating each group's empirical frequency as a principled confidence measure. The framework introduces two estimators with different trade-offs: Sem₁-ECE uses the same sample for answer selection and confidence evaluation, while Sem₂-ECE separates these steps to reduce selection bias (see the sketch below). The researchers prove both are asymptotically unbiased and show that their divergence on harder questions provides a diagnostic signal for question difficulty.
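The paper's exact estimators are not reproduced here; the minimal Python sketch below illustrates how the two could plausibly work, assuming that answers are clustered by a pairwise semantic-equivalence check (the `equivalent` predicate is a placeholder for, e.g., an NLI- or embedding-based judgment) and that Sem₂'s bias reduction comes from splitting the sample. All names are hypothetical.

```python
def semantic_confidence(samples, equivalent):
    """Cluster sampled answers into semantic groups and map each
    group's representative answer to its empirical frequency."""
    clusters = []
    for ans in samples:
        for cluster in clusters:
            if equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:  # no existing cluster matched: start a new one
            clusters.append([ans])
    return {c[0]: len(c) / len(samples) for c in clusters}


def sem1_estimate(samples, equivalent):
    """Sem1-style: pick the modal semantic cluster and report its
    frequency, reusing the SAME samples for selection and evaluation
    (prone to overestimating confidence in the selected answer)."""
    conf = semantic_confidence(samples, equivalent)
    answer = max(conf, key=conf.get)
    return answer, conf[answer]


def sem2_estimate(samples, equivalent):
    """Sem2-style: select the answer on one half of the samples and
    estimate its confidence on the held-out half, trading variance
    for reduced selection bias. Assumes len(samples) >= 2."""
    half = len(samples) // 2
    answer, _ = sem1_estimate(samples[:half], equivalent)
    held_out = semantic_confidence(samples[half:], equivalent)
    for representative, freq in held_out.items():
        if equivalent(representative, answer):
            return answer, freq
    return answer, 0.0  # selected answer absent from held-out half
```

On a large sample both estimates converge to the same cluster frequency, consistent with the asymptotic-unbiasedness result; on small samples, Sem₁'s reuse of the selection data inflates the winner's frequency, which is the bias Sem₂'s split is meant to remove.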
For LLM deployment in medicine, law, and other regulated domains, this framework enables better risk assessment without requiring access to model internals. The work validates the approach across three benchmarks and five leading commercial models, establishing a practical standard for calibration evaluation. The theoretical contributions (proving asymptotic unbiasedness and connecting estimator divergence to question difficulty) provide a foundation for future refinements.
Looking ahead, adoption of Sem-ECE could influence how organizations evaluate LLMs before deployment, and the metric could eventually become embedded in regulatory compliance processes. The framework's model-agnostic design keeps it applicable as new models emerge.
- Sem-ECE provides the first principled calibration evaluation method specifically designed for open-ended QA without requiring internal model access
- Two estimators within the framework offer different bias-variance trade-offs, with Sem₂-ECE achieving strictly lower calibration error on difficult questions
- The gap between the two estimators serves as a diagnostic for question difficulty, enabling better risk assessment in deployment (see the sketch after this list)
- Framework validation across five commercial LLMs and three benchmarks demonstrates practical effectiveness over existing verbalized confidence and sampling methods
- Model-agnostic design enables application across different LLM architectures without modification or internal probability access
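Reusing the sketch above, the estimator gap flagged in the takeaways can be turned into a simple difficulty signal. The simulation below is purely illustrative: the answer distributions, sample size, and exact-match equivalence are assumptions, not the paper's setup.

```python
import random

random.seed(0)
eq = lambda a, b: a == b  # exact match stands in for semantic equivalence


def draw(dist, k=40):
    """Simulate k sampled answers from a model's answer distribution."""
    answers, weights = zip(*dist.items())
    return random.choices(answers, weights=weights, k=k)


def avg_gap(dist, trials=200):
    """Average Sem1 - Sem2 confidence gap over repeated sampling."""
    total = 0.0
    for _ in range(trials):
        samples = draw(dist)
        _, c1 = sem1_estimate(samples, eq)
        _, c2 = sem2_estimate(samples, eq)
        total += c1 - c2
    return total / trials


easy = {"Paris": 0.9, "Lyon": 0.1}                        # dominant answer
hard = {"Paris": 0.3, "Lyon": 0.3, "Nice": 0.2, "Toulouse": 0.2}
print(avg_gap(easy))  # typically near zero: one cluster dominates
print(avg_gap(hard))  # typically larger: mass spread across clusters
```

Intuitively, when answer mass is spread across near-tied clusters, Sem₁'s winner-take-all selection rides the sampling noise upward while Sem₂'s held-out estimate does not, so a widening gap flags harder questions.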