MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition
Researchers introduce MiRD, a two-stage framework that makes set-valued prediction for open-ended question answering more reliable by handling sampling failures and selection errors separately. The approach keeps the calibration set intact while controlling hallucination risk, and it outperforms existing conformal prediction methods across multiple datasets and models.
MiRD addresses a fundamental limitation in current approaches to reducing hallucinations in large language models tasked with open-ended question answering. Traditional conformal prediction methods discard calibration examples when finite sampling fails to produce a valid answer, losing data that could improve reliability. Because the guarantee is then calibrated only on questions where sampling succeeded, it no longer covers the full range of requests a deployed system receives, which makes it a fragile foundation for production systems.
The two-stage decomposition strategy represents a meaningful advance in uncertainty quantification for AI systems. Stage I establishes theoretical bounds on sampling failure probability under fixed computational budgets, acknowledging that some requests may not yield admissible answers regardless of model capability. Stage II then applies conformal calibration to the remaining cases, using admission-correlated nonconformity scores that leverage the full calibration dataset. This decomposition preserves statistical rigor while maintaining practical applicability.
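The two stages can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the plain empirical failure rate used for Stage I, and the standard split-conformal quantile used for Stage II are all assumptions chosen for illustration. In particular, the paper's Stage I bound and its admission-correlated scores are not reproduced here, and no formal guarantee is claimed for this simplified procedure.

```python
import math

def sampling_failure_rate(admissible_flags):
    """Stage I (illustrative): share of calibration questions whose sampled
    candidates contain no admissible answer. Failed questions are counted,
    not discarded."""
    failures = sum(1 for flags in admissible_flags if not any(flags))
    return failures / len(admissible_flags)

def conformal_threshold(scores, alpha):
    """Standard split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th
    smallest score, or infinity when that rank exceeds n."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

def calibrate(candidate_scores, admissible_flags, alpha):
    """candidate_scores[i][j] is a hypothetical nonconformity score for
    candidate j of question i (lower means more trusted);
    admissible_flags[i][j] marks whether that candidate is admissible."""
    fail_rate = sampling_failure_rate(admissible_flags)
    # Stage II (illustrative): on questions where sampling succeeded, use the
    # smallest nonconformity score among the admissible candidates.
    success_scores = [
        min(s for s, ok in zip(scores, flags) if ok)
        for scores, flags in zip(candidate_scores, admissible_flags)
        if any(flags)
    ]
    threshold = conformal_threshold(success_scores, alpha)
    # Plug-in estimate of total miscoverage: failure + success * selection error.
    estimated_total = fail_rate + (1 - fail_rate) * alpha
    return fail_rate, threshold, estimated_total

def predict_set(candidates, scores, threshold):
    """Keep every sampled candidate whose score is within the threshold."""
    return [c for c, s in zip(candidates, scores) if s <= threshold]
```

At test time, `predict_set` would be called with the candidates sampled for a new question and the threshold returned by `calibrate`. The reported total is only a plug-in estimate that combines the two stages; it shows how the two error sources add up rather than how the paper certifies them.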
For organizations deploying large language models in critical applications such as customer support, medical information retrieval, and legal document analysis, this framework offers quantifiable risk control. Experiments across eight models and three datasets suggest the approach generalizes. Because its bounds are tighter than PAC-style alternatives, practitioners can meet a target coverage level with fewer unnecessary abstentions, which improves system utility.
The significance lies in bridging the gap between theoretical guarantees and practical model deployment. As enterprises increasingly integrate LLMs into production systems, confidence in failure detection and controlled miscoverage becomes commercially critical. MiRD's preservation of calibration-set integrity also enables more efficient use of expensive human annotations required for validation datasets, reducing operational costs while improving reliability.
- MiRD decomposes miscoverage into sampling failure and selection failure, enabling more nuanced risk control than existing conformal methods.
- The framework preserves full calibration-set integrity by not discarding examples where sampling fails, improving statistical efficiency.
- Stage I provides expectation-level marginal bounds on sampling failure probability, while Stage II applies adaptive conformal calibration conditioned on sampling success.
- Experimental validation across eight models and three datasets shows MiRD achieves tighter bounds and more adaptive prediction sets than baseline approaches.
- The approach directly addresses hallucination mitigation in open-ended QA, a critical reliability challenge for production language model deployments.
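
The decomposition named in the first takeaway can be written out with the law of total probability. The notation is an assumption made for illustration and may differ from the paper's: $F$ is the event that no sampled candidate is admissible, $M$ is the event that the prediction set contains no admissible answer, and the prediction set is assumed to be a subset of the sampled candidates, so that $F$ implies $M$. The bounds $\delta$ and $\alpha$ stand in for whatever Stage I and Stage II certify.

```latex
\begin{align*}
\Pr(M) &= \Pr(F) + \Pr(\lnot F)\,\Pr(M \mid \lnot F) \\
       &\le \delta + (1 - \delta)\,\alpha
        \quad \text{if } \Pr(F) \le \delta \text{ and } \Pr(M \mid \lnot F) \le \alpha \\
       &\le \delta + \alpha .
\end{align*}
```

The second line holds because $p + (1 - p)\,\alpha$ is nondecreasing in $p$ whenever $\alpha \le 1$, so it can be evaluated at the largest allowed failure probability $\delta$.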