9 articles tagged with #calibration. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠 Researchers introduce IDEA, a framework that converts Large Language Model decision-making into interpretable, editable parametric models with calibrated probabilities. The approach outperforms major LLMs like GPT-5.2 and DeepSeek R1 on benchmarks while enabling direct expert knowledge integration and precise human-AI collaboration.
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers applied Signal Detection Theory to analyze three large language models across 168,000 trials, finding that the temperature parameter changes both sensitivity and response bias simultaneously. The study reveals that traditional calibration metrics miss important diagnostic information that SDT's full parametric framework can provide.
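The SDT decomposition the study relies on separates discriminability from bias: two systems with the same hit rate can differ in both. A minimal sketch using the standard equal-variance SDT formulas (not code from the paper):

```python
from statistics import NormalDist

def sdt_measures(hit_rate: float, fa_rate: float) -> tuple[float, float]:
    """Compute SDT sensitivity (d') and response bias (criterion c)
    from hit and false-alarm rates under the equal-variance model."""
    z = NormalDist().inv_cdf  # probit transform
    d_prime = z(hit_rate) - z(fa_rate)        # discriminability
    c = -0.5 * (z(hit_rate) + z(fa_rate))     # bias; ~0 means neutral
    return d_prime, c

# Two hypothetical settings with the same hit rate but different
# false-alarm rates differ in both d' and c, which a single
# calibration score would conflate.
symmetric = sdt_measures(0.80, 0.20)
liberal = sdt_measures(0.80, 0.40)
```

A single scalar metric collapses these two axes; the paper's point is that the full (d', c) pair carries the diagnostic information.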
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠 Researchers have developed Variational Mixture-of-Experts Routing (VMoER), a Bayesian framework that enables uncertainty quantification in large-scale AI models while adding less than 1% computational overhead. The method improves routing stability by 38%, reduces calibration error by 94%, and increases out-of-distribution detection by 12%.
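The summary does not say which calibration metric the 94% reduction refers to; a common choice in this literature is Expected Calibration Error (ECE). A minimal stdlib sketch of how ECE is typically computed (illustrative, not the paper's exact metric):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap
    |accuracy - mean confidence| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # equal-width bins
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# Ten predictions at 90% confidence, 9 of them correct: well calibrated.
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
# Same confidences, only 5 correct: overconfident by 0.4.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

Lower is better, so "reduces calibration error by 94%" would mean the gap between stated confidence and realized accuracy shrinks to a small fraction of the baseline's.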
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠 Researchers propose a three-stage pipeline to train Large Language Models to efficiently provide calibrated uncertainty estimates for their responses. The method uses entropy-based scoring, Platt scaling calibration, and reinforcement learning to enable models to reason about uncertainty without computationally expensive post-hoc methods.
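Platt scaling, the calibration step named in the pipeline, fits a logistic map from raw scores to probabilities. A minimal gradient-descent sketch (the original Platt method uses a Newton-style fit with smoothed targets; everything here is an illustrative stand-in for the paper's implementation):

```python
import math

def platt_fit(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by minimizing log loss
    with plain batch gradient descent."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s / n   # dLoss/da for log loss
            grad_b += (p - y) / n       # dLoss/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def platt_apply(a, b, score):
    """Map a raw score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Toy usage: raw scores (e.g. entropy-derived) with binary correctness.
scores = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 1, 1]
a, b = platt_fit(scores, labels)
```

The fitted (a, b) pair is cheap to apply at inference time, which is consistent with the pipeline's goal of avoiding expensive post-hoc uncertainty methods.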
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠 Researchers propose Supervised Calibration (SC), a new framework to improve In-Context Learning performance in Large Language Models by addressing systematic biases through optimal affine transformations in logit space. The method achieves state-of-the-art results across multiple LLMs including Mistral-7B, Llama-2-7B, and Qwen2-7B in few-shot learning scenarios.
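An affine transformation in logit space reweights and shifts the model's logits before the softmax, which can correct systematic label biases. A minimal sketch with hand-set parameters (SC learns the optimal W and b; the values below are illustrative, not from the paper):

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def affine_calibrate(logits, W, b):
    """Apply logits' = W @ logits + b, then softmax."""
    transformed = [sum(w * x for w, x in zip(row, logits)) + bi
                   for row, bi in zip(W, b)]
    return softmax(transformed)

# Toy 3-class example: identity W with a negative bias on class 0
# counteracts a hypothetical systematic preference for that label.
logits = [2.0, 1.0, 0.5]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
raw = affine_calibrate(logits, identity, [0.0, 0.0, 0.0])
debiased = affine_calibrate(logits, identity, [-1.5, 0.0, 0.0])
```

With W fixed to the identity this reduces to a per-class bias correction; the full affine map also allows class-to-class interactions.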
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠 Researchers developed GLEAN, a new AI verification framework that improves the reliability of LLM-powered agents in high-stakes decisions like clinical diagnosis. The system uses expert guidelines and Bayesian logistic regression to verify AI agent decisions, showing a 12% improvement in accuracy and 50% better calibration in medical diagnosis tests.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10
🧠 Researchers have identified flaws in existing test-time guidance methods for diffusion models that prevent proper Bayesian posterior sampling. They propose new estimators that enable calibrated inference, significantly outperforming previous methods on Bayesian tasks and matching state-of-the-art results in black hole image reconstruction.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠 Researchers introduce SciPredict, a benchmark testing whether large language models can predict scientific experiment outcomes across physics, biology, and chemistry. The study reveals that while some frontier models marginally exceed human experts (~20% accuracy), they fundamentally fail to assess prediction reliability, suggesting superhuman performance in experimental science requires not just better predictions but better calibration awareness.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10
🧠 Researchers identified Self-Anchoring Calibration Drift (SACD), where large language models show systematic confidence changes when building on their own outputs in multi-turn conversations. Testing Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 revealed model-specific patterns, with Claude showing decreasing confidence and significant calibration errors, while GPT-5.2 exhibited opposite behavior in open-ended domains.