
#calibration News & Analysis

9 articles tagged with #calibration. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric Calibration

Researchers introduce IDEA, a framework that converts Large Language Model decision-making into interpretable, editable parametric models with calibrated probabilities. The approach outperforms major LLMs like GPT-5.2 and DeepSeek R1 on benchmarks while enabling direct expert knowledge integration and precise human-AI collaboration.

🧠 GPT-5
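For intuition, a minimal sketch of the verbal-to-numeric idea (the paper's actual parametric model isn't described in this summary): map an LLM's verbal confidence phrases to rough numeric scores, then fit a logistic calibrator against observed correctness. The phrase table and function names are illustrative assumptions.

```python
# Hypothetical verbal-to-numeric calibration sketch; the phrase table
# and the logistic form are assumptions, not the IDEA paper's model.
import numpy as np
from sklearn.linear_model import LogisticRegression

VERBAL_TO_RAW = {
    "almost certain": 0.95, "very likely": 0.85, "likely": 0.70,
    "unsure": 0.50, "unlikely": 0.30, "very unlikely": 0.15,
}

def fit_calibrator(phrases, outcomes):
    """Fit a logistic map from raw verbal scores to calibrated
    probabilities, using observed decision correctness (1 = correct)."""
    raw = np.array([[VERBAL_TO_RAW[p]] for p in phrases])
    model = LogisticRegression().fit(raw, outcomes)
    return lambda phrase: model.predict_proba(
        [[VERBAL_TO_RAW[phrase]]])[0, 1]

# Usage: fit on logged decisions, then query calibrated probabilities.
cal = fit_calibrator(
    ["likely", "very likely", "unsure", "likely", "very unlikely", "unsure"],
    [1, 1, 0, 1, 0, 0],
)
print(round(cal("likely"), 3))
```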
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy

Researchers applied Signal Detection Theory to analyze three large language models across 168,000 trials, finding that the temperature parameter simultaneously shifts both sensitivity and response bias. The study reveals that traditional calibration metrics miss diagnostic information that SDT's full parametric framework captures.
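The SDT quantities in play are standard: sensitivity d′ and criterion c, computed from hit and false-alarm rates via the inverse normal CDF. A minimal sketch (not the study's analysis code):

```python
# Signal Detection Theory measures: d' (sensitivity) and c (criterion).
from scipy.stats import norm

def sdt_measures(hits, misses, false_alarms, correct_rejections):
    h = hits / (hits + misses)                               # hit rate
    f = false_alarms / (false_alarms + correct_rejections)   # false-alarm rate
    d_prime = norm.ppf(h) - norm.ppf(f)                      # sensitivity
    criterion = -0.5 * (norm.ppf(h) + norm.ppf(f))           # response bias
    return d_prime, criterion

# E.g. 80 hits / 20 misses and 30 false alarms / 70 correct rejections:
print(sdt_measures(80, 20, 30, 70))   # d' ≈ 1.37, c ≈ -0.16
```

Because d′ and c are separate parameters, a temperature change that moves both shows up as two distinct shifts here, where a single calibration score would blur them together.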

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers

Researchers have developed Variational Mixture-of-Experts Routing (VMoER), a Bayesian framework that enables uncertainty quantification in large-scale AI models while adding less than 1% computational overhead. The method improves routing stability by 38%, reduces calibration error by 94%, and increases out-of-distribution detection by 12%.
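For a sense of the mechanics, a hedged sketch of a variational router: a Gaussian posterior over routing logits with a reparameterized sample and a KL penalty toward a standard normal. Layer sizes, the prior, and the exact VMoER parameterization here are assumptions, not details from the paper.

```python
# Sketch of a variational MoE router (assumed parameterization).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalRouter(nn.Module):
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.mu = nn.Linear(d_model, n_experts)       # posterior mean
        self.log_var = nn.Linear(d_model, n_experts)  # posterior log-variance

    def forward(self, x):
        mu, log_var = self.mu(x), self.log_var(x)
        std = torch.exp(0.5 * log_var)
        logits = mu + std * torch.randn_like(std)     # reparameterization trick
        # KL(q || N(0, I)) regularizes routing uncertainty.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1).mean()
        return F.softmax(logits, dim=-1), kl

router = VariationalRouter(d_model=512, n_experts=8)
weights, kl = router(torch.randn(4, 512))
print(weights.shape, float(kl))   # routing weights per token, KL penalty
```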

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

From Entropy to Calibrated Uncertainty: Training Language Models to Reason About Uncertainty

Researchers propose a three-stage pipeline to train Large Language Models to efficiently provide calibrated uncertainty estimates for their responses. The method uses entropy-based scoring, Platt scaling calibration, and reinforcement learning to enable models to reason about uncertainty without computationally expensive post-hoc methods.
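The Platt-scaling stage is a standard technique: fit a logistic map from an uncertainty score to the probability that the answer is correct. A minimal sketch, assuming predictive entropy as the input feature:

```python
# Platt scaling over an entropy-based score (feature choice assumed).
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(entropy_scores, correctness):
    """entropy_scores: per-response predictive entropy;
    correctness: 1 if the response was right, else 0."""
    X = np.asarray(entropy_scores, dtype=float).reshape(-1, 1)
    model = LogisticRegression().fit(X, correctness)
    return lambda s: model.predict_proba([[s]])[0, 1]

conf = platt_scale([0.2, 1.5, 0.4, 2.1, 0.3, 1.8], [1, 0, 1, 0, 1, 0])
print(round(conf(0.5), 3))   # calibrated confidence for a new response
```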

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Boosting In-Context Learning in LLMs Through the Lens of Classical Supervised Learning

Researchers propose Supervised Calibration (SC), a new framework to improve In-Context Learning performance in Large Language Models by addressing systematic biases through optimal affine transformations in logit space. The method achieves state-of-the-art results across multiple LLMs including Mistral-7B, Llama-2-7B, and Qwen2-7B in few-shot learning scenarios.

🧠 Llama
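The core operation, an affine correction in logit space, fits in a few lines: learn W and b on a handful of labeled examples so the corrected logits minimize cross-entropy. Shapes, optimizer, and step count below are illustrative assumptions, not SC's published configuration.

```python
# Affine logit-space calibration sketch (hyperparameters assumed).
import torch
import torch.nn.functional as F

def fit_affine(logits, labels, n_classes, steps=200, lr=0.1):
    W = torch.eye(n_classes, requires_grad=True)    # start at identity
    b = torch.zeros(n_classes, requires_grad=True)  # start with no shift
    opt = torch.optim.Adam([W, b], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits @ W.T + b, labels)
        loss.backward()
        opt.step()
    return W.detach(), b.detach()

# Usage: raw ICL logits for a 3-way task plus a few gold labels.
logits = torch.randn(16, 3)
labels = torch.randint(0, 3, (16,))
W, b = fit_affine(logits, labels, n_classes=3)
corrected = logits @ W.T + b   # debiased logits for prediction
```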
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10

Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

Researchers developed GLEAN, a verification framework that improves the reliability of LLM-powered agents in high-stakes decisions such as clinical diagnosis. The system uses expert guidelines and Bayesian logistic regression to verify agent decisions, showing a 12% improvement in accuracy and 50% better calibration on medical diagnosis tests.
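A hedged sketch of the evidence-accumulation idea: turn binary guideline checks into a verification probability with Bayesian logistic regression, approximated here by its MAP equivalent (L2-regularized logistic regression under a Gaussian prior). The guideline features and data are hypothetical.

```python
# Guideline-grounded verification sketch; features are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: did the agent's decision satisfy guideline check i?
evidence = np.array([
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
])
verified = np.array([1, 1, 0, 1, 0, 1])   # expert-confirmed outcomes

# C = 1 / prior precision: smaller C pulls weights toward the prior mean.
model = LogisticRegression(C=1.0).fit(evidence, verified)
p_ok = model.predict_proba([[1, 0, 1, 1]])[0, 1]
print(f"P(decision verified) = {p_ok:.2f}")
```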

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10

Calibrated Test-Time Guidance for Bayesian Inference

Researchers have identified flaws in existing test-time guidance methods for diffusion models that prevent proper Bayesian posterior sampling. They propose new estimators that enable calibrated inference, significantly outperforming previous methods on Bayesian tasks and matching state-of-the-art results in black hole image reconstruction.
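Whether an estimator yields calibrated inference is typically judged on the posterior samples themselves. One standard diagnostic (not the paper's estimator, which this summary doesn't detail) is credible-interval coverage: q% intervals should contain the true parameter about q% of the time.

```python
# Credible-interval coverage check on a conjugate-Gaussian toy problem.
import numpy as np

def coverage(true_values, posterior_samples, q=0.9):
    """posterior_samples: (n_problems, n_samples) draws per problem."""
    lo = np.quantile(posterior_samples, (1 - q) / 2, axis=1)
    hi = np.quantile(posterior_samples, (1 + q) / 2, axis=1)
    return np.mean((true_values >= lo) & (true_values <= hi))

rng = np.random.default_rng(0)
theta = rng.normal(size=200)            # truth drawn from the N(0,1) prior
y = theta + rng.normal(size=200)        # one unit-noise observation each
# Exact posterior is N(y/2, 1/2); draw calibrated samples from it.
samples = y[:, None] / 2 + np.sqrt(0.5) * rng.normal(size=(200, 500))
print(coverage(theta, samples, q=0.9))  # ≈ 0.9 when sampling is calibrated
```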

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Researchers introduce SciPredict, a benchmark testing whether large language models can predict scientific experiment outcomes across physics, biology, and chemistry. The study reveals that while some frontier models marginally exceed human experts (~20% accuracy), they fundamentally fail to assess prediction reliability, suggesting superhuman performance in experimental science requires not just better predictions but better calibration awareness.
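That calibration-awareness gap is usually quantified with expected calibration error (ECE): bin predictions by stated confidence and compare each bin's confidence to its accuracy. A minimal sketch (the benchmark's exact metric is not specified in this summary):

```python
# Expected calibration error over confidence bins.
import numpy as np

def ece(confidences, correct, n_bins=10):
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap    # weight by bin population
    return total

# A model that is 20% accurate while claiming 70% confidence scores badly:
print(ece([0.7] * 10, [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]))   # 0.5
```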

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence

Researchers identified Self-Anchoring Calibration Drift (SACD), where large language models show systematic confidence changes when building on their own outputs in multi-turn conversations. Testing Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 revealed model-specific patterns, with Claude showing decreasing confidence and significant calibration errors, while GPT-5.2 exhibited opposite behavior in open-ended domains.
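A hedged sketch of how a drift like SACD could be measured: log the model's stated confidence and correctness at each conversational turn, then track mean confidence and the confidence-accuracy gap by turn index. The record format is hypothetical, not the paper's protocol.

```python
# Per-turn confidence drift summary over hypothetical conversation logs.
from collections import defaultdict

def drift_by_turn(records):
    """records: iterable of (turn, confidence, correct) tuples."""
    by_turn = defaultdict(list)
    for turn, conf, correct in records:
        by_turn[turn].append((conf, correct))
    summary = {}
    for turn, rows in sorted(by_turn.items()):
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(k for _, k in rows) / len(rows)
        summary[turn] = {"confidence": mean_conf, "gap": mean_conf - accuracy}
    return summary

logs = [(1, 0.9, 1), (1, 0.8, 1), (2, 0.7, 1), (2, 0.6, 0), (3, 0.5, 1)]
print(drift_by_turn(logs))   # falling confidence across turns = drift signature
```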
