#calibration News & Analysis

28 articles tagged with #calibration. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

28 articles

AINeutralarXiv – CS AI · May 96/10

🧠

Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

Researchers propose a novel black-box confidence estimation method for chain-of-thought reasoning that measures trajectory convergence rather than relying on expensive sampling. Testing across multiple benchmarks and AI models shows significant improvements over self-consistency baselines while requiring only 4 samples instead of 8, with potential applications for safer API-based AI deployment.

🧠 GPT-5🧠 Claude🧠 Sonnet

AINeutralarXiv – CS AI · Apr 146/10

🧠

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Researchers introduce SciPredict, a benchmark testing whether large language models can predict scientific experiment outcomes across physics, biology, and chemistry. The study reveals that while some frontier models marginally exceed human experts (~20% accuracy), they fundamentally fail to assess prediction reliability, suggesting superhuman performance in experimental science requires not just better predictions but better calibration awareness.

AINeutralarXiv – CS AI · Mar 36/106

🧠

Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence

Researchers identified Self-Anchoring Calibration Drift (SACD), where large language models show systematic confidence changes when building on their own outputs in multi-turn conversations. Testing Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 revealed model-specific patterns, with Claude showing decreasing confidence and significant calibration errors, while GPT-5.2 exhibited opposite behavior in open-ended domains.

$NEAR

← PrevPage 2 of 2