y0news

#calibration News & Analysis

27 articles tagged with #calibration. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy

Researchers applied Signal Detection Theory (SDT) to analyze three large language models across 168,000 trials, finding that changing the temperature parameter shifts both sensitivity and response bias simultaneously. The study reveals that traditional calibration metrics miss important diagnostic information that SDT's full parametric framework can provide.
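
For readers unfamiliar with the two SDT quantities named in this summary, here is a minimal sketch of the standard equal-variance Gaussian indices: sensitivity d' and criterion c. The function name, the log-linear correction, and the example counts are illustrative assumptions and are not taken from the paper.

```python
from statistics import NormalDist


def sdt_indices(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) under the equal-variance Gaussian SDT model."""
    # Log-linear correction (an assumption here) keeps both rates away from
    # 0 and 1, where the inverse normal CDF is undefined.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)

    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    d_prime = z(hit_rate) - z(fa_rate)            # sensitivity
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # response bias
    return d_prime, criterion


# Hypothetical counts: a symmetric responder has a criterion near zero.
d_prime, criterion = sdt_indices(80, 20, 20, 80)
print(f"d' = {d_prime:.3f}, c = {criterion:.3f}")
```

A positive criterion indicates a conservative responder (fewer "yes" answers overall), while d' measures how well signal is separated from noise regardless of that bias.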

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

From Entropy to Calibrated Uncertainty: Training Language Models to Reason About Uncertainty

Researchers propose a three-stage pipeline to train Large Language Models to efficiently provide calibrated uncertainty estimates for their responses. The method uses entropy-based scoring, Platt scaling calibration, and reinforcement learning to enable models to reason about uncertainty without computationally expensive post-hoc methods.
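
As background on the Platt scaling step mentioned in this summary, the sketch below fits a sigmoid `p = sigmoid(a * score + b)` to raw scores by gradient descent on the log loss. It is a plain logistic fit without the smoothed targets of Platt's original formulation, and the function names, scores, and labels are hypothetical, not the paper's pipeline.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit slope a and intercept b so that sigmoid(a*s + b) approximates P(y=1 | s)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical uncalibrated scores with binary correctness labels.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
print([round(p, 3) for p in calibrated])
```

Because the fitted map is monotone in the score, it changes the confidence values without changing how examples are ranked.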

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Boosting In-Context Learning in LLMs Through the Lens of Classical Supervised Learning

Researchers propose Supervised Calibration (SC), a new framework to improve In-Context Learning performance in Large Language Models by addressing systematic biases through optimal affine transformations in logit space. The method achieves state-of-the-art results across multiple LLMs including Mistral-7B, Llama-2-7B, and Qwen2-7B in few-shot learning scenarios.

🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

Researchers developed GLEAN, a new AI verification framework that improves the reliability of LLM-powered agents in high-stakes decisions such as clinical diagnosis. The system uses expert guidelines and Bayesian logistic regression to better verify AI agent decisions, showing a 12% improvement in accuracy and 50% better calibration in medical diagnosis tests.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 5

Calibrated Test-Time Guidance for Bayesian Inference

Researchers have identified flaws in existing test-time guidance methods for diffusion models that prevent proper Bayesian posterior sampling. They propose new estimators that enable calibrated inference, significantly outperforming previous methods on Bayesian tasks and matching state-of-the-art results in black hole image reconstruction.

AI · Neutral · arXiv – CS AI · 1d ago · 6/10

SafeECGMatch: Calibration-Aware Joint Frequency and Time Space Semi-Supervised Learning for Open-Set ECG Classification

SafeECGMatch introduces a calibration-aware semi-supervised learning framework for ECG classification that addresses the critical challenge of handling out-of-distribution anomalies in unlabeled medical data. Using dual-branch time-frequency architecture with adaptive confidence calibration, the method achieves state-of-the-art accuracy while maintaining reliable OOD rejection, advancing trustworthy AI deployment in clinical diagnostics.

AI · Neutral · arXiv – CS AI · 5d ago · 5/10

Bridging Domain Expertise and Generalization for Performance Estimation

Researchers propose FRAP (Fused Reference Alignment Prediction), a method that combines a foundation model with a domain-specific base model to improve performance estimation when AI models encounter distribution shifts. By aligning and fusing predictions from both models through calibration, FRAP provides more reliable performance indicators without ground-truth labels.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Adaptive Calibration for Fair and Performant Facial Recognition

Researchers introduce Adaptive Calibration (AC), a novel technique that improves facial recognition systems by mapping cosine similarity to well-calibrated probabilities while accounting for regional variations in embedding space. The method achieves better accuracy and fairness metrics without requiring demographic metadata, addressing a fundamental limitation where identical distances can represent different match probabilities across different regions.

🏢 Meta
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

Researchers propose a framework for incorporating Large Language Model (LLM) priors into multi-objective Bayesian optimization while maintaining robustness against miscalibrated advice. Using an objective-wise reputation mechanism and counterfactual gating, the approach dynamically adjusts trust in LLM suggestions based on observed performance rather than accepting them blindly, with empirical validation across molecular optimization tasks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

Researchers discovered that large language models fail to refuse harmful requests in low-resource languages not because they lack the underlying safety representations, but because they cannot properly calibrate their safety decisions across languages. A recalibration approach using minimal target-language examples substantially improves refusal rates, suggesting safety alignment failures stem from decision calibration rather than representation gaps.

🧠 Llama
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Calibrated Preference Learning: The Case of Label Ranking

Researchers formalize calibration concepts for probabilistic label ranking, revealing that popular models often fail to align predicted probabilities with actual outcome frequencies. The framework uncovers a gap between sub-ranking and top-k calibration metrics, with implications for RLHF reward models used in AI systems.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Target-Agnostic Calibration under Distribution Shift with Frequency-Aware Gradient Rectification

Researchers propose Frequency-aware Gradient Rectification (FGR), a training framework that improves neural network calibration under distribution shifts without requiring access to target domains. The method uses low-pass filtering to reduce spurious patterns while maintaining in-distribution performance through geometric constraint projection.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Researchers introduce SCOPE, a framework that improves LLM-based pairwise evaluation by calibrating confidence thresholds to control error rates. Combined with a new uncertainty metric called Bidirectional Preference Entropy (BPE), the approach achieves reliable judgment quality while accepting significantly more evaluations than existing methods.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

CalArena: A Large-Scale Post-Hoc Calibration Benchmark

Researchers introduce CalArena, a large-scale benchmark for evaluating post-hoc calibration methods in machine learning, covering nearly 2000 experiments across diverse tasks and model types. The study reveals that smooth calibration functions significantly outperform binning-based approaches, and provides open-source implementations to standardize calibration research.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?

Researchers evaluated the calibration properties of five recent time series foundation models and found they maintain better confidence alignment than traditional deep learning approaches. Unlike typical neural networks that exhibit overconfidence, these foundation models demonstrate reliable uncertainty quantification across various forecasting scenarios, which is critical for real-world deployment in financial and operational decision-making.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges

Researchers introduce Rulers, a three-stage framework that improves how large language models evaluate text against human rubrics by converting qualitative criteria into locked specifications, structured checklists with evidence grounding, and calibrated score interpretation. The approach addresses three key failure modes in LLM-based scoring and demonstrates stronger alignment with human scoring across multiple benchmarks in essay evaluation, summarization, and writing assessment.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Who can we trust? LLM-as-a-jury for Comparative Assessment

Researchers propose BT-sigma, a novel method for aggregating Large Language Model judgments in comparative evaluations that accounts for varying judge reliability without requiring human supervision. The approach significantly improves ranking accuracy compared to traditional averaging methods by modeling each LLM's discriminative capability as an unsupervised calibration mechanism.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

The Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling

Researchers provide the first rigorous theoretical analysis of temperature scaling, a widely-used technique for controlling uncertainty in machine learning models. The study reveals that while temperature scaling reliably increases entropy in classifiers, it does not necessarily increase diversity in large language models as commonly claimed, and establishes temperature scaling as the unique linear calibration method that preserves hard predictions.
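
The two classifier properties named in this summary can be checked numerically on a toy example. The sketch below divides a hypothetical logit vector by a temperature T before the softmax and reports the predicted class and Shannon entropy; it illustrates the behavior for one input and is not the paper's analysis.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # hypothetical classifier outputs
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    predicted = probs.index(max(probs))
    print(f"T={t}: predicted class {predicted}, entropy {entropy(probs):.3f}")
```

Dividing by a positive temperature keeps the ordering of the logits, so the top class stays the same while the distribution flattens as T grows.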

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Innovation: An Almost Characterization of Hallucination

Researchers have introduced the concept of 'innovation' as a fundamental property that characterizes hallucination in large language models, showing that it serves as an almost-complete mathematical characterization of when LLMs produce false information. The work extends prior research by Kalai and Vempala, establishing that innovation (the tendency to generate outputs outside the training data) leads to hallucination with high probability, and it provides new theoretical bounds on hallucination rates.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition

Researchers introduce MiRD, a two-stage framework that improves reliable prediction for open-ended question answering by separately addressing sampling failures and selection errors. The approach maintains calibration-set integrity while controlling hallucinations in AI models, outperforming existing conformal prediction methods across multiple datasets and models.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

Researchers propose using conditional optimal transport to improve calibration of Process Reward Models (PRMs) used in AI inference-time scaling, addressing the problem of overestimated success probabilities. The method enables better confidence bounds for mathematical reasoning tasks and improves downstream performance in Best-of-N selection frameworks.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

Researchers propose a novel black-box confidence estimation method for chain-of-thought reasoning that measures trajectory convergence rather than relying on expensive sampling. Testing across multiple benchmarks and AI models shows significant improvements over self-consistency baselines while requiring only 4 samples instead of 8, with potential applications for safer API-based AI deployment.

🧠 GPT-5 · 🧠 Claude · 🧠 Sonnet