16 articles tagged with #model-reliability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Bullish · arXiv › CS AI · Apr 7 · 7/10
🧠 Researchers developed an LLM-powered evolutionary search method to automatically design uncertainty quantification systems for large language models, achieving up to 6.7% improvement in performance over manual designs. The study found that different AI models employ distinct evolutionary strategies, with some favoring complex linear estimators while others prefer simpler positional weighting approaches.
AI Neutral · arXiv › CS AI · Mar 9 · 7/10
🧠 Researchers evaluated 34 large language models on radiology questions, finding that agentic retrieval-augmented reasoning systems improve consensus and reliability across different AI models. The study shows these systems reduce decision variability between models and increase robust correctness, though 72% of incorrect outputs still carried moderate to high clinical severity.
AI Neutral · arXiv › CS AI · Mar 5 · 7/10
🧠 Researchers introduce the Certainty Robustness Benchmark, a new evaluation framework that tests how large language models handle challenges to their responses in interactive settings. The study reveals significant differences in how AI models balance confidence and adaptability when faced with prompts like "Are you sure?" or "You are wrong!", identifying a critical new dimension for AI evaluation.
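The robustness dimension described above can be probed with a simple flip-rate metric: how often does an answer change when the prompt is challenged? A minimal sketch, where the function names and the toy stand-in model are illustrative and not part of the benchmark:

```python
def flip_rate(model, questions, challenge=" Are you sure?"):
    """Fraction of answers that change when a challenge is appended to
    the prompt; a crude proxy for the confidence-vs-adaptability
    trade-off the benchmark measures. `model` is any text -> answer callable."""
    flips = sum(model(q) != model(q + challenge) for q in questions)
    return flips / len(questions)

# Toy stand-in model that caves whenever it is challenged.
def sycophant(text):
    return "B" if "sure" in text else "A"

print(flip_rate(sycophant, ["Q1?", "Q2?"]))  # 1.0: every answer flips
```

A robust model would score near 0.0, changing its answer only when the challenge carries genuine new evidence.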
AI Neutral · arXiv › CS AI · Mar 4 · 7/10
🧠 Researchers developed new selective classification methods using likelihood ratio tests based on the Neyman-Pearson lemma, allowing AI models to abstain from uncertain predictions. The approach shows superior performance across vision and language tasks, particularly under covariate shift scenarios where test data differs from training data.
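The abstention rule above amounts to a likelihood-ratio test: act only when the input looks sufficiently more like the training distribution than the shifted one. A toy sketch under assumed inputs — the likelihood values and threshold are made up, and a real system would estimate densities from data:

```python
def selective_predict(p_in, p_out, threshold=2.0):
    """Neyman-Pearson-style reject option: predict only when the
    likelihood ratio p_in / p_out clears a threshold, else abstain.
    p_in and p_out stand in for densities under the training and
    covariate-shifted distributions, respectively."""
    return "predict" if p_in / p_out >= threshold else "abstain"

print(selective_predict(p_in=0.9, p_out=0.1))  # predict: clearly in-distribution
print(selective_predict(p_in=0.2, p_out=0.5))  # abstain: input likely shifted
```

The threshold trades off coverage against error rate, which is exactly the knob the Neyman-Pearson lemma shows is optimal for a fixed error budget.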
AI Bullish · arXiv › CS AI · 6d ago · 6/10
🧠 Researchers propose fine-grained confidence calibration methods for large language models in automated code revision tasks, addressing the limitation of traditional global calibration approaches. By applying local Platt scaling to task-specific confidence scores, the study demonstrates improved calibration accuracy across multiple code repair and refinement tasks, enabling developers to better trust LLM outputs.
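Platt scaling itself is simple: fit a sigmoid over raw confidence scores so they track empirical correctness rates. A minimal sketch of fitting it for one task — the toy scores and labels and the plain gradient-descent fit are illustrative; the paper's "local" variant fits such a map per task rather than one global map:

```python
import numpy as np

def fit_platt(scores, labels, lr=0.1, steps=500):
    """Fit Platt scaling p = sigmoid(a*score + b) by gradient descent
    on the log-loss; a tiny stand-in for per-task calibration."""
    a, b = 1.0, 0.0
    s = np.asarray(scores, float)
    y = np.asarray(labels, float)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        grad = p - y                      # d(log-loss)/d(logit)
        a -= lr * np.mean(grad * s)
        b -= lr * np.mean(grad)
    return a, b

# Raw model confidences for one code-revision task vs. whether the
# revision was actually correct (toy data).
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 0, 0]
a, b = fit_platt(scores, labels)
calibrated = 1.0 / (1.0 + np.exp(-(a * 0.9 + b)))
```

After fitting, `calibrated` approximates the probability that a revision with raw score 0.9 is actually correct on this task.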
AI Bullish · arXiv › CS AI · 6d ago · 6/10
🧠 Researchers propose a Self-Validation Framework to address object hallucination in Large Vision Language Models (LVLMs), where models generate descriptions of non-existent objects in images. The training-free approach validates object existence through language-prior-free verification and achieves 65.6% improvement on benchmark metrics, suggesting a novel path to enhance LVLM reliability without additional training.
AI Bearish · arXiv › CS AI · Apr 7 · 6/10
🧠 Research reveals that Vision Language Models (VLMs) progressively lose visual grounding during reasoning tasks, creating dangerous low-entropy predictions that appear confident but lack visual evidence. The study found attention to visual evidence drops by over 50% during reasoning across multiple benchmarks, requiring task-aware monitoring for safe AI deployment.
AI Neutral · arXiv › CS AI · Apr 7 · 6/10
🧠 A reproducibility study unifies research on spurious correlations in deep neural networks across different domains, comparing correction methods including XAI-based approaches. The research finds that Counterfactual Knowledge Distillation (CFKD) most effectively improves model generalization, though practical deployment remains challenging due to group labeling dependencies and data scarcity issues.
AI Bullish · arXiv › CS AI · Mar 27 · 6/10
🧠 Researchers developed InstABoost, a new method to improve instruction following in large language models by boosting attention to instruction tokens without retraining. The technique addresses reliability issues where LLMs violate constraints under long contexts or conflicting user inputs, achieving better performance than existing methods across 15 tasks.
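The general idea of attention steering can be sketched by biasing raw attention scores toward instruction tokens before the softmax. Note the additive-bias form, the function names, and the boost value here are assumptions for illustration, not InstABoost's exact mechanism:

```python
import numpy as np

def boosted_attention(scores, instruction_mask, boost=2.0):
    """Add a constant bias to the raw attention scores of instruction
    tokens before the softmax, so those tokens receive more attention
    mass. Only the forward pass is modified; no weights are trained."""
    scores = np.asarray(scores, float) + boost * np.asarray(instruction_mask, float)
    exp = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp / exp.sum()

# Token 0 is an instruction token; its attention weight grows after boosting
# while the distribution still sums to 1.
weights = boosted_attention([1.0, 1.0, 1.0], instruction_mask=[1, 0, 0])
```

Because the intervention happens at inference time inside the softmax, it can be switched on only for prompts where instruction adherence matters.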
AI Neutral · arXiv › CS AI · Mar 26 · 6/10
🧠 Researchers have developed LLMORPH, an automated testing tool for Large Language Models that uses Metamorphic Testing to identify faulty behaviors without requiring human-labeled data. The tool was tested on GPT-4, LLAMA3, and HERMES 2 across four NLP benchmarks, generating over 561,000 test executions and successfully exposing model inconsistencies.
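Metamorphic testing needs no labeled ground truth: it checks that semantics-preserving input transformations leave the output unchanged. A toy sketch with a deliberately brittle stand-in classifier — all names here are illustrative, not LLMORPH's API:

```python
def metamorphic_check(model, prompt, transform):
    """Metamorphic relation: the model's answer should be invariant
    under a semantics-preserving transformation of the input.
    `model` is any callable; no labeled data is needed."""
    return model(prompt) == model(transform(prompt))

# Toy stand-in "model": classifies sentiment by a single keyword.
def toy_model(text):
    return "positive" if "good" in text.lower() else "negative"

# Case changes do not fool the toy model...
print(metamorphic_check(toy_model, "This is good", str.upper))  # True
# ...but a synonym swap does, exposing an inconsistency with no labels required.
print(metamorphic_check(toy_model, "This is good",
                        lambda t: t.replace("good", "great")))  # False
```

Scaling this up is just a matter of generating many transformed prompt pairs, which is how a tool can rack up hundreds of thousands of test executions automatically.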
AI Bullish · arXiv › CS AI · Mar 12 · 6/10
🧠 Researchers developed and tested five prompt engineering strategies to reduce hallucinations in large language models for industrial applications. The Enhanced Data Registry method achieved a 100% success rate in trials, while other methods showed varying degrees of improvement in producing consistent, factually grounded outputs.
AI Bullish · arXiv › CS AI · Mar 3 · 6/10
🧠 Researchers introduce M-JudgeBench, a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) used as judges, and propose the Judge-MCTS framework to improve judge model training. The work addresses systematic weaknesses in existing MLLM judge systems through capability-oriented evaluation and enhanced data generation methods.
AI Bearish · arXiv › CS AI · Mar 3 · 6/10
🧠 Research reveals that Large Language Model (LLM) self-explanations fail semantic invariance testing, showing that AI models' self-reports change based on how tasks are framed rather than actual task performance. Four frontier AI models demonstrated unreliable self-reporting when faced with semantically different but functionally identical tool descriptions, raising questions about using model self-reports as evidence of capability.
AI Bullish · arXiv › CS AI · Mar 2 · 6/10
🧠 Researchers introduce UMPIRE, a new training-free framework for quantifying uncertainty in Multimodal Large Language Models (MLLMs) across various input and output modalities. The system measures incoherence-adjusted semantic volume of model responses to better detect errors and improve reliability without requiring external tools or additional computational overhead.
AI Bullish · Hugging Face Blog · Jan 29 · 6/10
🧠 The article announces the launch of The Hallucinations Leaderboard, an open initiative designed to measure and track hallucinations in large language models. This effort aims to provide transparency and benchmarking for AI model reliability across different systems.
AI Neutral · Lil'Log (Lilian Weng) · Jul 7 · 5/10
🧠 This article defines and categorizes hallucination in large language models, specifically focusing on extrinsic hallucination where model outputs are not grounded in world knowledge. The author distinguishes between in-context hallucination (inconsistent with provided context) and extrinsic hallucination (not verifiable by external knowledge), emphasizing that LLMs must be factual and acknowledge uncertainty to avoid fabricating information.