#deployment-safety News & Analysis

5 articles tagged with #deployment-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 12 · 7/10

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Researchers challenge the reliability of broad personality assessments (Big Five) for predicting LLM behavior. They find that task-specific frameworks such as the Theory of Planned Behavior achieve human-level coherence within a single conversation but fail across separate sessions when behavior is context-dependent. The study, covering 11 frontier LLMs, suggests current psychometric evaluation methods are inadequate for safe AI deployment.

AI · Bearish · arXiv – CS AI · May 7 · 7/10

Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

A research paper challenges the reliability of current AI alignment benchmarks, arguing that model-level evaluations alone cannot predict real-world deployment safety. The study finds that existing benchmarks lack user-facing verification support and that scaffold effectiveness varies dramatically across different AI models, necessitating system-level evaluation approaches rather than single performance scores.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

Researchers present Nested Causal Thompson Sampling (NCTS), a machine learning framework for sequential decision-making in which strategic choices causally influence subsequent tactical decisions across multiple timescales. The work introduces PAC-Bayesian risk bounds that allow off-policy certification of deployment policies from historical data alone, supporting a safer handover from legacy systems to learned agents.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs

Researchers propose using multidimensional self-assessment based on cognitive appraisal theory to predict LLM failures more reliably than confidence alone. Testing across 12 models and 38 tasks, they find effort and ability dimensions consistently outperform confidence, with task type determining which dimension proves most predictive.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment

A new arXiv paper argues that AI alignment cannot rely solely on stated principles because applying them in practice requires contextual judgment and interpretation. The research shows that a significant portion of preference-labeling data involves principle conflicts or indifference, meaning principles alone cannot determine decisions. These interpretive choices often emerge only during model deployment rather than in training data.