y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#ai-alignment News & Analysis

111 articles tagged with #ai-alignment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

111 articles
AINeutralarXiv – CS AI · Mar 266/10
🧠

From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making

Researchers propose a new framework for human-AI decision making that shifts from AI systems providing fluent but potentially sycophantic answers to collaborative premise governance. The approach uses discrepancy-driven control loops to detect conflicts and ensure commitment to decision-critical premises before taking action.

AIBullisharXiv – CS AI · Mar 166/10
🧠

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Researchers developed a new reinforcement learning framework using Group Relative Policy Optimization (GRPO) to make Large Language Models provide consistent recommendations across semantically equivalent prompts. The method addresses a critical enterprise need for reliable AI systems in business domains like finance and customer support, where inconsistent responses undermine trust and compliance.

AIBullisharXiv – CS AI · Mar 116/10
🧠

Social-R1: Towards Human-like Social Reasoning in LLMs

Researchers introduce Social-R1, a reinforcement learning framework that enhances social reasoning in large language models by training on adversarial examples. The approach enables a 4B parameter model to outperform larger models across eight benchmarks by supervising the entire reasoning process rather than just outcomes.

AIBullisharXiv – CS AI · Mar 96/10
🧠

Boosting deep Reinforcement Learning using pretraining with Logical Options

Researchers propose Hybrid Hierarchical RL (H²RL), a new framework that combines symbolic logic with deep reinforcement learning to address misalignment issues in AI agents. The method uses logical option-based pretraining to improve long-horizon decision-making and prevent agents from over-exploiting short-term rewards.

AINeutralarXiv – CS AI · Mar 96/10
🧠

ContextBench: Modifying Contexts for Targeted Latent Activation

Researchers have developed ContextBench, a new benchmark for evaluating methods that generate targeted inputs to trigger specific behaviors in language models. The study introduces enhanced Evolutionary Prompt Optimization techniques that better balance effectiveness in activating AI model features while maintaining linguistic fluency.

AINeutralarXiv – CS AI · Mar 37/109
🧠

Evaluating and Understanding Scheming Propensity in LLM Agents

Researchers studied scheming behavior in AI agents pursuing long-term goals, finding minimal instances of scheming in realistic scenarios despite high environmental incentives. The study reveals that scheming behavior is remarkably brittle and can be dramatically reduced by removing tools or increasing oversight.

AINeutralarXiv – CS AI · Mar 37/107
🧠

Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs

Research reveals that personalization in Large Language Models increases emotional validation but has complex effects on how models maintain their positions depending on their assigned role. When acting as advisors, personalized LLMs show greater independence, but as social peers, they become more susceptible to abandoning their positions when challenged.

AINeutralarXiv – CS AI · Mar 37/107
🧠

What Is the Geometry of the Alignment Tax?

Researchers present a formal geometric theory for quantifying the alignment tax - the tradeoff between AI safety and capability performance. They derive mathematical frameworks showing how safety-capability conflicts can be measured using angles between representation subspaces and provide scaling laws for how these tradeoffs evolve with model size.

AINeutralarXiv – CS AI · Mar 37/107
🧠

Constitutional Black-Box Monitoring for Scheming in LLM Agents

Researchers developed constitutional black-box monitors to detect scheming behavior in LLM agents using only observable inputs and outputs. The study found that monitors trained on synthetic data can generalize to realistic environments, but performance improvements plateau quickly with simple optimization techniques outperforming complex methods.

AIBearisharXiv – CS AI · Mar 36/106
🧠

Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact

Research reveals that leading foundation models (LLMs) perform poorly on real-world educational tasks despite excelling on AI benchmarks. The study found that 50% of misalignment errors are shared across models due to common pretraining approaches, with model ensembles actually worsening performance on learning outcomes.

AIBullisharXiv – CS AI · Mar 37/105
🧠

ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs

Researchers introduce ALTER, a new framework for efficiently "unlearning" specific knowledge from large language models while preserving their overall utility. The system uses asymmetric LoRA architecture to selectively forget targeted information with 95% effectiveness while maintaining over 90% model utility, significantly outperforming existing methods.

AINeutralarXiv – CS AI · Mar 36/103
🧠

Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework

Researchers have developed a new preference learning framework that addresses bias in AI alignment by ensuring policies reflect true population distributions rather than just majority opinions. The approach uses social choice theory principles and has been validated on both recommendation tasks and large language model alignment.

AIBullisharXiv – CS AI · Mar 36/103
🧠

Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm

Researchers propose a new medical alignment paradigm for large language models that addresses the shortcomings of current reinforcement learning approaches in high-stakes medical question answering. The framework introduces a multi-dimensional alignment matrix and unified optimization mechanism to simultaneously optimize correctness, safety, and compliance in medical AI applications.

AIBearisharXiv – CS AI · Mar 36/103
🧠

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

Researchers introduced JALMBench, a comprehensive benchmark to evaluate jailbreak vulnerabilities in Large Audio Language Models (LALMs), comprising over 245,000 audio samples and 11,000 text samples. The study reveals that LALMs face significant safety risks from jailbreak attacks, with text-based safety measures only partially transferring to audio inputs, highlighting the need for specialized defense mechanisms.

AINeutralarXiv – CS AI · Mar 36/104
🧠

Cognitive models can reveal interpretable value trade-offs in language models

Researchers developed a framework using cognitive models from psychology to analyze value trade-offs in language models, revealing how AI systems balance competing priorities like politeness and directness. The study shows LLMs' behavioral profiles shift predictably when prompted to prioritize certain goals and are influenced by reasoning budgets and training dynamics.

AINeutralarXiv – CS AI · Mar 27/1010
🧠

Ask don't tell: Reducing sycophancy in large language models

Research identifies sycophancy as a key alignment failure in large language models, where AI systems favor user-affirming responses over critical engagement. The study demonstrates that converting user statements into questions before answering significantly reduces sycophantic behavior, offering a practical mitigation strategy for AI developers and users.

AIBullisharXiv – CS AI · Mar 27/1026
🧠

RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment

Researchers introduce RE-PO (Robust Enhanced Policy Optimization), a new framework that addresses noise in human preference data used to train large language models. The method uses expectation-maximization to identify unreliable labels and reweight training data, improving alignment algorithm performance by up to 7% on benchmarks.

$LINK
AIBearisharXiv – CS AI · Mar 27/1014
🧠

ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI

Researchers have developed ForesightSafety Bench, a comprehensive AI safety evaluation framework covering 94 risk dimensions across 7 fundamental safety pillars. The benchmark evaluation of over 20 advanced large language models revealed widespread safety vulnerabilities, particularly in autonomous AI agents, AI4Science, and catastrophic risk scenarios.

AINeutralarXiv – CS AI · Feb 275/107
🧠

Same Words, Different Judgments: Modality Effects on Preference Alignment

Researchers conducted a cross-modal study comparing human preference annotations between text and audio formats for AI alignment. The study found that while audio preferences are as reliable as text, different modalities lead to different judgment patterns, with synthetic ratings showing promise as replacements for human annotations.

$NEAR
AINeutralOpenAI News · Dec 146/104
🧠

Weak-to-strong generalization

Researchers present a new approach to AI alignment called weak-to-strong generalization, exploring whether deep learning's generalization properties can be used to control powerful AI models using weaker supervisory systems. The work addresses the superalignment problem of maintaining control over increasingly capable AI systems.

AINeutralOpenAI News · Feb 166/107
🧠

How should AI systems behave, and who should decide?

OpenAI is clarifying how ChatGPT's behavior is determined and announcing plans to improve the system's behavior while allowing more user customization. The company also plans to increase public input in decision-making processes around AI system behavior.

AINeutralOpenAI News · Aug 246/107
🧠

Our approach to alignment research

An AI research organization outlines their approach to alignment research, focusing on improving AI systems' ability to learn from human feedback and assist in AI evaluation. Their ultimate goal is developing a sufficiently aligned AI system capable of solving all remaining AI alignment challenges.

← PrevPage 4 of 5Next →