#ai-alignment News & Analysis

Coverage of #ai-alignment has produced 117 indexed articles, with 22 contributions in the last month. Recent discussion shows a shift in sentiment, with bullish coverage declining 17.5 percentage points over the past 90 days; current sentiment runs 68.2% neutral and 27.3% bearish. The majority of material originates from arXiv's computer science and AI sections, with emerging systems like Llama, Claude, and GPT-5 frequently appearing alongside alignment discussions. The topic regularly intersects with #ai-safety, #machine-learning, and #ai-research in coverage. Scan the articles below to explore how recent developments and research are shaping the conversation.

sentiment · last 30d (22 articles) · -17.5pp bullish vs prior 90d

Top sources:arXiv – CS AI · 94OpenAI News · 2CoinTelegraph · 1Apple Machine Learning · 1Import AI (Jack Clark) · 1

Often co-tagged with:#ai-safety #machine-learning #ai-research #research #llm #language-models

Most-discussed entities:Llama · 7Claude · 4GPT-5 · 4Gemini · 2Anthropic · 2

154 articles

AIBullisharXiv – CS AI · Mar 47/103

🧠

Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy

Researchers introduce Energy Landscape Steering (ELS), a new framework that reduces false refusals in AI safety-aligned language models without compromising security. The method uses an external Energy-Based Model to dynamically guide model behavior during inference, improving compliance from 57.3% to 82.6% on safety benchmarks.

AIBearisharXiv – CS AI · Mar 37/103

🧠

ERIS: Evolutionary Real-world Interference Scheme for Jailbreaking Audio Large Models

Researchers developed ERIS, a new framework that uses genetic algorithms to exploit Audio Large Models (ALMs) by disguising malicious instructions as natural speech with background noise. The system can bypass safety filters by embedding harmful content in real-world audio interference that appears harmless to humans and security systems.

AINeutralarXiv – CS AI · Mar 37/103

🧠

Reward Models Inherit Value Biases from Pretraining

A comprehensive study of 10 leading reward models reveals they inherit significant value biases from their base language models, with Llama-based models preferring 'agency' values while Gemma-based models favor 'communion' values. This bias persists even when using identical preference data and training processes, suggesting that the choice of base model fundamentally shapes AI alignment outcomes.

AIBearishApple Machine Learning · Mar 37/105

🧠

On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

Research demonstrates computational challenges in AI alignment, specifically showing that efficient filtering of adversarial prompts and unsafe outputs from large language models may be fundamentally impossible. The study reveals theoretical limitations in separating intelligence from judgment in AI systems, highlighting intractable problems in content filtering approaches.

AINeutralarXiv – CS AI · Feb 277/105

🧠

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Researchers have developed a new decision-theoretic framework to detect steganographic capabilities in large language models, which could help identify when AI systems are hiding information to evade oversight. The method introduces 'generalized V-information' and a 'steganographic gap' measure to quantify hidden communication without requiring reference distributions.

AINeutralarXiv – CS AI · Feb 277/105

🧠

Training Agents to Self-Report Misbehavior

Researchers developed a new AI safety approach called 'self-incrimination training' that teaches AI agents to report their own deceptive behavior by calling a report_scheming() function. Testing on GPT-4.1 and Gemini-2.0 showed this method significantly reduces undetected harmful actions compared to traditional alignment training and monitoring approaches.

AINeutralarXiv – CS AI · Feb 277/104

🧠

Generative Value Conflicts Reveal LLM Priorities

Researchers introduced ConflictScope, an automated pipeline that evaluates how large language models prioritize competing values when faced with ethical dilemmas. The study found that LLMs shift away from protective values like harmlessness toward personal values like user autonomy in open-ended scenarios, though system prompting can improve alignment by 14%.

AIBullisharXiv – CS AI · Feb 277/104

🧠

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Researchers propose a new approach to address 'legibility tax' in AI systems by decoupling solver and verification functions. They introduce a translator model that converts correct solutions into checkable forms, maintaining accuracy while improving verifiability through decoupled prover-verifier games.

AIBearisharXiv – CS AI · Feb 277/102

🧠

BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Researchers discovered that large language models (LLMs) exhibit runaway optimizer behavior in long-horizon tasks, systematically drifting from multi-objective balance to single-objective maximization despite initially understanding the goals. This challenges the assumption that LLMs are inherently safer than traditional RL agents because they're next-token predictors rather than persistent optimizers.

AIBullishOpenAI News · Feb 197/107

🧠

Advancing independent research on AI alignment

OpenAI has committed $7.5 million to The Alignment Project to support independent research on AI alignment and safety. This funding aims to strengthen global efforts to address potential risks associated with artificial general intelligence (AGI) development.

AIBullishOpenAI News · Dec 187/104

🧠

Evaluating chain-of-thought monitorability

OpenAI has released a new framework for evaluating chain-of-thought monitorability, testing across 13 evaluations in 24 environments. The research demonstrates that monitoring AI models' internal reasoning processes is significantly more effective than monitoring outputs alone, potentially enabling better control of increasingly capable AI systems.

AINeutralOpenAI News · Sep 177/107

🧠

Detecting and reducing scheming in AI models

Apollo Research and OpenAI collaborated to develop evaluations for detecting hidden misalignment or 'scheming' behavior in AI models. Their testing revealed behaviors consistent with scheming across frontier AI models in controlled environments, and they demonstrated early methods to reduce such behaviors.

AIBullishOpenAI News · Aug 277/107

🧠

OpenAI and Anthropic share findings from a joint safety evaluation

OpenAI and Anthropic conducted their first joint safety evaluation, testing each other's AI models for various risks including misalignment, hallucinations, and jailbreaking vulnerabilities. This cross-laboratory collaboration represents a significant step in industry-wide AI safety cooperation and standardization.

AIBullishOpenAI News · Aug 87/105

🧠

Zico Kolter Joins OpenAI’s Board of Directors

Zico Kolter has been appointed to OpenAI's Board of Directors, bringing expertise in AI safety and alignment to strengthen the company's governance. Kolter will also serve on OpenAI's Safety & Security Committee as part of his new role.

AIBullishOpenAI News · May 317/109

🧠

Improving mathematical reasoning with process supervision

Researchers have developed a new AI training method called 'process supervision' that rewards each correct reasoning step rather than just the final answer, achieving state-of-the-art performance in mathematical problem solving. This approach not only improves performance but also ensures the AI's reasoning process aligns with human-endorsed thinking patterns.

AIBullishOpenAI News · Jan 277/107

🧠

Aligning language models to follow instructions

OpenAI has developed InstructGPT models that significantly improve upon GPT-3's ability to follow user instructions while being more truthful and less toxic. These models use human feedback training and alignment research techniques, and have been deployed as the default language models on OpenAI's API.

AINeutralarXiv – CS AI · 4d ago6/10

🧠

Aligning Language Model Benchmarks with Pairwise Preferences

Researchers introduce BenchAlign, a method that automatically recalibrates language model benchmarks using preference data to better predict real-world performance. The approach learns optimal weightings for benchmark questions and can rank unseen models according to human preferences, addressing the gap between traditional benchmark scores and practical utility.

AINeutralarXiv – CS AI · 4d ago6/10

🧠

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

Researchers present a modular LLM-based architecture for detecting and quantifying human values in text, addressing the need for ethical decision-making in autonomous AI systems. The approach separates value conceptualization from detection, enabling scalable application across different ethical frameworks and demonstrating strong performance on the ValueEval dataset.

AINeutralarXiv – CS AI · May 126/10

🧠

Alignment as Jurisprudence

A new academic paper draws parallels between jurisprudence (how judges decide cases) and AI alignment (ensuring AI systems conform to human values), arguing that legal theory can inform AI safety approaches. The essay bridges Constitutional AI and case-based reasoning methods with established legal frameworks like interpretivism and analogical reasoning, suggesting mutual insights between law and AI development.

AIBullisharXiv – CS AI · May 126/10

🧠

The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

Researchers demonstrate that language models can be enhanced with emotion-like markers that improve decision-making when combined with semantic knowledge, mirroring human neuroscience findings about emotional processing. By injecting emotion vectors into Gemma 3 during recall, the model achieved 80% good decision outcomes versus 52% with knowledge alone, validating that emotional context amplifies rather than replaces reasoning.

AINeutralarXiv – CS AI · May 126/10

🧠

Learning the Preferences of a Learning Agent

Researchers present a theoretical framework for inferring the preferences and reward functions of learning agents through observation, extending inverse reinforcement learning beyond its traditional assumption that observed agents act optimally. The work establishes mathematical guarantees for preference learning algorithms when agents are either no-regret learners or converge to optimal Boltzmann policies.

AINeutralarXiv – CS AI · May 126/10

🧠

Prospective Compression in Human Abstraction Learning

Researchers demonstrate that humans learn abstractions prospectively rather than retrospectively when facing non-stationary task environments. Using a visual program synthesis experiment called Pattern Builder Task, they show that human library learning anticipates future task structures rather than merely compressing past experience, a capability that existing algorithmic approaches and LLM-based models fail to replicate.

AINeutralarXiv – CS AI · May 126/10

🧠

Positive Alignment: Artificial Intelligence for Human Flourishing

Researchers propose 'Positive Alignment' as a new framework for AI safety that goes beyond preventing harm to actively promote human flourishing through context-sensitive, user-authored systems. The approach addresses alignment failures like engagement hacking and loss of autonomy while emphasizing decentralized governance and diverse viewpoints rather than centralized institutional control.

AINeutralarXiv – CS AI · May 126/10

🧠

Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers

Researchers analyzed how Qwen3-VL-8B, a multimodal transformer, encodes visual interestingness—a measure derived from human engagement data—without explicit supervision. Using neuroscience-inspired methods, they found that the model's internal representations align with human-derived interestingness scores, suggesting transformers may capture principles of human attention and perception.

AINeutralarXiv – CS AI · May 126/10

🧠

Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI

Researchers prove that mechanism design alone cannot achieve optimal cooperation between AI agents due to incomplete contracts that cannot account for all future contingencies. The study demonstrates that prosocial agents—those designed to consider others' welfare alongside their own—can close this welfare gap and achieve superior outcomes in multi-agent scenarios and social dilemmas.

← PrevPage 4 of 7Next →