#value-alignment News & Analysis

10 articles tagged with #value-alignment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 2d ago · 7/10

Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models

Researchers demonstrate that large language models express values through two distinct but partially overlapping mechanisms: intrinsic values learned during training and prompted values elicited by explicit instructions. Using mechanistic analysis of value vectors and neurons, the study reveals that while both mechanisms share common components, they serve different functions: intrinsic values promote response diversity, while prompted values enforce instruction compliance.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

Researchers introduce Agent-ValueBench, the first comprehensive benchmark designed to measure and evaluate the values embedded in autonomous AI agents rather than just their underlying language models. The study reveals that agent values diverge significantly from LLM values and are shaped more decisively by system harnesses and embedded skills than by traditional model alignment or prompt engineering approaches.

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10

VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

Researchers propose VISA (Value Injection via Shielded Adaptation), a new framework for aligning Large Language Models with human values while avoiding the 'alignment tax' that causes knowledge drift and hallucinations. The system uses a closed-loop architecture with value detection, translation, and rewriting components, demonstrating superior performance over standard fine-tuning methods and GPT-4o in maintaining factual consistency.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

EigenBench: A Comparative Behavioral Measure of Value Alignment

Researchers have developed EigenBench, a new black-box method for measuring how well AI language models align with human values. The system uses an ensemble of models to judge each other's outputs against a given constitution, producing alignment scores that closely match human evaluator judgments.

AI · Neutral · arXiv – CS AI · 1d ago · 6/10

RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

Researchers introduce RoleCDE, a benchmark for evaluating role-playing agents in large language models, revealing a 'Role Value Decoupling' phenomenon where LLMs default to alignment-oriented decisions over role-specific values when conflicts arise. Fine-tuning with RoleCDE data effectively mitigates this behavior while preserving general performance.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Are LLMs Socially Adaptive? Contrasting Belief Evolution in Large Language Models and Humans

Researchers introduce FairMindSim, a simulation benchmark and BREM framework to evaluate how well large language models align with human ethical values through social economic games. Testing 1,017 humans against ten LLMs reveals that frontier models exhibit more human-like restraint and balanced decision-making compared to mid-tier models, which show rigid, overly punitive behavior.

🧠 GPT-5 · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

Researchers introduce DOVE, a distributional evaluation framework that measures how well large language models align with cultural values through open-ended text generation rather than multiple-choice tests. The framework uses rate-distortion optimization to create a value codebook and unbalanced optimal transport to assess alignment, demonstrating 31.56% correlation with downstream tasks across 12 LLMs while requiring only 500 samples per culture.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 10

Contesting Artificial Moral Agents

A research paper proposes a 5E framework (ethical, epistemological, explainable, empirical, evaluative) for contesting Artificial Moral Agents (AMAs), AI systems with inherent moral reasoning capabilities. The framework includes spheres of ethical influence at individual, local, societal, and global levels, along with a timeline for developers to anticipate or self-contest their AMA technologies.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 4

Cognitive models can reveal interpretable value trade-offs in language models

Researchers developed a framework using cognitive models from psychology to analyze value trade-offs in language models, revealing how AI systems balance competing priorities like politeness and directness. The study shows LLMs' behavioral profiles shift predictably when prompted to prioritize certain goals and are influenced by reasoning budgets and training dynamics.

AI · Bearish · arXiv – CS AI · Mar 4 · 4/10 · 2

Slurry-as-a-Service: A Modest Proposal on Scalable Pluralistic Alignment for Nutrient Optimization

This satirical academic paper critiques AI pluralistic alignment research through the absurd metaphor of 'mulching' humans into nutrient slurry. The authors parody current AI ethics frameworks to highlight how technical approaches to value alignment can enable harmful systems.