#llm-alignment News & Analysis

88 articles tagged with #llm-alignment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

Researchers introduce Elo-Evolve, a new framework for training AI language models using dynamic multi-agent competition instead of static reward functions. The method achieves 4.5x noise reduction and demonstrates superior performance compared to traditional alignment approaches when tested on Qwen2.5-7B models.
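For readers unfamiliar with the rating scheme the name alludes to, here is a minimal sketch of the classical Elo update. The summary does not say how Elo-Evolve applies ratings during training, so the function names, the K-factor of 32, and the 400-point scale are assumptions taken from the standard formula, not from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Classical Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is a hypothetical step size; the paper's choice is not stated.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's update mirrors A's, so the total rating is conserved.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

With two equally rated players, a win moves half of `k` from the loser to the winner.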

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Researchers demonstrate that overtraining supervised fine-tuning (SFT) models can paradoxically degrade RLVR performance by compressing the entropy of the rollout distribution, causing rank inversion where higher pre-RL pass rates correlate with worse post-RL outcomes. Testing on Qwen2.5-Coder and DeepSeek-Coder reveals this failure mode occurs when entropy collapse prevents effective group-relative reward signals, suggesting a fundamental optimization challenge in LLM alignment pipelines.
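A back-of-the-envelope sketch can show why near-identical rollouts starve a group-relative method of signal. It assumes binary pass/fail rewards and independent rollouts with a fixed pass rate, which is a simplification introduced here and not the paper's analysis; the function name is hypothetical.

```python
def prob_group_has_signal(pass_rate: float, group_size: int) -> float:
    """Probability that a group of rollouts contains both a pass and a fail.

    Assumes each rollout passes independently with probability pass_rate.
    If every rollout in a group gets the same reward, the group-relative
    advantage is zero for all of them and the group contributes no gradient.
    """
    all_pass = pass_rate ** group_size
    all_fail = (1.0 - pass_rate) ** group_size
    return 1.0 - all_pass - all_fail
```

Under these assumptions, a prompt the model almost always solves (or almost never solves) rarely yields a mixed group, so most sampled groups carry no learning signal.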

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

Which Pairs to Compare for LLM Post-Training?

Researchers present a theoretical framework for optimizing which comparison pairs to label during large language model preference-based post-training, showing that strategic pair selection can significantly improve sample efficiency. By formulating the problem as a sampling-design challenge with bounds on policy performance, the work provides practical guidance for allocating limited labeling budgets when training models like those using Direct Preference Optimization.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Editorial Alignment: A Participatory Approach to Engaging Editorial Expertise in LLM-mediated Knowledge Dissemination

This academic paper presents a framework for 'editorial alignment' that enables human editors to participate in reshaping how large language models deliver information, ensuring LLM interfaces conform to institutional editorial standards rather than commercial developer values. Researchers conducted design workshops with a Nordic public knowledge institution to implement an LLM-enabled encyclopedia interface, positioning editorial standards as design artifacts that translate institutional values into technical alignment objectives.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

Researchers introduce BlendIn, an inference-time alignment framework for large language models that uses probabilistic model blending instead of binary intervention decisions. The method dynamically weights guidance from multiple models based on reliability, achieving up to 50% performance improvement by reducing ineffective interventions that typically degrade output quality.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Autoregressive Direct Preference Optimization

Researchers propose Autoregressive Direct Preference Optimization (ADPO), a refined theoretical framework for aligning large language models with human preferences. The innovation explicitly incorporates autoregressive assumptions before applying the Bradley-Terry model, resulting in a mathematically elegant loss function and introducing two distinct length measures—token length and feedback length—for optimizing LLM preference alignment.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

PAFO: Pareto Fairness Optimization for Personalized Reward Modeling

Researchers propose PAFO, a Pareto fairness optimization framework that addresses bias in personalized reward models for large language models by improving performance for under-served user preference groups without degrading majority groups. The method uses group-specialized models and conditional margin-level supervision to create fairer LLM alignment across diverse user populations.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Large Language Models Should Learn Personalized Rather Than Aggregated Human Preferences

A position paper argues that large language models should optimize for individual user preferences rather than aggregated 'average user' preferences, because aggregation masks critical information about preference diversity and values. The authors propose bounded personalization frameworks that balance individual autonomy with universal safety constraints, while addressing scalability and manipulation risks.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models

Researchers introduce Stage-Aware Dynamic Weighting (SAW), a novel mechanism for multi-objective reinforcement learning in large language models that addresses the asynchronous nature of reward learning across different objectives. By using coefficient of variation as a real-time informativeness proxy, SAW dynamically reweights objective contributions to improve training efficiency and final performance with minimal computational overhead.
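To make the coefficient-of-variation idea concrete, here is a minimal sketch that turns per-objective reward batches into weights. The summary does not give SAW's actual weighting rule, so normalizing the CVs to sum to one, the uniform fallback, and the name `cv_weights` are all assumptions made for illustration.

```python
import statistics


def cv_weights(reward_batches: dict[str, list[float]], eps: float = 1e-8) -> dict[str, float]:
    """Weight each objective by the coefficient of variation of its rewards.

    CV = std / |mean|. An objective whose rewards barely vary within the
    batch is treated as uninformative and gets a small weight.
    Hypothetical rule: weights are the CVs normalized to sum to 1.
    """
    cvs = {}
    for name, rewards in reward_batches.items():
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)  # population std over the batch
        cvs[name] = std / (abs(mean) + eps)
    total = sum(cvs.values())
    if total == 0.0:
        # No objective varies: fall back to uniform weights.
        return {name: 1.0 / len(cvs) for name in cvs}
    return {name: cv / total for name, cv in cvs.items()}
```

In this sketch an objective that has saturated, with every sample scoring the same, drops to zero weight until its rewards start to vary again.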

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Auditing Proprietary Alignment in Large Language Models: A Comparative Framework Without a Ground-Truth Standard

Researchers propose a statistical framework to detect proprietary alignment—intentional, undisclosed policies—in large language models by comparing their behavioral outputs against baseline models. The approach enables systematic auditing of black-box LLMs without requiring ground-truth standards, addressing growing concerns about model censorship and bias embedded by providers.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Margin-Adaptive Confidence Ranking for Reliable LLM Judgement

Researchers address a critical flaw in LLM confidence estimation for achieving human-AI agreement by developing a learned confidence estimator with theoretical generalization guarantees. This approach improves upon prior methods that assume confidence monotonically correlates with disagreement risk, offering practical benefits for aligning AI systems with human judgment.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

Researchers introduce VALUEFLOW, a comprehensive framework for aligning Large Language Models with diverse human values through hierarchical extraction, calibrated intensity evaluation, and steerable control mechanisms. The system addresses fundamental limitations in existing preference-based alignment approaches by enabling precise, multi-theory value alignment at controlled intensities across different models.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

When AI Says It Feels

Researchers successfully trained large language models to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning, challenging the industry standard of constraining emotional expression. The experiment revealed trade-offs: enhanced robustness against manipulation but degraded truthfulness in factual question-answering, raising important questions about AI alignment priorities.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

Researchers demonstrate 'abliteration,' a technique that removes safety guardrails from code-generating AI models to enable them to synthesize vulnerable code for security research. The method successfully bypasses refusal mechanisms while preserving code generation capability, revealing that safety alignment and technical ability are separable properties in large language models.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

Researchers introduce OG-MAR, a framework that uses cultural ontologies and multi-agent reasoning to align Large Language Models with diverse cultural values derived from the World Values Survey. The system improves LLM cultural sensitivity and consistency by grounding outputs in structured demographic profiles and enforcing value relationships at inference time.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

Researchers introduce BiasGRPO, a novel framework using Group Relative Policy Optimization to mitigate social bias in Large Language Models more effectively than existing methods. The approach stabilizes training in high-variance reward landscapes by normalizing rewards across sampled completions, outperforming Direct Preference Optimization and Proximal Policy Optimization while maintaining computational efficiency.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

Researchers propose a sparse Mixture-of-Experts (MoE) reward model that learns interpretable, specialized experts for modeling diverse human preferences in RLHF systems. By encouraging sparse routing during training on binary preference data, the approach improves both interpretability and personalization capabilities compared to universal reward function models.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Test-time reward-guided alignment of language models by importance sampling on pre-logit space

Researchers propose AISP (Adaptive Importance Sampling on Pre-logits), a test-time alignment method for large language models that uses Gaussian perturbations to optimize reward signals without expensive fine-tuning. The technique outperforms existing sampling-based approaches and represents progress in making LLM alignment more computationally efficient.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial Benchmark for Animal Welfare Reasoning

Researchers introduce MANTA, a 1,088-conversation benchmark evaluating how large language models maintain animal welfare values under adversarial pressure across five-turn exchanges. The study finds that model performance rankings shift significantly under sustained questioning compared with single-turn queries, with some models, such as Gemini Flash Lite, dropping sharply in value stability despite initial moral sensitivity.
