#preference-learning News & Analysis

50 articles tagged with #preference-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

🧠

DPO Unchained: Your Training Algorithm is Secretly Disentangled in Human Choice Theory (and its Loss' Convexity is Dispensable)

Researchers present a theoretical framework that generalizes Direct Preference Optimization (DPO) by connecting it to foundational human choice theory, demonstrating that DPO's loss function need not be convex and that various machine learning approaches can be compatible with different human choice models. This work provides a normative foundation for preference optimization algorithms used in training large language models.
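For context, the loss being generalized can be written in a few lines. The following is a minimal sketch of the standard DPO objective for a single preference pair, as it is commonly stated; it is not the generalized loss from this paper. The function name, argument names, and the default `beta` value are illustrative assumptions.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference model. All names here are hypothetical.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference model the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response.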

AI · Bullish · arXiv – CS AI · May 29 · 7/10

🧠

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Researchers introduce HDPO, a method that uses hallucination detectors to guide iterative refinement of AI-generated clinical summaries, reducing factual errors by up to 48% in large language models. The approach combines inference-time detection with preference learning for model finetuning, demonstrating significant improvements in factual accuracy while maintaining summary quality for healthcare applications.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 12 · 7/10

🧠

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Researchers introduce Auto-Rubric as Reward (ARR), a framework that replaces opaque scalar reward signals in multimodal AI alignment with explicit, structured criteria-based evaluation. By externalizing a model's implicit preferences into interpretable rubrics before comparison, ARR reduces evaluation bias and enables more reliable human-preference alignment in generative models.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

🧠

Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies

Researchers introduce Preference Goal Tuning (PGT), a novel post-training framework that optimizes goal embeddings as continuous control variables rather than updating frozen policy parameters. Testing on Minecraft SkillForge demonstrates PGT achieves 72-81% relative improvements over expert-crafted prompts while showing superior generalization in out-of-distribution settings compared to traditional fine-tuning.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

🧠

Debiasing Reward Models via Causally Motivated Inference-Time Intervention

Researchers propose a causally motivated method to reduce biases in reward models used for LLM alignment by identifying and suppressing neurons correlated with spurious features like response length. The technique achieves comparable performance to much larger models while editing less than 2% of neurons, suggesting biases are concentrated in early network layers.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

🧠

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

Researchers introduce ActiveUltraFeedback, an active learning pipeline that reduces the cost of training large language models by using uncertainty estimates to identify the most informative responses for annotation. The system achieves comparable performance using only one-sixth of the annotated data required by static baselines, potentially making LLM training more accessible for low-resource domains.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · Mar 9 · 7/10

🧠

Aligning Compound AI Systems via System-level DPO

Researchers introduce SysDPO, a framework that extends Direct Preference Optimization to align compound AI systems comprising multiple interacting components like LLMs, foundation models, and external tools. The approach addresses challenges in optimizing complex AI systems by modeling them as Directed Acyclic Graphs and enabling system-level alignment through two variants: SysDPO-Direct and SysDPO-Sampling.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

🧠

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Researchers introduce Skywork-Reward-V2, a suite of AI reward models trained on SynPref-40M, a massive 40-million preference pair dataset created through human-AI collaboration. The models achieve state-of-the-art performance across seven major benchmarks by combining human annotation quality with AI scalability for better preference learning.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

🧠

Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals

Researchers introduce Density-Guided Response Optimization (DGRO), a new AI alignment method that learns community preferences from implicit acceptance signals rather than explicit feedback. The technique uses geometric patterns in how communities naturally engage with content to train language models without requiring costly annotation or preference labeling.

AI · Bullish · OpenAI News · Jun 13 · 7/10 · 7

🧠

Learning from human preferences

OpenAI and DeepMind have collaborated to develop an algorithm that can learn human preferences by comparing two proposed behaviors, eliminating the need for humans to manually write goal functions. This approach aims to reduce dangerous AI behavior that can result from oversimplified or incorrect goal specifications.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

🧠

TRUSTMEM: Learning Trustworthy Memory Consolidation for LLM Agents with Long-Term Memory

Researchers introduce TrustMem, a framework that improves the reliability of memory consolidation in LLM agents by verifying memory updates for accuracy and completeness. The system uses a Memory Transition Verifier and preference-guided reinforcement learning to reduce omissions, corruptions, and hallucinations in long-term memory systems by 40-79%, achieving state-of-the-art performance across multiple benchmarks.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

🧠

Repeated post-training is not Self-improving: Diagnosing Scientific Amnesia in Continual DPO Pipelines

Researchers identify 'scientific amnesia' as a critical failure mode in continual DPO (Direct Preference Optimization) training pipelines where LLMs preserve learned behaviors but fail to accumulate reusable methodological knowledge across sequential training campaigns. Testing five strategy proposers on a 30-campaign benchmark reveals that most approaches degrade performance, with only conservative rule-based scheduling showing consistent improvement.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

🧠

Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning

Researchers present ToolGraph, a framework that improves multi-turn tool-using AI agents through self-evolution via preference learning. By combining schema-derived topology with divergence-point preference optimization, the system achieves a 16.8% improvement over the baseline on benchmark tasks, with gains concentrated in the airline and retail domains.

AI · Neutral · arXiv – CS AI · Jun 19 · 5/10

🧠

PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets

Researchers introduce PrefSQA, a machine learning method that predicts speech quality through pairwise preference comparisons rather than traditional mean opinion scores (MOS). The approach incorporates uncertainty-aware logits and attention mechanisms, demonstrating that preference-based labeling produces cleaner, more reliable datasets than scalar MOS ratings, though improvements vary significantly based on dataset quality.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

🧠

AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

Researchers propose AAPA (Adversarially Anchored Preference Alignment), a framework that enhances large language model post-training by combining supervised fine-tuning with reinforcement learning while using adversarial anchoring to prevent model drift from expert behavior. The method demonstrates consistent improvements across model scales, with performance gains of 3.75-5.77% on benchmark tests.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

🧠

A Regret Minimization Framework on Preference Learning in Large Language Models

Researchers introduce Regret-based Preference Optimization (RePO), a new framework for training large language models that reinterprets reinforcement learning from human feedback (RLHF) through regret minimization rather than reward maximization. The approach models human preferences as behavior-conditioned assessments of relative suboptimality, showing consistent performance gains on mathematical reasoning and preference benchmarks.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

🧠

Large Language Models Should Learn Personalized Rather Than Aggregated Human Preferences

A position paper argues that large language models should optimize for individual user preferences rather than aggregated 'average user' preferences, since aggregation masks critical information about preference diversity and values. The authors propose bounded personalization frameworks that balance individual autonomy with universal safety constraints, while addressing scalability and manipulation risks.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

🧠

DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

Researchers introduce DOG-DPO, a training-free data selection framework that optimizes safety alignment for large language models by treating preference pairs as geometric signals. The method achieves comparable safety performance using only 11% of preference data, significantly reducing computational costs and redundancy in alignment datasets.

AI · Neutral · arXiv – CS AI · Jun 9 · 5/10

🧠

Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries

Researchers present MO-PQUCB, a novel algorithm for personalized multi-objective decision-making that combines conversational queries with bandit feedback to learn user preferences more efficiently. The method uses a Plackett-Luce choice model and shift-invariant regularization to overcome fundamental learning barriers, demonstrating improved regret scaling and robustness to corrupted preference signals compared to existing approaches.
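The Plackett-Luce choice model mentioned above assigns a probability to a ranking by repeatedly picking the next item with probability proportional to the exponential of its score. Below is a minimal sketch of that textbook model only; it is not the MO-PQUCB algorithm, and the function and argument names are illustrative assumptions. It treats the ranked items as the full candidate set.

```python
import math


def plackett_luce_log_prob(scores, ranking):
    """Log-probability of `ranking` (item indices, best first).

    `scores` maps each item index to a real-valued utility. Names are
    hypothetical; only the ranked items are treated as candidates.
    """
    remaining = list(ranking)
    log_prob = 0.0
    for item in ranking:
        # Log-sum-exp over the items not yet placed, shifted for stability.
        top = max(scores[j] for j in remaining)
        log_norm = top + math.log(
            sum(math.exp(scores[j] - top) for j in remaining))
        # Probability of choosing `item` next among the remaining items.
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob
```

With two items this reduces to the Bradley-Terry pairwise model, and with equal scores every ordering of n items has probability 1/n!.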

Crypto · Neutral · arXiv – CS AI · Jun 9 · 6/10

⛓️

From Validator Selection to Portfolio Collection Optimization in Proof-of-Stake Blockchains

Researchers propose a decision-support framework for nominators in proof-of-stake blockchains to optimize validator selection across multiple accounts using multi-objective optimization. The system balances portfolio quality and profitability against diversification and risk mitigation through an interactive navigation procedure.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

🧠

Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

Researchers propose a decoupled architecture for personal AI agents that separates statistical preference learning from semantic intent parsing, enabling lightweight local deployment. The approach uses localized statistical data to modulate remote LLM skill selection decisions, achieving lower regret and higher accuracy than traditional memory-augmented agents.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

🧠

Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning

Researchers introduce MetaRouter, a meta-learning framework that optimizes Large Language Model routing by learning individual users' implicit cost-performance preferences through minimal interaction. The system enables personalized query routing across multiple models, balancing expense reduction with performance maintenance more effectively than existing methods.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

🧠

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

Researchers propose a sparse Mixture-of-Experts (MoE) reward model that learns interpretable, specialized experts for modeling diverse human preferences in RLHF systems. By encouraging sparse routing during training on binary preference data, the approach improves both interpretability and personalization capabilities compared to universal reward function models.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

🧠

DEFLECT: Temporal Counterfactual Preference Learning for Delay-Robust Asynchronous VLAs

Researchers introduce DEFLECT, an offline post-training framework that improves Vision-Language-Action (VLA) robot policies by addressing latency-induced misalignment in asynchronous inference. The method uses counterfactual preference learning to teach policies to favor execution-time-aligned actions over stale prediction-time actions, achieving up to 6.4 percentage-point improvements in high-latency success rates without requiring human labels, reward models, or architectural changes.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

🧠

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

Researchers introduce the Triangulated Preference Shift score, an automated metric that identifies lexical biases introduced during preference learning stages (like RLHF) in large language models without requiring manual curation. The metric isolates language pattern shifts across six model families, revealing that preference tuning may push models toward a 'language of prestige' that diverges from natural human language usage.
