y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#preference-learning News & Analysis

30 articles tagged with #preference-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

30 articles
AIBullisharXiv – CS AI · 4d ago7/10
🧠

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Researchers introduce HDPO, a method that uses hallucination detectors to guide iterative refinement of AI-generated clinical summaries, reducing factual errors by up to 48% in large language models. The approach combines inference-time detection with preference learning for model finetuning, demonstrating significant improvements in factual accuracy while maintaining summary quality for healthcare applications.

🧠 Llama
AIBullisharXiv – CS AI · May 127/10
🧠

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Researchers introduce Auto-Rubric as Reward (ARR), a framework that replaces opaque scalar reward signals in multimodal AI alignment with explicit, structured criteria-based evaluation. By externalizing a model's implicit preferences into interpretable rubrics before comparison, ARR reduces evaluation bias and enables more reliable human-preference alignment in generative models.

AIBullisharXiv – CS AI · May 47/10
🧠

Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies

Researchers introduce Preference Goal Tuning (PGT), a novel post-training framework that optimizes goal embeddings as continuous control variables rather than updating frozen policy parameters. Testing on Minecraft SkillForge demonstrates PGT achieves 72-81% relative improvements over expert-crafted prompts while showing superior generalization in out-of-distribution settings compared to traditional fine-tuning.

AIBullisharXiv – CS AI · May 17/10
🧠

Debiasing Reward Models via Causally Motivated Inference-Time Intervention

Researchers propose a causally motivated method to reduce biases in reward models used for LLM alignment by identifying and suppressing neurons correlated with spurious features like response length. The technique achieves comparable performance to much larger models while editing less than 2% of neurons, suggesting biases are concentrated in early network layers.

AIBullisharXiv – CS AI · Mar 117/10
🧠

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

Researchers introduce ACTIVEULTRAFEEDBACK, an active learning pipeline that reduces the cost of training Large Language Models by using uncertainty estimates to identify the most informative responses for annotation. The system achieves comparable performance using only one-sixth of the annotated data compared to static baselines, potentially making LLM training more accessible for low-resource domains.

🏢 Hugging Face
AINeutralarXiv – CS AI · Mar 97/10
🧠

Aligning Compound AI Systems via System-level DPO

Researchers introduce SysDPO, a framework that extends Direct Preference Optimization to align compound AI systems comprising multiple interacting components like LLMs, foundation models, and external tools. The approach addresses challenges in optimizing complex AI systems by modeling them as Directed Acyclic Graphs and enabling system-level alignment through two variants: SysDPO-Direct and SysDPO-Sampling.

AIBullisharXiv – CS AI · Mar 47/103
🧠

Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals

Researchers introduce Density-Guided Response Optimization (DGRO), a new AI alignment method that learns community preferences from implicit acceptance signals rather than explicit feedback. The technique uses geometric patterns in how communities naturally engage with content to train language models without requiring costly annotation or preference labeling.

AIBullisharXiv – CS AI · Mar 47/103
🧠

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Researchers introduce Skywork-Reward-V2, a suite of AI reward models trained on SynPref-40M, a massive 40-million preference pair dataset created through human-AI collaboration. The models achieve state-of-the-art performance across seven major benchmarks by combining human annotation quality with AI scalability for better preference learning.

AIBullishOpenAI News · Jun 137/107
🧠

Learning from human preferences

OpenAI and DeepMind have collaborated to develop an algorithm that can learn human preferences by comparing two proposed behaviors, eliminating the need for humans to manually write goal functions. This approach aims to reduce dangerous AI behavior that can result from oversimplified or incorrect goal specifications.

AIBullisharXiv – CS AI · 1d ago6/10
🧠

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

Researchers propose FedVPA-GP, a federated learning framework that enables privacy-preserving alignment of large language models while preserving diverse user preferences instead of averaging them into a single monolithic reward model. The approach uses a Gumbel-Softmax prior and orthogonal loss to prevent posterior collapse and successfully disentangles conflicting user intents in decentralized settings.

AINeutralarXiv – CS AI · 6d ago6/10
🧠

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

A new arXiv survey reframes large language model alignment tuning through a data-centric lens, decomposing alignment data construction into three stages: response synthesis, preference evaluation, and preference instantiation. By organizing existing alignment methods into a unified taxonomy, the research identifies design trade-offs and failure modes while establishing principles for improving alignment data pipeline design.

AINeutralarXiv – CS AI · 6d ago6/10
🧠

Linear and Neural Dueling Bandits with Delayed Feedback

Researchers propose novel algorithms (LDB-DF and NDB-DF) for contextual dueling bandits that handle delayed feedback—a critical real-world constraint in recommender systems and LLM alignment. The breakthrough involves an Inverse Probability Weighting mechanism that eliminates bias from delayed observations, achieving theoretical regret bounds of O(d√T) for linear settings.

AINeutralarXiv – CS AI · May 126/10
🧠

Embeddings for Preferences, Not Semantics

Researchers propose a new approach to embedding text for collective decision-making that prioritizes preferential similarity over semantic similarity. The method uses synthetic training data to separate preference signals (stance and values) from semantic nuisance (style and wording), improving preference prediction across deliberation datasets.

🏢 Meta
AINeutralarXiv – CS AI · May 126/10
🧠

Learning the Preferences of a Learning Agent

Researchers present a theoretical framework for inferring the preferences and reward functions of learning agents through observation, extending inverse reinforcement learning beyond its traditional assumption that observed agents act optimally. The work establishes mathematical guarantees for preference learning algorithms when agents are either no-regret learners or converge to optimal Boltzmann policies.

AINeutralarXiv – CS AI · May 116/10
🧠

Multi-Objective Constraint Inference using Inverse reinforcement learning

Researchers introduce MOCI (Multi-Objective Constraint Inference), a novel framework that uses inverse reinforcement learning to extract safety constraints and individual preferences from diverse expert demonstrations where multiple experts have different objectives. The approach addresses limitations in existing methods that assume homogeneous expert behavior and offers improved computational efficiency.

AINeutralarXiv – CS AI · May 116/10
🧠

DT-PBO: an Interpretable Tree-based Surrogate Model for Preferential Bayesian Optimization

Researchers introduce DT-PBO, a tree-based surrogate model for Preferential Bayesian Optimization that prioritizes interpretability over traditional Gaussian Process approaches. The method achieves competitive performance on benchmark functions while providing transparent insights into decision-maker preferences, addressing critical needs in high-stakes domains like healthcare.

$MKR
AINeutralarXiv – CS AI · May 76/10
🧠

StoryAlign: Evaluating and Training Reward Models for Story Generation

Researchers introduce StoryRMB, the first benchmark for evaluating reward models on story generation preferences, and develop StoryReward, a specialized reward model achieving 66.3% accuracy where existing models struggle. The work addresses the challenge of modeling subjective human preferences in narrative generation, enabling better alignment between LLM-generated stories and human expectations.

AINeutralarXiv – CS AI · May 16/10
🧠

Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care

Researchers propose a framework that treats clinician overrides of AI recommendations as preference signals for training clinical decision-support systems in value-based care settings. The approach combines preference learning with capability modeling to improve AI alignment with patient outcomes rather than encounter economics, addressing a failure mode called suppression bias.

AINeutralarXiv – CS AI · Apr 146/10
🧠

Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment

A new arXiv paper argues that AI alignment cannot rely solely on stated principles because their real-world application requires contextual judgment and interpretation. The research shows that a significant portion of preference-labeling data involves principle conflicts or indifference, meaning principles alone cannot determine decisions—and these interpretive choices often emerge only during model deployment rather than in training data.

AINeutralarXiv – CS AI · Apr 146/10
🧠

Relational Preference Encoding in Looped Transformer Internal States

Researchers demonstrate that looped transformers like Ouro-2.6B encode human preferences relationally rather than independently, with pairwise evaluators achieving 95.2% accuracy compared to 21.75% for independent classification. The study reveals that preference encoding is fundamentally relational, functioning as an internal consistency probe rather than a direct predictor of human annotations.

🏢 Anthropic
AINeutralarXiv – CS AI · Apr 106/10
🧠

Explaining Neural Networks in Preference Learning: a Post-hoc Inductive Logic Programming Approach

Researchers propose using Inductive Learning of Answer Set Programs (ILASP) to create interpretable approximations of neural networks trained on preference learning tasks. The approach combines dimensionality reduction through Principal Component Analysis with logic-based explanations, addressing the challenge of explaining black-box AI models while maintaining computational efficiency.

AINeutralarXiv – CS AI · Apr 66/10
🧠

Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs

Research from arXiv shows that Active Preference Learning (APL) provides minimal improvements over random sampling in training modern LLMs through Direct Preference Optimization. The study found that random sampling performs nearly as well as sophisticated active selection methods while being computationally cheaper and avoiding capability degradation.

AIBullisharXiv – CS AI · Mar 266/10
🧠

Safe Reinforcement Learning with Preference-based Constraint Inference

Researchers propose Preference-based Constrained Reinforcement Learning (PbCRL), a new approach for safe AI decision-making that learns safety constraints from human preferences rather than requiring extensive expert demonstrations. The method addresses limitations in existing Bradley-Terry models by introducing a dead zone mechanism and Signal-to-Noise Ratio loss to better capture asymmetric safety costs and improve constraint alignment.

AIBullisharXiv – CS AI · Mar 66/10
🧠

What Is Missing: Interpretable Ratings for Large Language Model Outputs

Researchers introduce the What Is Missing (WIM) rating system for Large Language Models that uses natural-language feedback instead of numerical ratings to improve preference learning. WIM computes ratings by analyzing cosine similarity between model outputs and judge feedback embeddings, producing more interpretable and effective training signals with fewer ties than traditional rating methods.

Page 1 of 2Next →