AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce HDPO, a method that uses hallucination detectors to guide iterative refinement of AI-generated clinical summaries, reducing factual errors by up to 48% in large language models. The approach combines inference-time detection with preference learning for model finetuning, demonstrating significant improvements in factual accuracy while maintaining summary quality for healthcare applications.
🧠 Llama
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce Auto-Rubric as Reward (ARR), a framework that replaces opaque scalar reward signals in multimodal AI alignment with explicit, structured criteria-based evaluation. By externalizing a model's implicit preferences into interpretable rubrics before comparison, ARR reduces evaluation bias and enables more reliable human-preference alignment in generative models.
AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠Researchers introduce Preference Goal Tuning (PGT), a post-training framework that optimizes goal embeddings as continuous control variables while leaving the policy parameters frozen. Tests on Minecraft SkillForge show PGT achieving 72-81% relative improvements over expert-crafted prompts and generalizing better in out-of-distribution settings than traditional fine-tuning.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers propose a causally motivated method to reduce biases in reward models used for LLM alignment by identifying and suppressing neurons correlated with spurious features like response length. The technique achieves comparable performance to much larger models while editing less than 2% of neurons, suggesting biases are concentrated in early network layers.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce ACTIVEULTRAFEEDBACK, an active learning pipeline that reduces the cost of training Large Language Models by using uncertainty estimates to identify the most informative responses for annotation. The system achieves comparable performance using only one-sixth of the annotated data compared to static baselines, potentially making LLM training more accessible for low-resource domains.
🏢 Hugging Face
AI · Neutral · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers introduce SysDPO, a framework that extends Direct Preference Optimization to align compound AI systems comprising multiple interacting components like LLMs, foundation models, and external tools. The approach addresses challenges in optimizing complex AI systems by modeling them as Directed Acyclic Graphs and enabling system-level alignment through two variants: SysDPO-Direct and SysDPO-Sampling.
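For background, the standard single-model Direct Preference Optimization loss that SysDPO is described as extending can be sketched as follows. This is only the well-known per-pair DPO objective, not the SysDPO-Direct or SysDPO-Sampling variants; the function name `dpo_loss`, its argument names, and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. All names here are hypothetical.
    """
    # How much more the policy favors the chosen response than the
    # reference does, relative to the rejected response.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) equals softplus(-x); written in a numerically stable form.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# A policy identical to the reference gives a zero margin.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# A policy that shifted probability toward the chosen response gets a lower loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy raises the chosen response's log-probability relative to the reference and lowers the rejected one's.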
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠Researchers introduce Skywork-Reward-V2, a suite of reward models trained on SynPref-40M, a dataset of 40 million preference pairs created through human-AI collaboration. The models achieve state-of-the-art performance across seven major benchmarks by combining the quality of human annotation with the scalability of AI labeling.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠Researchers introduce Density-Guided Response Optimization (DGRO), a new AI alignment method that learns community preferences from implicit acceptance signals rather than explicit feedback. The technique uses geometric patterns in how communities naturally engage with content to train language models without requiring costly annotation or preference labeling.
AI · Bullish · OpenAI News · Jun 13 · 7/10 · 7
🧠OpenAI and DeepMind have collaborated to develop an algorithm that can learn human preferences by comparing two proposed behaviors, eliminating the need for humans to manually write goal functions. This approach aims to reduce dangerous AI behavior that can result from oversimplified or incorrect goal specifications.
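A minimal sketch of learning from pairwise comparisons, assuming a Bradley-Terry (logistic) preference model with one scalar reward per behavior. The summary above does not specify the algorithm's details, so the model choice, the function name `fit_rewards`, and the toy comparison data are all illustrative assumptions.

```python
import math

def fit_rewards(n_items, comparisons, lr=0.1, epochs=500):
    """Fit one scalar reward per behavior from (winner, loser) index pairs."""
    rewards = [0.0] * n_items
    for _ in range(epochs):
        for winner, loser in comparisons:
            # Modeled probability that the human prefers `winner`.
            p_win = 1.0 / (1.0 + math.exp(-(rewards[winner] - rewards[loser])))
            # Gradient ascent step on the log-likelihood of the observed choice.
            step = lr * (1.0 - p_win)
            rewards[winner] += step
            rewards[loser] -= step
    return rewards

# Hypothetical human judgments: behavior 0 beats 1 and 2, behavior 1 beats 2.
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1), (1, 2)]
rewards = fit_rewards(3, comparisons)
```

No goal function is written by hand here: the ordering of the fitted rewards comes entirely from the comparisons.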
AI · Neutral · arXiv – CS AI · 15h ago · 6/10
🧠Researchers introduce Reward Partition Optimization (RPO), a new method for training language models that eliminates the need for value function estimation in preference-based learning. RPO simplifies the optimization process by normalizing rewards through partition-based formulations, demonstrating superior performance compared to existing approaches like DRO and KTO across multiple model architectures.
AI · Neutral · arXiv – CS AI · 15h ago · 6/10
🧠Researchers propose Bayesian Non-Negative Reward Model (BNRM), a framework that addresses reward hacking vulnerabilities in reinforcement learning from human feedback (RLHF) systems used to align large language models. The approach combines non-negative factor analysis with preference modeling to create more robust, interpretable reward systems resistant to biases and distribution shifts.
AI · Neutral · arXiv – CS AI · 15h ago · 6/10
🧠Researchers propose 'Markov decision contests' as a new reinforcement learning framework that leverages pairwise preferences instead of scalar rewards, proving that stationary Markov policies are optimal and demonstrating superior learning efficiency in long-horizon problems compared to existing methods.
AI · Bullish · arXiv – CS AI · 15h ago · 6/10
🧠Researchers propose SelSkill, a machine learning framework that improves how AI agents decide whether to invoke specific skills during task execution. The method demonstrates significant performance improvements on benchmark tasks by learning when to use skills versus skip them, addressing a gap in existing agentic AI systems that struggle with unnecessary skill invocations.
AI · Neutral · arXiv – CS AI · 15h ago · 6/10
🧠Researchers introduce the Triangulated Preference Shift score, an automated metric that identifies lexical biases introduced during preference learning stages (like RLHF) in large language models without requiring manual curation. The metric isolates language pattern shifts across six model families, revealing that preference tuning may push models toward a 'language of prestige' that diverges from natural human language usage.
AI · Bullish · arXiv – CS AI · 1d ago · 6/10
🧠Researchers propose FedVPA-GP, a federated learning framework that enables privacy-preserving alignment of large language models while preserving diverse user preferences instead of averaging them into a single monolithic reward model. The approach uses a Gumbel-Softmax prior and orthogonal loss to prevent posterior collapse and successfully disentangles conflicting user intents in decentralized settings.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠A new arXiv survey reframes large language model alignment tuning through a data-centric lens, decomposing alignment data construction into three stages: response synthesis, preference evaluation, and preference instantiation. By organizing existing alignment methods into a unified taxonomy, the research identifies design trade-offs and failure modes while establishing principles for improving alignment data pipeline design.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers propose two algorithms (LDB-DF and NDB-DF) for contextual dueling bandits that handle delayed feedback, a common real-world constraint in recommender systems and LLM alignment. The key component is an Inverse Probability Weighting mechanism that eliminates bias from delayed observations, achieving theoretical regret bounds of O(d√T) for linear settings.
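A toy sketch of how inverse probability weighting can remove the bias that delayed feedback introduces. It is not the LDB-DF or NDB-DF algorithm: the uniform delay distribution, the constant reward of 1 per round, and every name below are illustrative assumptions.

```python
import random

def estimate_totals(horizon=300, max_delay=100, seed=0):
    """Return (naive, ipw) estimates of total reward when feedback is delayed.

    Each round yields reward 1.0, revealed after a delay drawn uniformly
    from 0..max_delay-1. Feedback not revealed by the horizon is lost.
    """
    rng = random.Random(seed)
    naive = 0.0
    ipw = 0.0
    for t in range(horizon):
        reward = 1.0
        delay = rng.randrange(max_delay)
        remaining = horizon - t  # rounds left, including this one
        # Probability that this round's feedback arrives before the horizon.
        p_observed = min(1.0, remaining / max_delay)
        if delay < remaining:
            naive += reward             # counts only what arrived in time
            ipw += reward / p_observed  # reweights to offset missing feedback
    return naive, ipw

# Average both estimators over independent runs; the true total is 300.
runs = [estimate_totals(seed=s) for s in range(300)]
naive_mean = sum(n for n, _ in runs) / len(runs)
ipw_mean = sum(w for _, w in runs) / len(runs)
```

Averaged over runs, the naive total falls short of the true total because late rounds are underrepresented, while the reweighted total stays close to it at the cost of higher variance.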
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers propose a new approach to embedding text for collective decision-making that prioritizes preferential similarity over semantic similarity. The method uses synthetic training data to separate preference signals (stance and values) from semantic nuisance (style and wording), improving preference prediction across deliberation datasets.
🏢 Meta
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers present a theoretical framework for inferring the preferences and reward functions of learning agents through observation, extending inverse reinforcement learning beyond its traditional assumption that observed agents act optimally. The work establishes mathematical guarantees for preference learning algorithms when agents are either no-regret learners or converge to optimal Boltzmann policies.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce MOCI (Multi-Objective Constraint Inference), a novel framework that uses inverse reinforcement learning to extract safety constraints and individual preferences from diverse expert demonstrations where multiple experts have different objectives. The approach addresses limitations in existing methods that assume homogeneous expert behavior and offers improved computational efficiency.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce DT-PBO, a tree-based surrogate model for Preferential Bayesian Optimization that prioritizes interpretability over traditional Gaussian Process approaches. The method achieves competitive performance on benchmark functions while providing transparent insights into decision-maker preferences, addressing critical needs in high-stakes domains like healthcare.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers introduce StoryRMB, the first benchmark for evaluating reward models on story generation preferences, and develop StoryReward, a specialized reward model achieving 66.3% accuracy where existing models struggle. The work addresses the challenge of modeling subjective human preferences in narrative generation, enabling better alignment between LLM-generated stories and human expectations.
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers propose a framework that treats clinician overrides of AI recommendations as preference signals for training clinical decision-support systems in value-based care settings. The approach combines preference learning with capability modeling to improve AI alignment with patient outcomes rather than encounter economics, addressing a failure mode called suppression bias.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠A new arXiv paper argues that AI alignment cannot rely solely on stated principles because their real-world application requires contextual judgment and interpretation. The research shows that a significant portion of preference-labeling data involves principle conflicts or indifference, meaning principles alone cannot determine decisions—and these interpretive choices often emerge only during model deployment rather than in training data.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers demonstrate that looped transformers like Ouro-2.6B encode human preferences relationally rather than independently, with pairwise evaluators achieving 95.2% accuracy compared to 21.75% for independent classification. The study concludes that this encoding functions as an internal consistency probe rather than a direct predictor of human annotations.
🏢 Anthropic