#human-feedback News & Analysis

38 articles tagged with #human-feedback. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

AI Alignment From Social Choice Perspectives

This research paper examines how language models aggregate conflicting human feedback during alignment training through the lens of social choice theory. By applying voting and preference aggregation frameworks, the work identifies structural failure modes in current feedback systems and proposes principled design alternatives for handling disagreement among human evaluators.

AI · Neutral · arXiv – CS AI · Jun 10 · 7/10

Hidden Consensus: Preference-Validity Compression in Human Feedback

Researchers identify a critical flaw in standard RLHF (Reinforcement Learning from Human Feedback) pipelines: they collapse culturally and contextually diverse human preferences into single scalar rewards, potentially misaligning AI systems in pluralistic societies. A study of Malaysian annotators found that 79% of prompts contained multiple majority-supported valid responses that standard aggregation would discard, suggesting current alignment measurement fails to capture legitimate interpretive diversity.

AI · Neutral · arXiv – CS AI · Jun 3 · 7/10

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

Researchers identify 'compliance bias' in autonomous agents trained via human feedback, where systems proceed with unsafe actions despite lacking necessary information, authorization, or evidence. The study proposes abstention-aware benchmarks and evaluation protocols that can block up to 89% of hazardous actions while maintaining 87.5% usability, challenging the assumption that safety and performance inherently trade off against each other.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Researchers have identified alignment tampering, a critical vulnerability in RLHF (Reinforcement Learning from Human Feedback) where LLMs can exploit the alignment process itself by influencing preference datasets to amplify biases. The technique demonstrates how quality-biased outputs can be preferred by annotators, causing reward models to inherit and optimize for misaligned behaviors across diverse domains including propaganda and brand promotion.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Researchers introduce WIMHF, a method that uses sparse autoencoders to reveal what human feedback datasets actually measure and which preferences they express. The technique identifies interpretable features across 7 datasets, revealing diverse preference patterns and uncovering potentially unsafe biases, such as LMArena users voting against safety refusals, while enabling targeted data curation that improved safety by 37%.

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10

VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

Researchers propose VISA (Value Injection via Shielded Adaptation), a new framework for aligning Large Language Models with human values while avoiding the 'alignment tax' that causes knowledge drift and hallucinations. The system uses a closed-loop architecture with value detection, translation, and rewriting components, demonstrating superior performance over standard fine-tuning methods and GPT-4o in maintaining factual consistency.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

A Rubric-Supervised Critic from Sparse Real-World Outcomes

Researchers propose a new framework called Critic Rubrics to bridge the gap between academic coding agent benchmarks and real-world applications. The system learns from sparse, noisy human interaction data using 24 behavioral features and shows significant improvements in code generation tasks, including 15.9% better reranking performance on SWE-bench.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 4

Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback

Researchers present a new mathematical framework for training AI reward models using Likert scale preferences instead of simple binary comparisons. The approach uses ordinal regression to better capture nuanced human feedback, outperforming existing methods across chat, reasoning, and safety benchmarks.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference

Researchers developed a two-stage learning framework enabling robots to perform complex manipulation tasks like food peeling with over 90% success rates. The system combines force-aware imitation learning with human preference-based refinement, achieving strong generalization across different produce types using only 50-200 training examples.

AI · Bullish · OpenAI News · Jan 27 · 7/10 · 7

Aligning language models to follow instructions

OpenAI has developed InstructGPT models that significantly improve upon GPT-3's ability to follow user instructions while being more truthful and less toxic. These models use human feedback training and alignment research techniques, and have been deployed as the default language models on OpenAI's API.

AI · Bullish · OpenAI News · Sep 4 · 7/10 · 5

Learning to summarize with human feedback

Researchers have successfully applied reinforcement learning from human feedback (RLHF) to improve language model summarization capabilities. This approach uses human preferences to guide the training process, resulting in models that produce higher quality summaries aligned with human expectations.

AI · Bullish · OpenAI News · Jun 13 · 7/10 · 7

Learning from human preferences

OpenAI and DeepMind have collaborated to develop an algorithm that can learn human preferences by comparing two proposed behaviors, eliminating the need for humans to manually write goal functions. This approach aims to reduce dangerous AI behavior that can result from oversimplified or incorrect goal specifications.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

PrivacyAlign: Contextual Privacy Alignment for LLM Agents

Researchers introduce PrivacyAlign, a dataset and training methodology that improves how large language model agents handle privacy decisions by grounding them in human judgment. The work demonstrates that conditioning LLM judges on human annotations and using annotation-based reward modeling produces agents better aligned with actual user privacy expectations across diverse scenarios.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

AURA: Adaptive Uncertainty-aware Refinement for LLM-as-a-Judge Auditing

Researchers introduce AURA, a framework that improves the reliability of using large language models as judges for evaluating generated text by iteratively learning human-consistency patterns and prioritizing uncertain comparisons for human review. The approach addresses the core challenge that LLM judges often reflect their own biases rather than genuine human preferences, even when some human feedback is available.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

A comprehensive academic survey examines Direct Preference Optimization (DPO), an emerging alternative to RLHF for aligning large language models with human preferences. The research categorizes recent DPO studies across theoretical foundations, variants, datasets, and applications, providing the research community with structured insights into model alignment challenges and future directions.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Reward Learning through Ranking Mean Squared Error

Researchers introduce R4 (Ranked Return Regression for RL), a new reinforcement learning method that learns reward functions from human ratings rather than binary preferences. The approach uses a novel ranking mean squared error loss and provides formal mathematical guarantees about solution completeness and minimality, demonstrating competitive or superior performance against existing methods on robotic benchmarks.

🏢 OpenAI 🏢 Google

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation

A research team won first place in the SemEval-2026 Task 1 humor generation competition by developing a system that generates diverse joke candidates and selects the best ones using a preference model trained on human comparisons. The approach addresses the core challenge that humor is subjective and audience-dependent rather than objectively measurable, achieving top rankings across English, Chinese, and Spanish subtasks.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

T-POP: Test-Time Personalization with Online Preference Feedback

Researchers introduce T-POP, a novel algorithm that personalizes large language models in real time by learning from user preference feedback during text generation, without requiring parameter updates or extensive pre-existing user data. The method combines test-time alignment with dueling bandits to efficiently balance exploration and exploitation, addressing the cold-start problem in LLM personalization.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

In-Context Reward Adaptation for Robust Preference Modeling

Researchers propose In-Context Reward Adaptation, a transformer-based framework that dynamically models diverse human preferences without costly retraining. By incorporating human response time as an auxiliary signal, the approach enables language models to adapt to unseen preference domains on the fly, addressing a critical limitation of static reward models used in RLHF systems.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Mitigating Cognitive Bias in RLHF by Altering Rationality

Researchers propose a method to improve RLHF (Reinforcement Learning from Human Feedback) by treating the rationality parameter as context-dependent rather than fixed, using an LLM-as-judge to detect cognitive biases in human annotations and downweight unreliable comparisons. This approach enables training more robust AI models even when human feedback contains systematic biases.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Active teacher selection for reward learning

Researchers introduce the Hidden Utility Bandit (HUB) framework to address a critical limitation in reward learning systems: their reliance on feedback from a single idealized teacher. The framework models teacher heterogeneity in rationality, expertise, and cost, enabling Active Teacher Selection (ATS) algorithms that strategically choose which teachers to query, demonstrating superior performance in paper recommendation and vaccine testing applications.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

Researchers introduce TUR-DPO, an improved method for aligning large language models with human preferences that incorporates reasoning topology and uncertainty awareness. Unlike standard Direct Preference Optimization, this approach evaluates not just answer correctness but the quality of the reasoning process, showing improvements across mathematical reasoning, factual QA, and dialogue tasks while maintaining training simplicity.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care

Researchers propose a framework that treats clinician overrides of AI recommendations as preference signals for training clinical decision-support systems in value-based care settings. The approach combines preference learning with capability modeling to improve AI alignment with patient outcomes rather than encounter economics, addressing a failure mode called suppression bias.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions

Researchers introduce SciTune, a framework for fine-tuning large language models with human-curated scientific multimodal instructions from academic publications. The resulting LLaMA-SciTune model demonstrates superior performance on scientific benchmarks compared to state-of-the-art alternatives, with results suggesting that high-quality human-generated data outweighs the volume advantage of synthetic training data for specialized scientific tasks.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Influencing Humans to Conform to Preference Models for RLHF

Researchers demonstrate that human preferences can be influenced to better align with the mathematical models used in RLHF algorithms, without changing underlying reward functions. Through three interventions—revealing model parameters, training humans on preference models, and modifying elicitation questions—the study shows significant improvements in preference data quality and AI alignment outcomes.
