arXiv · CS AI · 14h ago
Influencing Humans to Conform to Preference Models for RLHF
Researchers demonstrate that human preferences can be influenced to better align with the mathematical models used in RLHF algorithms, without changing the underlying reward functions. Through three interventions (revealing model parameters, training humans on the preference model, and modifying elicitation questions), the study shows significant improvements in preference data quality and AI alignment outcomes.
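The summary doesn't specify which preference model the study uses, but RLHF pipelines commonly assume a Bradley-Terry model, in which the probability a labeler prefers response A over response B is a logistic function of the reward difference. A minimal sketch of that assumed model (function name is illustrative, not from the paper):

```python
import math

def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry model: probability a labeler prefers response A
    over response B, given scalar rewards for the two responses."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Equal rewards imply indifference; a larger reward gap implies a
# stronger (but never certain) preference.
print(preference_probability(1.0, 1.0))  # → 0.5
print(preference_probability(2.0, 0.0))  # ≈ 0.88
```

Interventions like those described would aim to make actual human choice frequencies match these modeled probabilities more closely, so the preference data better fits the algorithm's assumptions.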