AINeutral · arXiv · CS AI · 14h ago · 7/10
What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Researchers introduce WIMHF, a method that uses sparse autoencoders to uncover what human feedback datasets actually measure about annotators' preferences over AI responses. Applied to 7 preference datasets, the technique identifies interpretable features that reveal diverse preference patterns and expose potentially unsafe biases, such as LMArena users voting against safety refusals, while enabling targeted data curation that improved safety by 37%.
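A minimal sketch of the core idea, not the authors' implementation: train a sparse autoencoder on the embedding difference between the chosen and rejected response in each comparison, so that sparse dictionary directions can be read as candidate preference features. All names, dimensions, and hyperparameters below are illustrative assumptions, and the random embeddings stand in for what would be real text-encoder outputs.

```python
# Hypothetical sketch of a sparse autoencoder over preference data;
# embeddings are random stand-ins for real encoder outputs.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_embed, d_dict)
        self.decoder = nn.Linear(d_dict, d_embed, bias=False)

    def forward(self, x: torch.Tensor):
        # ReLU keeps activations non-negative; combined with the L1
        # penalty below, each dictionary direction stays sparse enough
        # to be inspected as a distinct preference feature.
        z = torch.relu(self.encoder(x))
        return self.decoder(z), z

d_embed, d_dict, n_pairs = 256, 1024, 4096
# Stand-ins for text-encoder embeddings of each response pair.
chosen = torch.randn(n_pairs, d_embed)
rejected = torch.randn(n_pairs, d_embed)
deltas = chosen - rejected  # the direction the annotator's vote "points at"

sae = SparseAutoencoder(d_embed, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3  # sparsity pressure; an assumed value

for step in range(200):
    recon, z = sae(deltas)
    loss = ((recon - deltas) ** 2).mean() + l1_weight * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Frequently active features are candidate preference patterns; inspecting
# the pairs that maximally activate a feature is how one would attach a
# human-readable description to it.
with torch.no_grad():
    _, z = sae(deltas)
    top = z.mean(0).topk(5).indices
    print("most active dictionary features:", top.tolist())
```

On real data, a feature whose top-activating pairs all involve a refusal in one response would be the kind of signal behind the safety-refusal finding above; curating the comparisons that load on such a feature is what the reported 37% safety improvement refers to.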