y0news

#preference-data News & Analysis

1 article tagged with #preference-data. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 14h ago · 7/10

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Researchers introduce WIMHF, a method using sparse autoencoders to decode what human feedback datasets actually measure and express about AI model preferences. The technique identifies interpretable features across 7 datasets, revealing diverse preference patterns and uncovering potentially unsafe biases (such as LMArena users voting against safety refusals) while enabling targeted data curation that improved safety by 37%.