AI · Neutral · arXiv – CS AI · 7h ago · 6/10
Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning
Researchers introduce the Triangulated Preference Shift score, an automated metric that identifies lexical biases introduced during preference-learning stages (such as RLHF) of large language models, without requiring manual curation. Applied across six model families, the metric isolates shifts in language patterns and suggests that preference tuning may push models toward a 'language of prestige' that diverges from natural human usage.