AI · Neutral · arXiv · CS AI · 14h ago · 6/10
Relational Preference Encoding in Looped Transformer Internal States
Researchers demonstrate that looped transformers such as Ouro-2.6B encode human preferences relationally rather than independently: pairwise evaluators reach 95.2% accuracy, compared with 21.75% for independent classification. The study concludes that preference encoding is fundamentally relational, so the probe functions as an internal consistency check rather than a direct predictor of human annotations.
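The difference between a pairwise evaluator and an independent classifier can be illustrated with a toy probe. The sketch below is not the paper's method and does not load Ouro-2.6B. It assumes synthetic vectors standing in for hidden states, where each (chosen, rejected) pair shares a large random context and differs only by a small shift along a hypothetical "preference direction". A pairwise logistic probe scores the difference `a - b` (Bradley-Terry style), and an independent probe scores each state alone. All names (`make_pairs`, `train_logistic`, `direction`, and so on) are illustrative, and the toy accuracies are not expected to match the reported 95.2% and 21.75%.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_PAIRS = 16, 1000

# Hypothetical preference direction hidden in the synthetic "internal states".
direction = rng.normal(size=DIM)
direction /= np.linalg.norm(direction)

def make_pairs(n, rng):
    """Synthetic (chosen, rejected) states: a large shared per-pair context
    plus a small shift along `direction` (an assumption, not real model data)."""
    context = rng.normal(scale=5.0, size=(n, DIM))
    chosen = context + 0.5 * direction + rng.normal(scale=0.1, size=(n, DIM))
    rejected = context - 0.5 * direction + rng.normal(scale=0.1, size=(n, DIM))
    return chosen, rejected

def pairwise_features(chosen, rejected, rng):
    """Relational view: feature is a - b, label 1.0 when a is the preferred item.
    Order is randomly swapped so both labels occur."""
    swap = rng.random(len(chosen)) < 0.5
    a = np.where(swap[:, None], rejected, chosen)
    b = np.where(swap[:, None], chosen, rejected)
    return a - b, (~swap).astype(float)

def independent_features(chosen, rejected):
    """Independent view: each state is classified alone as preferred or not."""
    x = np.concatenate([chosen, rejected])
    y = np.concatenate([np.ones(len(chosen)), np.zeros(len(rejected))])
    return x, y

def train_logistic(x, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression without a bias term,
    so the pairwise score w @ (a - b) is antisymmetric in (a, b)."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - y) / len(y)
    return w

def accuracy(x, y, w):
    """Fraction of examples where the sign of the score matches the label."""
    return float(np.mean((x @ w > 0) == (y == 1.0)))

train_c, train_r = make_pairs(N_PAIRS, rng)
test_c, test_r = make_pairs(N_PAIRS, rng)

x_tr, y_tr = pairwise_features(train_c, train_r, rng)
x_te, y_te = pairwise_features(test_c, test_r, rng)
w_pair = train_logistic(x_tr, y_tr)
pair_acc = accuracy(x_te, y_te, w_pair)

xi_tr, yi_tr = independent_features(train_c, train_r)
xi_te, yi_te = independent_features(test_c, test_r)
w_ind = train_logistic(xi_tr, yi_tr)
ind_acc = accuracy(xi_te, yi_te, w_ind)

print(f"pairwise probe accuracy:    {pair_acc:.3f}")
print(f"independent probe accuracy: {ind_acc:.3f}")
```

In this toy setup the shared context cancels in `a - b`, so the pairwise probe separates the pair almost perfectly. The independent probe must instead detect a small shift against a large per-pair offset and stays close to chance. This mirrors, only qualitatively, the summary's point that the signal lives in the relation between responses rather than in either response alone.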
Anthropic