AINeutralarXiv โ CS AI ยท 5h ago1
๐ง
Why Does RLAIF Work At All?
Researchers propose the 'latent value hypothesis' to explain why Reinforcement Learning from AI Feedback (RLAIF) enables language models to self-improve through their own preference judgments. The theory suggests that pretraining on internet-scale data encodes human values in representation space, which constitutional prompts can elicit for value alignment.