AI · Bullish · arXiv – CS AI · 6h ago · 10
RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment
Researchers introduce RE-PO (Robust Enhanced Policy Optimization), a new framework that addresses noise in human preference data used to train large language models. The method uses expectation-maximization to identify unreliable labels and reweight training data, improving alignment algorithm performance by up to 7% on benchmarks.
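The summary gives no formulas, so the following is only a minimal sketch of how expectation-maximization can flag unreliable preference labels and turn them into training weights. It is not the RE-PO algorithm itself. It assumes a simple two-component mixture (each label is either correct or flipped) with one global reliability parameter; the names `em_label_weights`, `weighted_preference_loss`, and the `margins` input are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def em_label_weights(margins, n_iters=50, init_reliability=0.8):
    """Estimate per-label weights with EM (illustrative assumption, not RE-PO).

    margins[i] is a hypothetical model score: reward of the response labeled
    "preferred" minus reward of the response labeled "rejected".
    Returns (weights, reliability), where weights[i] is the posterior
    probability that label i is correct.
    """
    # Probability the model agrees with each observed label.
    p_agree = [sigmoid(m) for m in margins]
    reliability = init_reliability
    weights = [reliability] * len(margins)
    for _ in range(n_iters):
        # E-step: posterior that each label is correct rather than flipped.
        weights = [
            reliability * p / (reliability * p + (1.0 - reliability) * (1.0 - p))
            for p in p_agree
        ]
        # M-step: reliability is the mean posterior, kept away from 0 and 1.
        reliability = sum(weights) / len(weights)
        reliability = min(max(reliability, 1e-6), 1.0 - 1e-6)
    return weights, reliability


def weighted_preference_loss(margins, weights):
    """Mean negative log-sigmoid loss, with each pair scaled by its weight."""
    losses = [-math.log(sigmoid(m)) for m in margins]
    return sum(w * l for w, l in zip(weights, losses)) / len(margins)


# Toy data: three pairs the model agrees with, one it strongly disagrees with.
example_margins = [3.0, 2.5, 2.0, -3.0]
example_weights, example_reliability = em_label_weights(example_margins)
print(example_weights, example_reliability)
print(weighted_preference_loss(example_margins, example_weights))
```

In this toy setup the pair the model strongly disagrees with receives the smallest weight, so it contributes less to the loss. A real training loop would recompute the margins as the policy changes, which this sketch leaves out.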
$LINK