AI · Bullish · arXiv – CS AI · 4h ago
Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
Researchers propose MetaAPO, a new framework for aligning large language models with human preferences that dynamically balances online and offline training data. The method uses a meta-learner to evaluate when on-policy sampling is beneficial, resulting in better performance while reducing online annotation costs by 42%.