AINeutralarXiv – CS AI · 6h ago6/10
🧠
AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models
Researchers propose AAPA (Adversarially Anchored Preference Alignment), a framework that enhances large language model post-training by combining supervised fine-tuning with reinforcement learning while using adversarial anchoring to prevent model drift from expert behavior. The method demonstrates consistent improvements across model scales, with performance gains of 3.75-5.77% on benchmark tests.