AI · Neutral · arXiv – CS AI · 3h ago · 6/10
TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
Researchers introduce TUR-DPO, a method for aligning large language models with human preferences that incorporates reasoning topology and uncertainty awareness. Unlike standard Direct Preference Optimization (DPO), it evaluates not just whether an answer is correct but also the quality of the reasoning behind it. The authors report improvements across mathematical reasoning, factual QA, and dialogue tasks while maintaining training simplicity.
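
For context, the standard DPO objective that the summary contrasts against can be sketched in a few lines. This is only the well-known baseline loss for a single preference pair, not TUR-DPO itself: the summary does not describe how the topology and uncertainty terms enter the objective, so they are not reproduced here. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the
    trainable policy and the frozen reference model. All names and the
    default beta are illustrative assumptions, not taken from the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; larger means the policy favours "chosen".
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# With no preference signal the margin is zero and the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# When the policy shifts probability toward the chosen response, the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss depends only on the final response log-probabilities, which is why standard DPO rewards a preferred answer without looking at how the model reasoned its way there.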