AI · Neutral · arXiv – CS AI · 2h ago · 6/10
EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA
Researchers propose EAPO, an entropy-driven adaptive method for training large reasoning models on open-ended question answering (QA) tasks. The approach dynamically adjusts the weighting of positive and negative samples during reinforcement learning, and is reported to improve performance on medical QA datasets by balancing response diversity with stability.
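To make the idea of entropy-driven sample weighting concrete, here is a minimal Python sketch. It is not EAPO's actual formulation, which this summary does not give. The sigmoid rule, the `target_entropy` and `sharpness` parameters, and all function names are assumptions chosen for illustration: when the policy's mean entropy falls below a target, more weight goes to negative samples, and when it rises above the target, more weight goes to positive samples.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_weights(mean_entropy, target_entropy, sharpness=5.0):
    """Return (w_pos, w_neg), which always sum to 1.

    Hypothetical rule (an assumption, not taken from the paper):
    a sigmoid of the gap between the current mean entropy and a target.
    Low entropy  -> smaller w_pos, larger w_neg.
    High entropy -> larger w_pos, smaller w_neg.
    """
    gap = mean_entropy - target_entropy
    w_pos = 1.0 / (1.0 + math.exp(-sharpness * gap))
    return w_pos, 1.0 - w_pos


def weighted_policy_loss(samples, w_pos, w_neg):
    """REINFORCE-style surrogate loss with separate positive/negative weights.

    samples: list of (log_prob, advantage) pairs, one per sampled response.
    A sample counts as positive when its advantage is greater than zero.
    """
    total = 0.0
    for log_prob, advantage in samples:
        weight = w_pos if advantage > 0 else w_neg
        total += -weight * advantage * log_prob
    return total / len(samples)


# Usage: a low-entropy policy shifts weight toward negative samples.
w_pos, w_neg = adaptive_weights(mean_entropy=0.2, target_entropy=1.0)
loss = weighted_policy_loss([(-1.0, 1.0), (-2.0, -1.0)], w_pos, w_neg)
```

In a real training loop the log-probabilities and entropies would come from the model's token distributions, and the weighting schedule would follow whatever the paper specifies.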