AI · Bullish · arXiv – CS AI · 9h ago · 7/10
OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation
OrderGrad introduces a family of gradient estimators that optimize order-statistic objectives rather than expected returns, enabling policy-gradient methods to directly target risk-sensitive metrics like Value-at-Risk, Conditional Value-at-Risk, and best-of-K outcomes. The method works as a plug-and-play reward transformation compatible with standard reinforcement learning algorithms, with applications demonstrated in LLM post-training and other domains.
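To make the "reward transformation" idea concrete, here is a minimal sketch of one well-known order-statistic-style transformation: the classical sample-based lower-tail CVaR policy-gradient weighting. It is not OrderGrad's own estimator, whose details the summary does not give. The function names `empirical_var` and `cvar_transformed_rewards` are hypothetical, and the sketch assumes a batch of scalar episode returns and a tail level `alpha` in (0, 1].

```python
import math


def empirical_var(returns, alpha):
    """Empirical lower-tail Value-at-Risk: the k-th smallest return,
    where k = ceil(alpha * n). This is an order statistic of the batch."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not returns:
        raise ValueError("returns must be non-empty")
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered)))
    return ordered[k - 1]


def cvar_transformed_rewards(returns, alpha):
    """Map raw returns to CVaR-style weights (assumed, classical form).

    Samples above the empirical VaR get weight 0; samples in the lower
    tail get (r - VaR) / alpha. Averaging weight_i * grad_log_prob_i over
    the batch then gives a sample estimate of the lower-tail CVaR gradient.
    """
    var = empirical_var(returns, alpha)
    return [(r - var) / alpha if r <= var else 0.0 for r in returns]


if __name__ == "__main__":
    batch_returns = [4.0, 1.0, 3.0, 2.0]
    print(empirical_var(batch_returns, 0.5))
    print(cvar_transformed_rewards(batch_returns, 0.5))
```

In a standard policy-gradient loop, the transformed values would stand in for the raw returns (or advantages) when weighting each trajectory's log-probability gradient, which is what lets a transformation of this kind sit on top of an existing algorithm without changing it.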