AI · Bullish · arXiv – CS AI · 5d ago · 6/103
Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
Researchers propose Quantile Advantage Estimation (QAE) to stabilize Reinforcement Learning with Verifiable Rewards (RLVR) for large language model reasoning. The method replaces mean baselines with group-wise K-quantile baselines to prevent entropy collapse and explosion, showing sustained improvements on mathematical reasoning tasks.
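
To make the baseline swap concrete, here is a minimal sketch that assumes each prompt has a group of sampled responses scored with a 0/1 verifiable reward. The function names, the array layout, and the use of NumPy's default linear-interpolation quantile are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def mean_advantages(rewards):
    """Mean-baseline advantage: reward minus the group's mean reward."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean(axis=1, keepdims=True)


def quantile_advantages(rewards, k=0.5):
    """Quantile-baseline advantage: reward minus the group's k-quantile.

    rewards has shape (num_prompts, group_size); k lies in [0, 1].
    Hypothetical helper: the quantile uses NumPy's default interpolation.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = np.quantile(rewards, k, axis=1, keepdims=True)
    return rewards - baseline


# Two groups of four sampled responses with binary rewards:
# a mostly-failed prompt and a mostly-solved prompt.
groups = [[1, 0, 0, 0],
          [1, 1, 1, 0]]

print(mean_advantages(groups))           # every response gets a nonzero advantage
print(quantile_advantages(groups, 0.5))  # only the minority outcome is nonzero
```

In this sketch, with k = 0.5, the mostly-failed group gives a nonzero (positive) advantage only to its single success, and the mostly-solved group gives a nonzero (negative) advantage only to its single failure, whereas the mean baseline assigns a nonzero advantage to every response.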