y0news
#entropy · 2 articles
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠

Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning

Researchers propose EGPO, a new framework that improves large reasoning models by incorporating uncertainty awareness into reinforcement learning training. The approach addresses the "uncertainty-reward mismatch," where current training methods treat high- and low-confidence solutions equally, which limits how well models learn to reason.
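The summary gives only the high-level idea, so the following is a hedged sketch of what uncertainty-aware reward shaping could look like, not EGPO's actual algorithm. It uses self-consistency entropy over sampled final answers as the uncertainty signal; the function names and the reward-scaling scheme are illustrative assumptions.

```python
import math
from collections import Counter

def self_consistency_entropy(answers):
    """Shannon entropy of the model's sampled final answers.

    High entropy means the samples disagree (uncertain model);
    zero entropy means all samples agree (confident model).
    """
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def calibrated_reward(answers, correct_answer, base_reward=1.0, penalty=1.0):
    """Illustrative sketch: scale a verifiable reward by confidence.

    A confident correct majority answer earns more than a lucky guess,
    and a confident wrong answer is penalized more than an uncertain one,
    so confidence and correctness are pushed to agree. (This scheme is
    an assumption, not the paper's exact objective.)
    """
    h = self_consistency_entropy(answers)
    max_h = math.log(len(set(answers))) if len(set(answers)) > 1 else 1.0
    confidence = 1.0 - h / max_h  # normalized to [0, 1]
    majority = Counter(answers).most_common(1)[0][0]
    if majority == correct_answer:
        return base_reward * (0.5 + 0.5 * confidence)
    return -penalty * (0.5 + 0.5 * confidence)
```

Under this scheme a unanimous correct group receives the full reward, while an evenly split group receives only half, mirroring the idea that training should treat high- and low-confidence solutions differently.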

AI · Bullish · arXiv – CS AI · 5d ago · 6/10 · 3
🧠

Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning

Researchers propose Quantile Advantage Estimation (QAE) to stabilize Reinforcement Learning with Verifiable Rewards (RLVR) for large language model reasoning. The method replaces mean baselines with group-wise K-quantile baselines to prevent entropy collapse and explosion, showing sustained improvements on mathematical reasoning tasks.
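The core substitution the summary describes, replacing the group-mean baseline with a group-wise K-quantile baseline, can be sketched in a few lines. This is a minimal illustration of the idea, not QAE as specified in the paper; the function name and array layout are assumptions.

```python
import numpy as np

def quantile_advantages(group_rewards, k=0.5):
    """Group-wise K-quantile baseline in place of the group mean.

    group_rewards: shape (num_prompts, group_size), the verifiable rewards
    of the sampled responses for each prompt. Each response's advantage is
    its reward minus the k-quantile of its own group, which is less
    sensitive to outlier rewards than the mean.
    """
    rewards = np.asarray(group_rewards, dtype=float)
    baseline = np.quantile(rewards, k, axis=1, keepdims=True)
    return rewards - baseline
```

For a group with rewards [0, 1, 1, 1], the median baseline assigns a negative advantage only to the failing sample, whereas a mean baseline (0.75) would also shift every successful sample, one intuition for why a quantile baseline can keep policy entropy from collapsing or exploding.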