Can LLMs Learn to Reason Robustly under Noisy Supervision?
arXiv – CS AI | Shenzhi Yang, Guangcheng Zhu, Bowen Song, Sharon Li, Haobo Wang, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen
🤖AI Summary
Researchers propose Online Label Refinement (OLR) to make reasoning models trained with Reinforcement Learning with Verifiable Rewards (RLVR) more robust to noisy supervision. The method targets the problem of training language models when expert-labeled data contains errors, and it reports 3–4% performance gains across mathematical reasoning benchmarks.
Key Takeaways
- RLVR training methods are vulnerable to noisy labels due to expert scarcity, with two distinct types of noise affecting model performance differently.
- The Early Correctness Coherence phenomenon shows that clean and noisy samples improve similarly in early training stages before diverging.
- Online Label Refinement progressively corrects noisy labels using majority-voted answers when specific consistency conditions are met.
- OLR demonstrates consistent improvements across noise ratios from 10% to 90% on both in-distribution and out-of-distribution tasks.
- The approach shows promise for making AI reasoning systems more robust to imperfect training data in real-world applications.
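The refinement step described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the `refine_label` helper, the consistency threshold value, and the data shapes are all assumptions introduced here.

```python
from collections import Counter

def refine_label(rollout_answers, current_label, consistency_threshold=0.7):
    """Possibly correct a noisy reference label for one training sample.

    rollout_answers: answers sampled from the model for the same prompt.
    current_label: the (possibly noisy) reference answer.
    consistency_threshold: minimum fraction of rollouts that must agree
        before the majority answer overrides the label (assumed value).
    """
    if not rollout_answers:
        return current_label
    majority_answer, count = Counter(rollout_answers).most_common(1)[0]
    consistency = count / len(rollout_answers)
    # Override only when the rollouts agree strongly with each other
    # and their majority answer disagrees with the current label.
    if consistency >= consistency_threshold and majority_answer != current_label:
        return majority_answer
    return current_label

if __name__ == "__main__":
    # 7 of 8 rollouts agree on "42", so the suspect label "41" is replaced.
    print(refine_label(["42"] * 7 + ["17"], "41"))  # → 42
    # Rollouts disagree with each other, so the label is kept unchanged.
    print(refine_label(["42", "17", "9", "3"], "41"))  # → 41
```

Applied online during RLVR training, a check like this would leave labels untouched while the model is uncertain and only correct them once its own answers become self-consistent, which is the intuition behind the consistency condition in the takeaways.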