The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs
Researchers establish a theoretical framework explaining why large language models optimized through outcome-based reinforcement learning develop brittle reasoning despite strong benchmark performance. The study introduces 'Reward-Induced Manifold Collapse' and demonstrates that process reward models can prevent this failure mode by enforcing information constraints on reasoning steps.