Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
Researchers propose a novel reinforcement learning framework that automatically generates process-level supervision from outcome-only feedback, eliminating the need for costly external process supervision. This approach enables fine-grained credit assignment in reasoning tasks by having models identify and learn from their own failed trajectories.