AIBullish · arXiv CS AI · 14h ago · 6/10
Efficient Process Reward Modeling via Contrastive Mutual Information
Researchers propose CPMI, an automated method for training process reward models that reduces annotation costs by 84% and computational overhead by 98% compared to traditional Monte Carlo approaches. The technique uses contrastive mutual information to assign reward scores to reasoning steps in AI chain-of-thought trajectories without expensive human annotation or repeated LLM rollouts.
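The core idea, scoring a reasoning step by how much information it carries about reaching a correct answer via a contrastive objective, can be sketched as a minimal InfoNCE-style estimator. This is an illustrative assumption, not the paper's implementation: the function name, the use of precomputed unit-normalized embeddings, and the temperature value are all hypothetical.

```python
import numpy as np

def contrastive_step_score(step_vec, positive_vec, negative_vecs, temperature=0.1):
    """Hypothetical InfoNCE-style proxy for a step's mutual information
    with a correct outcome.

    step_vec:      unit-normalized embedding of one reasoning step
    positive_vec:  embedding of a continuation that reaches the right answer
    negative_vecs: embeddings of distractor continuations

    Returns the log-probability assigned to the positive continuation
    under a softmax over cosine similarities; higher means the step is
    more predictive of success.
    """
    candidates = np.vstack([positive_vec, *negative_vecs])
    logits = candidates @ step_vec / temperature
    # log-softmax, taking the positive candidate's entry as the step score
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(log_probs[0])
```

A step whose embedding aligns with the successful continuation scores higher than one aligned with a distractor, which is the contrastive signal that replaces Monte Carlo rollouts in this framing.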