Researchers propose Cross-Model Entropy (CME), a label-free reward signal for reinforcement learning that uses a separate verifier model's likelihood assessment instead of human labels or self-referential signals. The method successfully extends RL post-training to open-ended instruction following across multiple model families, achieving win rates of 52.5-71.4% in head-to-head comparisons.
The bottleneck in post-training large language models has long been the reward signal—the mechanism that guides models toward desired behaviors. Traditional approaches rely on either expensive human preference labels or ground-truth verifiers limited to domains with automatic correctness checks like mathematics and code. This paper addresses a critical limitation in recent label-free approaches that use self-referential signals such as majority voting or token entropy, which can inadvertently reinforce a model's own errors and biases.
Cross-Model Entropy represents a conceptual shift by leveraging an independent verifier model's assessment of generated responses. Rather than relying on a generator's self-consistency, CME measures how "unsurprising" a response appears to a separate evaluator, on the intuition that high-quality outputs should be assigned high likelihood by an external model. Because the reward comes from a model other than the generator, the design is intended to prevent reward hacking through self-consistency, while the reward itself remains training-free and computationally efficient.
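The summary does not give the exact formula, so the following is only a minimal sketch of the idea: it assumes the reward is the verifier's average log-likelihood per response token (the negative of the cross-entropy). The function name `cme_reward`, the length normalization, and the example log-probabilities are all hypothetical illustrations, not the paper's implementation.

```python
import math

def cme_reward(verifier_token_logprobs):
    """Return the mean verifier log-probability over the response tokens.

    Assumption: the reward is the negative per-token cross-entropy of the
    response under the verifier. Higher values mean the verifier finds the
    response less surprising.
    """
    if not verifier_token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(verifier_token_logprobs) / len(verifier_token_logprobs)

# Hypothetical per-token log-probabilities that a verifier assigns to two
# candidate responses to the same prompt.
expected_response = [math.log(0.9), math.log(0.8), math.log(0.85)]
surprising_response = [math.log(0.2), math.log(0.1), math.log(0.3)]

rewards = [cme_reward(expected_response), cme_reward(surprising_response)]
```

In practice the log-probabilities would come from a forward pass of the verifier over the prompt and response. Averaging rather than summing is one way to avoid favoring short responses, and is itself an assumption here.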
The empirical validation across four model families (Qwen, Llama, Gemma, OLMo) in multiple training regimes (pretrained, supervised fine-tuning, instruction-tuned) demonstrates broad applicability. Notably, the approach extends label-free RL to open-ended instruction following—a historically challenging domain where existing self-referential methods perform poorly. This represents meaningful progress in making advanced model post-training more accessible and scalable without human annotation overhead.
For the broader AI research community, CME offers a practical pathway to improving instruction-following capabilities without proportional increases in annotation costs. The method's integration into GRPO with minimal changes suggests straightforward adoption potential, though real-world deployment would require careful consideration of verifier model selection and computational efficiency at scale.
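To illustrate why a continuous scalar reward can slot into GRPO with few changes, here is a minimal sketch of GRPO's group-relative advantage step: rewards for several responses to the same prompt are normalized by the group's mean and standard deviation. The helper name `group_relative_advantages`, the use of the population standard deviation, the `eps` constant, and the reward values are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps). The eps
    term keeps the division defined when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical verifier-based rewards for four responses to one prompt.
group_rewards = [-0.16, -1.70, -0.45, -0.90]
advantages = group_relative_advantages(group_rewards)
```

Under this view, only the function that produces the rewards changes; the normalization and the policy update that consume them stay the same.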
- Cross-Model Entropy uses an independent verifier model's likelihood as a label-free reward signal, removing the need for human annotation and limiting self-consistency gaming.
- CME extends reinforcement learning post-training, historically limited to domains with automatic correctness checks, to open-ended instruction following.
- Empirical results show win rates of 52.5-71.4% across four model families and three training regimes compared to untrained baselines.
- The reward is training-free and continuous, and it integrates into existing GRPO training pipelines with minimal changes.
- Verifier independence is intended to keep models from exploiting the reward through self-referential consistency, a key vulnerability in prior label-free methods.