AI · Bullish · arXiv – CS AI · 14h ago · 7/10
Label-Free Reinforcement Learning via Cross-Model Entropy
Researchers propose Cross-Model Entropy (CME), a label-free reward signal for reinforcement learning that relies on a separate verifier model's likelihood assessment rather than on human labels or self-referential signals. The method extends RL post-training to open-ended instruction following across multiple model families, with reported win rates of 52.5% to 71.4% in head-to-head comparisons.
🧠 Llama
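
To make the idea of a verifier-based reward concrete, here is a minimal sketch in Python. The summary above does not give CME's actual formula, so this assumes one plausible reading: the reward is the negative mean per-token cross-entropy of a response under the verifier. The names `cross_model_reward` and `toy_verifier` are hypothetical, and the toy verifier is a hand-written stand-in for a real language model, not part of the paper.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption, not from the paper): given prompt
# tokens and response tokens, return the verifier's probability for each
# response token.
TokenProbFn = Callable[[Sequence[str], Sequence[str]], List[float]]


def cross_model_reward(
    prompt: Sequence[str],
    response: Sequence[str],
    verifier_token_probs: TokenProbFn,
) -> float:
    """Negative mean per-token cross-entropy (nats) under the verifier.

    Higher values mean the verifier finds the response more likely.
    """
    if not response:
        raise ValueError("response must contain at least one token")
    probs = verifier_token_probs(prompt, response)
    if len(probs) != len(response):
        raise ValueError("verifier must return one probability per token")
    eps = 1e-12  # guard against log(0)
    mean_nll = -sum(math.log(max(p, eps)) for p in probs) / len(probs)
    return -mean_nll


def toy_verifier(prompt: Sequence[str], response: Sequence[str]) -> List[float]:
    """Stand-in verifier: tokens seen in the prompt get 0.5, others 0.1."""
    seen = set(prompt)
    return [0.5 if tok in seen else 0.1 for tok in response]


prompt = ["sort", "the", "list"]
on_topic = cross_model_reward(prompt, ["sort", "list"], toy_verifier)
off_topic = cross_model_reward(prompt, ["banana", "list"], toy_verifier)
print(on_topic > off_topic)
```

In an RL loop, a scalar like this would take the place of a human-labeled or self-generated reward for each sampled response. Because the score comes from a different model than the policy being trained, it is not self-referential.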