AI · Bullish · arXiv – CS AI · 6h ago · 7/10
MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution
Researchers introduced delayed per-step reward attribution, a reinforcement learning technique that lets language model agents train effectively in multi-agent strategic environments where traditional per-step rewards fail. An 8-billion-parameter open-source model trained with this method won first place in the MindGames Arena benchmark at NeurIPS 2025, outperforming substantially larger proprietary systems, including GPT-5.