AI · Bullish · arXiv – CS AI · 10h ago · 6/10
Inverting the Bellman Equation: From $Q$-Values to World Models
Researchers demonstrate that value-based reinforcement learning agents trained on diverse reward functions implicitly encode accurate world models, bridging the traditional divide between model-free and model-based RL. They introduce P-learning, a method to extract these hidden environment models from Q-values, and show that agents develop a generalizable understanding of environment dynamics that extends beyond their training objectives.
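To make the idea of "inverting" the Bellman equation concrete, here is a minimal tabular sketch. It is not the paper's P-learning method, whose details the summary does not give. It assumes a small finite MDP, exact optimal Q-values, and known reward functions, so that Q_k(s,a) = r_k(s,a) + γ Σ_s' P(s'|s,a) V_k(s') is linear in the unknown transition probabilities and can be solved by least squares once the number of reward functions is at least the number of states. All names (`q_iteration`, `recover_transitions`, the sizes `S`, `A`, `K`) are hypothetical and chosen for illustration.

```python
import numpy as np

def q_iteration(P, R, gamma, iters=2000):
    """Optimal Q-values for transitions P[s, a, s'] and rewards R[s, a]."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        V = Q.max(axis=1)        # V(s') = max_a' Q(s', a')
        Q = R + gamma * P @ V    # Bellman optimality backup
    return Q

def recover_transitions(Qs, Rs, gamma):
    """Solve (Q_k - R_k) / gamma = P V_k for P by least squares.

    Qs, Rs have shape (K, S, A): one Q-table and one reward table per task.
    Assumes K >= S and that the K value vectors span all S dimensions.
    """
    K, S, A = Qs.shape
    Vs = Qs.max(axis=2)                          # (K, S) value vectors
    targets = ((Qs - Rs) / gamma).reshape(K, S * A)
    X, *_ = np.linalg.lstsq(Vs, targets, rcond=None)  # X[s', s*A + a]
    return X.T.reshape(S, A, S)                  # P_hat[s, a, s']

# Hypothetical toy problem: 5 states, 3 actions, 20 random reward functions.
rng = np.random.default_rng(0)
S, A, K, gamma = 5, 3, 20, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))       # true dynamics, shape (S, A, S)
Rs = rng.normal(size=(K, S, A))                  # diverse reward functions
Qs = np.stack([q_iteration(P, R, gamma) for R in Rs])

P_hat = recover_transitions(Qs, Rs, gamma)
print("max abs error:", np.abs(P_hat - P).max())
```

In this idealized setting the dynamics are recoverable only because the reward functions are diverse enough for the value vectors to span the state space; with a single reward function the system would be underdetermined. Learned, approximate Q-functions in large state spaces would need a different treatment, which is presumably what the paper addresses.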