Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning
Researchers introduce SV-QD-RL, a reinforcement learning framework that generates diverse policy repertoires by conditioning actor networks on learned structural masks and pairing them with branch-specific critics. The approach demonstrates improved performance on continuous control tasks while maintaining behavioral diversity through structure-aware archive management.
SV-QD-RL addresses a fundamental challenge in quality-diversity reinforcement learning: balancing policy performance with behavioral diversity without sacrificing learning efficiency. Traditional QD-RL methods diversify policies post-hoc or rely on value information after evaluation, but this research shifts focus upstream to the policy generation mechanism itself. By coupling structural conditioning—where neural network architectures are dynamically masked—with branch-specific value learning, the framework creates a more principled approach to behavioral specialization.
The technical innovation lies in treating each candidate policy as a complete learning unit comprising an actor network, structural mask, dedicated critic, and replay buffer. This decoupling enables independent value-learning trajectories while the structural masks ensure policies explore distinct subspaces of the neural network architecture. The branch-aware archive then evaluates candidates not just on behavioral diversity and return, but also on structural footprint and value-profile consistency, creating a richer selection mechanism.
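To make the "complete learning unit" idea concrete, here is a minimal sketch of one branch that bundles an actor, a structural mask, a critic, and a replay buffer. The summary does not specify the implementation, so everything here is an illustrative assumption: the names `Branch` and `make_branch`, the linear actor and critic, and especially the randomly sampled mask, which stands in for the learned masks the framework reportedly uses.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Branch:
    """One candidate policy as a self-contained learning unit (hypothetical layout)."""
    actor_weights: list   # rows = action dims, cols = observation dims
    mask: list            # binary structural mask, same shape as actor_weights
    critic_weights: list  # assumed linear critic over concatenated (obs, action)
    # Each branch owns its buffer, so value learning stays independent per branch.
    replay: deque = field(default_factory=lambda: deque(maxlen=10_000))

    def act(self, obs):
        # Only unmasked weights contribute, so the mask selects a sub-network.
        return [
            sum(w * m * o for w, m, o in zip(w_row, m_row, obs))
            for w_row, m_row in zip(self.actor_weights, self.mask)
        ]

    def q_value(self, obs, action):
        # Branch-specific value estimate from this branch's own critic.
        features = list(obs) + list(action)
        return sum(c * f for c, f in zip(self.critic_weights, features))

    def footprint(self):
        # Structural footprint: fraction of actor weights left active by the mask.
        flat = [m for row in self.mask for m in row]
        return sum(flat) / len(flat)


def make_branch(obs_dim, act_dim, keep_prob, rng):
    # Assumption: masks are sampled at random here; the source describes learned masks.
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(obs_dim)] for _ in range(act_dim)]
    mask = [[1 if rng.random() < keep_prob else 0 for _ in range(obs_dim)]
            for _ in range(act_dim)]
    critic = [0.0] * (obs_dim + act_dim)
    return Branch(weights, mask, critic)
```

Calling `make_branch(obs_dim=3, act_dim=2, keep_prob=0.5, rng=random.Random(0))` twice yields two branches with separate masks, critics, and buffers, which is the property the paragraph above attributes to the framework.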
For the reinforcement learning research community, this work suggests that architectural diversity during training complements behavioral diversity in the final repertoire. The reported MuJoCo benchmark results indicate that the approach achieves both strong individual policy performance and meaningful behavioral variety. The ablation studies indicate that structural conditioning, critic differentiation, and memory-consistent refinement each contribute distinctly to specialization.
Looking forward, this research opens applications in multi-task control systems where switching between policies with different structural properties could enhance robustness. The framework's ability to provide selectable policies matching changing behavioral requirements suggests practical utility in adaptive control scenarios, though scaling to larger domains and demonstrating computational efficiency remain open questions.
- SV-QD-RL couples actor network structure masks with branch-specific critics to generate behaviorally diverse policy repertoires more effectively than post-hoc diversification.
- Each learning branch maintains independent value-learning trajectories through dedicated critics and replay buffers, enabling structural specialization during training.
- The branch-aware archive evaluates policies using behavioral quality, structural footprint, and value-profile information rather than performance metrics alone.
- Ablation studies confirm structural conditioning, critic differentiation, and memory-consistent refinement each contribute complementary benefits to behavioral diversity.
- Schedule-aware repertoire evaluation demonstrates learned archives provide selectable policy alternatives for tasks with changing behavioral requirements.
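The archive and selection ideas in the takeaways above can be sketched as a small grid archive in the style of MAP-Elites. This is a rough illustration under stated assumptions, not the method from the research: the class and field names, the linear scoring rule, the choice to penalize footprint and value gap, and the nearest-descriptor lookup are all hypothetical stand-ins for details the summary does not give.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    descriptor: tuple  # behavior descriptor, each component assumed to lie in [0, 1]
    ret: float         # episodic return
    footprint: float   # fraction of actor weights left active by the mask
    value_gap: float   # |critic estimate - observed return|; lower means more consistent


class BranchAwareArchive:
    """Grid archive that keeps one entry per behavior cell (hypothetical scoring)."""

    def __init__(self, bins=10, footprint_weight=0.1, gap_weight=0.1):
        self.bins = bins
        self.footprint_weight = footprint_weight  # assumed penalty weight
        self.gap_weight = gap_weight              # assumed penalty weight
        self.cells = {}

    def cell_of(self, descriptor):
        # Discretize each descriptor component; clamp 1.0 into the last bin.
        return tuple(min(int(d * self.bins), self.bins - 1) for d in descriptor)

    def score(self, entry):
        # Assumption: return minus penalties for footprint and critic inconsistency.
        return (entry.ret
                - self.footprint_weight * entry.footprint
                - self.gap_weight * entry.value_gap)

    def add(self, entry):
        # Insert if the cell is empty or the candidate outscores the incumbent.
        key = self.cell_of(entry.descriptor)
        incumbent = self.cells.get(key)
        if incumbent is None or self.score(entry) > self.score(incumbent):
            self.cells[key] = entry
            return True
        return False

    def select(self, target):
        # Pick the stored entry whose descriptor is closest to the requested behavior.
        return min(
            self.cells.values(),
            key=lambda e: sum((a - b) ** 2 for a, b in zip(e.descriptor, target)),
        )
```

Under these assumptions, a schedule with changing behavioral requirements reduces to calling `select` with a new target descriptor at each phase and running the policy attached to the returned entry.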