AI · Bullish · arXiv – CS AI · 6h ago · 6/10
BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents
Researchers introduce BiPACE, an advantage estimation method for training large language model agents that builds on existing group-based reinforcement learning approaches. The method targets the credit assignment problem by combining bisimulation-guided clustering with action-conditioned baselines, and it reports significant performance improvements on benchmark tasks without requiring additional critics or rollouts.