Baseline-Free Policy Optimization for Neural Combinatorial Optimization
Researchers propose Group Relative Policy Optimization (GRPO), a baseline-free training algorithm for neural combinatorial optimization that eliminates the need for maintaining frozen policy copies. Testing on TSP and CVRP benchmarks shows GRPO prevents training collapse seen in standard REINFORCE while achieving competitive solution quality, offering a more stable alternative for routing problem optimization.
This research addresses a fundamental stability problem in neural combinatorial optimization training. Standard REINFORCE algorithms rely on baseline policies—frozen network copies used for variance reduction—but these baselines become unreliable on harder problem instances, causing training to degrade dramatically. The researchers demonstrate that GRPO, adapted from large language model alignment research, eliminates baselines entirely by normalizing advantages within groups of sampled trajectories, shifting variance reduction to a fundamentally different mechanism.
The empirical results are striking. On TSP-100, REINFORCE experiences catastrophic failure where solution quality plummets from 9.8 to 52.1 cost immediately after warmup, with no recovery despite extended training. GRPO prevents this collapse entirely. Across matched computational budgets, GRPO reaches within 2% of POMO, a strong multi-start baseline approach, without requiring external baseline maintenance. The research also evaluates P3O, another alignment-inspired algorithm using pairwise preference comparisons, finding it competitive on simpler instances but less stable on more complex CVRP problems.
This work matters because combinatorial optimization remains computationally expensive, and algorithmic improvements directly reduce training costs and improve solution quality. The vulnerability of baseline-dependent training on harder instances has likely hindered broader adoption of neural approaches for real-world routing problems. By demonstrating a more robust training pathway, this research could accelerate deployment of learned solvers in logistics and operations research. The cross-pollination of techniques from LLM alignment to combinatorial optimization reveals how advances in one AI domain unlock solutions in another, suggesting continued convergence between these previously separate research areas.
- →GRPO eliminates baseline-dependent training entirely by normalizing advantages within trajectory groups, improving stability on hard instances
- →Standard REINFORCE collapses catastrophically on TSP-100 while GRPO maintains consistent performance across problem difficulties
- →GRPO achieves within 2% solution quality of stronger POMO baseline while requiring no external policy maintenance
- →Techniques from LLM alignment research successfully transfer to neural combinatorial optimization
- →Baseline-free training reduces computational overhead and enables more reliable large-scale deployment of learned solvers