Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
Researchers introduce OPT-BENCH, a framework for training LLMs on NP-hard optimization problems with quality-aware reinforcement learning. A Qwen2.5-7B model trained with the framework reaches a 93.1% success rate and a 46.6% quality ratio, substantially outperforming GPT-4o, and shows transfer benefits on mathematics, logic, and reasoning tasks.
The research addresses a fundamental gap in LLM evaluation: moving beyond binary correctness. Traditional reinforcement learning with verifiable rewards optimizes for right-or-wrong outcomes, but real-world applications demand solutions that are both feasible and close to optimal under resource constraints. OPT-BENCH reframes the problem with quality-aware rewards that grade how good a feasible solution is, enabling continuous improvement in solution quality rather than treating every problem as pass-fail. The distinction matters because many practical applications (routing, scheduling, resource allocation) require near-optimal answers, not merely valid ones.
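To make the contrast concrete, here is a minimal sketch of what a quality-aware reward could look like, assuming a feasibility-gated ratio against a known optimal baseline; the function name and signature are illustrative, not taken from the paper.

```python
def quality_aware_reward(solution_value: float,
                         optimal_value: float,
                         feasible: bool,
                         minimize: bool = True) -> float:
    """Illustrative reward in [0, 1]: 0 for infeasible solutions,
    otherwise the ratio of the solution's objective to the optimum.
    Assumes strictly positive objective values (e.g. a TSP tour length)."""
    if not feasible:
        # A binary, correctness-only reward would stop at this check:
        # 1.0 if feasible else 0.0, with no gradient toward optimality.
        return 0.0
    if minimize:
        # Minimization (e.g. shortest tour): reward -> 1.0 at the optimum.
        return min(1.0, optimal_value / solution_value)
    # Maximization (e.g. knapsack value): reward -> 1.0 at the optimum.
    return min(1.0, solution_value / optimal_value)
```

Under shaping like this, a feasible but mediocre solution earns partial credit, so policy updates can keep improving quality after feasibility is solved.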
The benchmark's architecture also yields insights into how LLM reasoning scales. By pairing each of 10 diverse NP-hard problems with an instance generator, a quality verifier, and an optimal baseline, the framework supports rigorous model-to-model comparison on both feasibility and solution quality. The substantial performance gap between the trained Qwen2.5-7B-Instruct and GPT-4o suggests that targeted training on constrained optimization improves reasoning in ways that general instruction tuning does not.
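A rough interface for one such task might look like the following; the class and method names are assumptions for illustration, not the released API.

```python
from dataclasses import dataclass
from typing import Any, Protocol


class OptimizationTask(Protocol):
    """Hypothetical shape of one OPT-BENCH-style problem: an instance
    generator, a feasibility verifier, and an optimal baseline."""

    def generate_instance(self, seed: int) -> Any: ...
    def is_feasible(self, instance: Any, solution: Any) -> bool: ...
    def objective(self, instance: Any, solution: Any) -> float: ...
    def optimal_value(self, instance: Any) -> float: ...


@dataclass
class EvalResult:
    feasible: bool        # does the solution satisfy all constraints?
    quality_ratio: float  # solution objective vs. optimal baseline, in [0, 1]


def evaluate(task: OptimizationTask, instance: Any, solution: Any) -> EvalResult:
    """Score a solution on both axes the paper reports: feasibility
    (success rate) and closeness to optimal (quality ratio)."""
    if not task.is_feasible(instance, solution):
        return EvalResult(feasible=False, quality_ratio=0.0)
    # Assumes a minimization objective with positive values, for brevity.
    ratio = min(1.0, task.optimal_value(instance) / task.objective(instance, solution))
    return EvalResult(feasible=True, quality_ratio=ratio)
```

Coupling a generator to a verifier and baseline in this way is what makes the training data scalable: fresh instances can be produced on demand, and every candidate solution can be scored automatically.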
Crucially, the transfer learning results indicate that optimization training builds more robust reasoning foundations. Gains on unrelated domains (mathematics, logic, knowledge tasks) suggest that working through constraint-satisfaction problems strengthens underlying problem-solving mechanisms. The finding that task diversity outweighs data quantity for generalization runs counter to common scaling assumptions and has implications for efficient training strategies.
For the AI development community, this work establishes a concrete methodology for benchmarking and improving LLM capabilities beyond existing reasoning tasks. Future work will likely extend quality-aware reward approaches to other complex reasoning domains, potentially reshaping how reasoning-specialized models are evaluated and trained.
- Quality-aware rewards improve optimization solutions by 28.8% compared to binary correctness-based rewards
- Qwen2.5-7B trained on OPT-BENCH substantially outperforms GPT-4o on NP-hard problems (93.1% vs. 29.6% success rate)
- Training on optimization tasks transfers positively to diverse reasoning benchmarks, including mathematics, logic, and instruction following
- Task diversity matters more than data quantity for improving LLM generalization in complex reasoning
- OPT-BENCH provides the first comprehensive benchmark framework combining scalable training infrastructure with rigorous evaluation of both feasibility and solution quality