ORLoopBench: Solver-in-the-Loop Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
Researchers introduce ORLoopBench, a benchmark suite that evaluates large language models on Operations Research tasks through an iterative solver-in-the-loop process rather than one-shot code generation. Models debug infeasible mathematical models by inspecting constraint conflicts and repairing the formulation. An 8B model reaches 95.3% success on LP repair tasks, compared with 92.4% for frontier APIs.
ORLoopBench addresses a critical gap in how LLMs are evaluated for Operations Research applications. Traditional benchmarks treat OR problem-solving as a single translation step from problem description to solver code, whereas practitioners work through iterative debugging cycles. The new framework formalizes that process as a Markov Decision Process: each model action triggers solver re-execution and recomputation of the Irreducible Infeasible Subsystem (IIS), a minimal set of mutually conflicting constraints, so the feedback at every step is deterministic and verifiable.
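The loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the one-variable "solver", the deletion-filter IIS routine, the `step` and `repair_loop` functions, and the rule-based `toy_policy` standing in for the LLM are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    """A one-variable constraint: x >= value ("ge") or x <= value ("le")."""
    name: str
    sense: str
    value: float


def is_feasible(bounds):
    """Toy solver: feasible iff the largest lower bound <= the smallest upper bound."""
    lo = max((b.value for b in bounds if b.sense == "ge"), default=float("-inf"))
    hi = min((b.value for b in bounds if b.sense == "le"), default=float("inf"))
    return lo <= hi


def compute_iis(bounds):
    """Deletion filter: discard every constraint the conflict does not need."""
    if is_feasible(bounds):
        return []
    core = list(bounds)
    for b in list(bounds):
        trial = [c for c in core if c is not b]
        if not is_feasible(trial):
            core = trial  # still infeasible without b, so b is not essential
    return core


def step(bounds, action):
    """One transition: apply the edit, re-solve, recompute the IIS."""
    name, new_value = action
    new_bounds = [Bound(b.name, b.sense, new_value) if b.name == name else b
                  for b in bounds]
    iis = compute_iis(new_bounds)
    return new_bounds, iis, not iis  # (next state, observation, done)


def toy_policy(iis):
    """Hypothetical stand-in for the LLM: raise the conflicting upper bound."""
    lower = next(b for b in iis if b.sense == "ge")
    upper = next(b for b in iis if b.sense == "le")
    return (upper.name, lower.value)


def repair_loop(bounds, max_steps=5):
    """Iterate until the toy solver reports feasibility; log each IIS seen."""
    history = []
    iis = compute_iis(bounds)
    for _ in range(max_steps):
        if not iis:
            break
        history.append([b.name for b in iis])
        bounds, iis, _done = step(bounds, toy_policy(iis))
    return bounds, history


model = [Bound("min_output", "ge", 10), Bound("capacity", "le", 6),
         Bound("budget", "le", 8), Bound("nonneg", "ge", 0)]
final_model, iis_history = repair_loop(model)
print(iis_history)
```

Because the conflict set is recomputed after every edit, the policy sees a different IIS on the second step than on the first. Note that this toy policy simply relaxes whatever blocks feasibility, which restores a feasible status without any guarantee that the repaired model still describes the intended problem.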
The benchmark's two-component structure—OR-Debug-Bench with 5,362 repair instances and OR-Bias-Bench evaluating operational rationality—reflects the multifaceted nature of OR problem-solving. By grounding evaluation in solver oracles rather than heuristic metrics, the researchers enable precise measurement of whether models genuinely solve problems or merely regenerate syntactically correct but semantically wrong code.
The results demonstrate tangible progress: training with solver-verified RLVR (reinforcement learning with verifiable rewards) pushed an 8B model to 95.3% success on linear programming repair, ahead of GPT-4 and other frontier models at 92.4%. More significantly, the work exposes a critical failure mode in larger models: a tendency to regenerate MILP formulations that are feasible but solve the wrong underlying problem. Because a feasible solver status alone does not reveal the error, this semantic drift is a hidden liability in production systems.
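A small example shows why a feasibility check alone cannot catch this kind of drift. The sketch below is an assumption made for illustration, not the benchmark's scoring code: it uses a hypothetical one-variable cost model and compares each repaired model's optimal objective against a known reference value.

```python
import math


def solve_min_cost(unit_cost, lower, upper):
    """Toy solver for: minimize unit_cost * x subject to lower <= x <= upper."""
    if lower > upper:
        return "INFEASIBLE", None
    x = lower if unit_cost >= 0 else upper
    return "OPTIMAL", unit_cost * x


def verify(repaired, reference_objective, tol=1e-6):
    """Report feasibility and agreement with the reference optimum separately."""
    status, objective = solve_min_cost(**repaired)
    feasible = status == "OPTIMAL"
    matches = feasible and math.isclose(objective, reference_objective, abs_tol=tol)
    return {"feasible": feasible, "matches_reference": matches}


# Broken model: demand of 10 units cannot be met with capacity 6.
broken = dict(unit_cost=3.0, lower=10.0, upper=6.0)

# Faithful repair: correct the capacity so the demand can be met.
faithful = dict(unit_cost=3.0, lower=10.0, upper=12.0)

# Drifted repair: drop the demand requirement; feasible, but a different problem.
drifted = dict(unit_cost=3.0, lower=0.0, upper=6.0)

REFERENCE_OBJECTIVE = 30.0  # assumed known optimum of the intended problem

print(verify(faithful, REFERENCE_OBJECTIVE))
print(verify(drifted, REFERENCE_OBJECTIVE))
```

Both repairs pass the feasibility check, but only the first reproduces the reference optimum. The drifted repair quietly removes the demand requirement and returns a zero-cost plan.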
The transfer of the trained model's gains from LP to MILP repair suggests the learned patterns capture generalizable diagnostic reasoning rather than dataset artifacts. For the AI-for-science and enterprise automation sectors, this work establishes a methodology for rigorous, solver-grounded evaluation that extends beyond OR to other domains requiring iterative refinement and formal verification.
- ORLoopBench enables process-level evaluation of LLM reasoning in Operations Research through solver-in-the-loop iteration rather than one-shot assessment
- An 8B model trained with solver-verified feedback outperforms frontier APIs on LP repair tasks, achieving 95.3% vs 92.4% success rate
- The benchmark exposes semantic drift in larger models that generate syntactically correct but semantically incorrect mathematical formulations
- Solver-grounded RLVR training improves diagnostic behavior and transfers knowledge from linear to mixed-integer programming problems
- This evaluation methodology establishes a replicable framework for rigorous AI assessment in formal problem domains requiring verification