OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents
Researchers introduce OR-Space, a comprehensive benchmark for evaluating large language model agents in industrial operations research workflows. Unlike existing benchmarks that focus on single-stage problem translation, OR-Space tests agents across persistent multi-artifact workspaces with three task modes—building optimization models, revising them under changing requirements, and explaining solutions—to assess real-world reliability and practical readiness.
OR-Space addresses a critical gap in AI evaluation methodology by moving beyond simplified benchmarking paradigms that fail to capture the complexity of industrial optimization work. Traditional OR benchmarks treat problem-solving as a discrete task where agents convert problem statements into mathematical models in isolation. This approach ignores the reality of professional workflows, where engineers work within persistent environments containing interdependent files, evolving requirements, and iterative refinement cycles.
The benchmark's three task modes mirror how industrial optimization work actually proceeds. Build evaluates whether agents can synthesize heterogeneous sources (business documents, structured datasets, and existing code) into solver-ready models. Revise tests a critical but understudied capability: keeping a model logically consistent while modifying it in response to new constraints or solver feedback. Explain requires grounding abstract solutions in concrete business context, which means tracing reasoning across artifacts distributed through the workspace.
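To make the Build and Revise modes concrete, here is a minimal sketch in Python. It is not OR-Space's task format or evaluation harness: the `ToyModel` class, the `build`, `revise`, and `violated` helpers, and the production-planning numbers are all hypothetical, chosen only to illustrate what "keeping a model consistent during revision" can mean on a toy problem.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Plan = Tuple[int, int]  # (units of product A, units of product B)
Check = Callable[[Plan], bool]


@dataclass
class ToyModel:
    """Hypothetical toy integer program: maximize profit under named constraints."""
    profit: Tuple[int, int]
    constraints: Dict[str, Check] = field(default_factory=dict)
    upper_bound: int = 20  # illustrative search range for each variable

    def solve(self) -> Optional[Plan]:
        """Brute-force search; acceptable for a toy, not for real instances."""
        best, best_value = None, None
        for plan in product(range(self.upper_bound + 1), repeat=2):
            if all(check(plan) for check in self.constraints.values()):
                value = self.profit[0] * plan[0] + self.profit[1] * plan[1]
                if best_value is None or value > best_value:
                    best, best_value = plan, value
        return best


def build(data: dict) -> ToyModel:
    """Build: turn (made-up) workspace data into a solvable model."""
    model = ToyModel(profit=(data["profit_a"], data["profit_b"]))
    model.constraints["machine_hours"] = (
        lambda p: 2 * p[0] + 1 * p[1] <= data["machine_hours"]
    )
    model.constraints["labor_hours"] = (
        lambda p: 1 * p[0] + 3 * p[1] <= data["labor_hours"]
    )
    return model


def revise(model: ToyModel, name: str, check: Check) -> ToyModel:
    """Revise: add a requirement while carrying over every earlier constraint."""
    revised = ToyModel(model.profit, dict(model.constraints), model.upper_bound)
    revised.constraints[name] = check
    return revised


def violated(model: ToyModel, plan: Plan) -> List[str]:
    """Names of the constraints in `model` that `plan` breaks."""
    return [name for name, check in model.constraints.items() if not check(plan)]


# Assumed data, not taken from the benchmark.
data = {"profit_a": 3, "profit_b": 2, "machine_hours": 20, "labor_hours": 30}
base = build(data)
base_plan = base.solve()

# New requirement arrives: at most 4 units of product A.
revised = revise(base, "max_a", lambda p: p[0] <= 4)
revised_plan = revised.solve()

print("base plan:", base_plan)
print("revised plan:", revised_plan)
print("original constraints broken by revised plan:", violated(base, revised_plan))
```

The final check is the point of the sketch: a revision that silently dropped `machine_hours` or `labor_hours` would show up as a non-empty list there, which is one simple way to picture the consistency property that the Revise mode is described as testing.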
This work has significant implications for how the maturity of LLM agents is evaluated. Current enterprise AI adoption focuses heavily on simpler text-generation tasks, while reliable optimization modeling remains a frontier capability. By introducing workspace persistence and lifecycle complexity, OR-Space sets more realistic success criteria, which should be a better predictor of production deployment viability. The benchmark lets researchers identify failure modes specific to industrial contexts, such as losing constraint validity during revisions or giving explanations disconnected from actual business implications.
For the AI industry, OR-Space serves as both a research tool and a development roadmap. It clarifies which agent capabilities remain underdeveloped and provides structured evaluation protocols for measuring progress. Future work will likely extend similar lifecycle-oriented benchmarking to other domains that require persistent state management and multi-stage reasoning.
- OR-Space introduces persistent workspace benchmarking that better reflects real industrial optimization workflows than existing single-stage problem translation tasks.
- The benchmark evaluates agents across three task modes (Build, Revise, and Explain), testing reliability and practical readiness beyond one-shot problem translation.
- Workspace persistence and interdependent files create realistic constraints that reveal failure modes in current LLM agents relevant to enterprise deployment.
- The benchmark enables systematic evaluation of critical capabilities like maintaining logical consistency during model revisions under changing requirements.
- OR-Space positions lifecycle-oriented evaluation as a new standard for assessing LLM agents in professional optimization and problem-solving domains.