From Question Answering to Task Completion: A Survey on Agent System and Harness Design
A comprehensive survey examines LLM-based agent systems through a model-harness lens, arguing that agent performance depends on the interaction between foundation models, execution infrastructure, and task structure rather than model capabilities alone. The research identifies six core runtime responsibilities and maps how different harness configurations affect long-horizon task completion, efficiency, and reliability.
The survey shifts how LLM-based agents are analyzed, moving beyond the common assumption that model scaling alone drives performance improvements. It introduces a dual-lens framework that separates foundation models from execution harnesses and treats agent quality as emerging from their interaction rather than residing in either component alone. This distinction matters because it clarifies where engineering effort should go: practitioners often optimize models while neglecting runtime infrastructure, potentially missing substantial performance gains.
The evolution from prompt engineering through workflows to agent-native training reflects a maturing AI infrastructure landscape. Earlier approaches treated agents as passive models with bolted-on tools, whereas this survey argues that runtime design (observation, context management, control flow, action execution, state maintenance, and verification) fundamentally shapes task completion rates. By relating harness configurations to specific task properties, the research gives practitioners a design framework for building production systems.
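To make the six runtime responsibilities concrete, the following is a minimal sketch of a harness loop wrapped around a stand-in policy. It is not taken from the survey: the `Harness` class, the `toy_policy` function, the `add` tool, and the counting task are all hypothetical names chosen for illustration, and a real system would call a foundation model where `toy_policy` is used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; none of these names come from the survey.
Action = Tuple[str, int]                 # (tool name, argument)
Policy = Callable[[List[str]], Action]   # stands in for the foundation model


@dataclass
class Harness:
    policy: Policy
    tools: Dict[str, Callable[[int, int], int]]
    goal: int
    max_steps: int = 10        # control: step budget for the episode
    context_window: int = 4    # context: how many recent observations the policy sees
    value: int = 0             # state: the task state the harness maintains
    history: List[str] = field(default_factory=list)

    def observe(self) -> str:
        # Observation: describe the current task state as text.
        return f"value={self.value} goal={self.goal}"

    def build_context(self) -> List[str]:
        # Context management: keep only the most recent observations.
        return self.history[-self.context_window:]

    def act(self, action: Action) -> None:
        # Action execution: dispatch to a tool, then update the state.
        name, arg = action
        self.value = self.tools[name](self.value, arg)

    def verify(self) -> bool:
        # Verification: check the state against the goal, independent of the policy.
        return self.value == self.goal

    def run(self) -> Tuple[bool, int]:
        # Control flow: loop until verified or the step budget is exhausted.
        for step in range(1, self.max_steps + 1):
            self.history.append(self.observe())
            action = self.policy(self.build_context())
            self.act(action)
            if self.verify():
                return True, step
        return False, self.max_steps


def toy_policy(context: List[str]) -> Action:
    # Deterministic stand-in for a model: read the latest observation
    # and close the gap to the goal by at most 3 per step.
    fields = dict(part.split("=") for part in context[-1].split())
    gap = int(fields["goal"]) - int(fields["value"])
    return ("add", min(gap, 3))


tools = {"add": lambda current, arg: current + arg}
ok, steps = Harness(policy=toy_policy, tools=tools, goal=7).run()
print(ok, steps)  # prints: True 3
```

With the policy held fixed, changing only harness settings changes the outcome: the same `toy_policy` with `max_steps=2` fails the task, which is a toy version of the claim that completion depends on the model-harness pairing rather than on the model alone.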
For developers and organizations deploying LLM agents, this work offers practical guidance on optimizing execution layers rather than exclusively chasing larger models. The identified open challenges—value-aware evaluation, safety assurance, harness generalization, and model-harness co-evolution—indicate the field remains early-stage with substantial optimization potential. The survey's systematic decomposition enables more rigorous benchmarking and comparison across agent systems, potentially accelerating standardization in agent engineering practices.
- Agent performance bottlenecks may reside in execution harness design rather than foundation model capability alone.
- Six core runtime responsibilities (observation, context, control, action, state, and verification) directly influence long-horizon task completion and efficiency.
- Task-specific properties and domain constraints should drive harness configuration choices rather than one-size-fits-all approaches.
- Model-harness co-evolution represents the emerging paradigm beyond traditional prompt engineering and workflow-based agent design.
- Current evaluation practices lack value-aware metrics necessary for assessing agent quality across success, efficiency, safety, and generalization dimensions.