How can we assess human-agent interactions? Case studies in software agent design
Researchers propose PULSE, a framework for evaluating human-agent interactions in software engineering rather than relying solely on automated benchmarks. The framework combines human feedback with machine learning predictions to assess user satisfaction. Applied to a deployment with 15,000 users, it reveals significant gaps between benchmark performance and real-world agent effectiveness.
Current evaluation methods for LLM-powered agents prioritize automation metrics while overlooking the collaborative dynamics that define practical deployment. PULSE addresses this mismatch by centering human satisfaction in agent assessment, acknowledging that real-world software engineering remains deeply interactive rather than fully autonomous. This shift reflects growing recognition that benchmark scores, often optimized for controlled environments, poorly predict actual user experience and agent utility.
The study's deployment across 15,000 developers using OpenHands shows that agent design decisions significantly affect satisfaction, sometimes in unpredictable ways, and that real-world performance can contradict benchmark rankings. When comparing claude-sonnet-4 and gpt-5, the researchers found that in-the-wild results were anti-correlated with benchmark performance, which may point to benchmark gaming or domain-specific limitations. PULSE also reduces confidence intervals by 40% compared to standard A/B testing, meaning the same amount of human feedback yields tighter estimates of user satisfaction.
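The summary does not spell out how PULSE combines human feedback with ML predictions, so the following is only a minimal sketch of one generic way such a combination can narrow a confidence interval: estimate mean satisfaction from model predictions on many unrated sessions, then correct for model bias using the smaller set of human-rated sessions. It is not the authors' method. All function names, the synthetic data, and the noise levels are assumptions made for illustration.

```python
import math
import random
import statistics

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def classical_ci(labels):
    """95% CI for mean satisfaction using only human-rated sessions."""
    mean = statistics.fmean(labels)
    half = Z_95 * math.sqrt(statistics.variance(labels) / len(labels))
    return mean - half, mean + half


def prediction_augmented_ci(labels, preds_labeled, preds_unlabeled):
    """95% CI that adds model predictions on unrated sessions.

    Point estimate: mean prediction on unrated sessions, plus the mean
    residual (human rating minus prediction) on rated sessions.
    """
    residuals = [y - f for y, f in zip(labels, preds_labeled)]
    mean = statistics.fmean(preds_unlabeled) + statistics.fmean(residuals)
    var = (statistics.variance(preds_unlabeled) / len(preds_unlabeled)
           + statistics.variance(residuals) / len(residuals))
    half = Z_95 * math.sqrt(var)
    return mean - half, mean + half


def simulate(n_labeled=200, n_unlabeled=20000, seed=0):
    """Synthetic sessions (assumed data, not from the study)."""
    rng = random.Random(seed)

    def session():
        latent = rng.uniform(1.0, 5.0)            # true session quality
        rating = latent + rng.gauss(0.0, 0.5)     # noisy human rating
        predicted = latent + rng.gauss(0.0, 0.5)  # noisy model prediction
        return rating, predicted

    labeled = [session() for _ in range(n_labeled)]
    preds_unlabeled = [session()[1] for _ in range(n_unlabeled)]
    labels = [rating for rating, _ in labeled]
    preds_labeled = [predicted for _, predicted in labeled]
    return labels, preds_labeled, preds_unlabeled


if __name__ == "__main__":
    labels, preds_labeled, preds_unlabeled = simulate()
    print("human ratings only:  ", classical_ci(labels))
    print("ratings + predictions:",
          prediction_augmented_ci(labels, preds_labeled, preds_unlabeled))
```

In this sketch the interval narrows only to the extent that the predictions track the human ratings: the residual variance replaces the raw rating variance in the dominant term. The size of the reduction here depends entirely on the assumed noise levels and says nothing about the 40% figure reported for PULSE.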
For the AI development community, this research undermines the current benchmark-driven optimization paradigm that dominates model selection and agent design. Developers and enterprises may need to reconsider their reliance on published benchmarks when selecting AI agents for production use. The framework provides actionable guidance for future agent evaluation, emphasizing user-centric metrics that matter for real deployment scenarios. As AI agents proliferate across technical domains, rigorous human-centered evaluation becomes essential for identifying genuinely productive solutions rather than benchmark-optimized systems that underperform in practice.
- PULSE framework reduces evaluation confidence intervals by 40% versus traditional A/B testing through human feedback combined with ML predictions
- Real-world agent performance frequently contradicts published benchmark rankings, revealing significant limitations in current evaluation methods
- Human-agent collaboration remains central to software engineering use cases, not full automation as benchmarks typically assume
- Study of 15,000 developers identifies specific agent design decisions that substantially impact developer satisfaction rates
- Framework demonstrates substantial gaps between in-the-wild results and benchmark performance, requiring reconsideration of model selection criteria