🧠 AI · Neutral · Importance 6/10

How can we assess human-agent interactions? Case studies in software agent design

arXiv – CS AI | Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig
🤖 AI Summary

Researchers propose PULSE, a framework for evaluating human-agent interactions in software engineering rather than relying solely on automated benchmarks. The framework combines human feedback with machine-learning predictions of user satisfaction, and a deployment covering 15,000 users reveals significant gaps between benchmark performance and real-world agent effectiveness.

Analysis

Current evaluation methods for LLM-powered agents prioritize automation metrics while overlooking the collaborative dynamics that define practical deployment. PULSE addresses this mismatch by centering human satisfaction in agent assessment, acknowledging that real-world software engineering remains deeply interactive rather than fully autonomous. This shift reflects growing recognition that benchmark scores, which are often optimized for controlled environments, poorly predict actual user experience and agent utility.

The study's deployment across 15,000 developers using OpenHands shows that agent design decisions significantly affect satisfaction, sometimes in ways that benchmark rankings do not predict. The researchers report an anti-correlation between benchmark and in-the-wild results when comparing claude-sonnet-4 and gpt-5: the relative standing of the two models in controlled tests ran opposite to their standing in practice, which may point to benchmark gaming or domain-specific limitations. PULSE also reduces confidence intervals by 40% compared with standard A/B testing, which indicates a more statistically robust evaluation methodology.
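To illustrate why combining human ratings with model predictions can tighten a confidence interval, here is a minimal sketch of one standard approach, a difference (prediction-powered) estimator. This summary does not say which estimator PULSE uses, so the choice of estimator is an assumption; the function names, the sample sizes, and the simulated data are all hypothetical and carry no results from the paper.

```python
import math
import random
from statistics import mean, variance

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def human_only(ratings):
    """Estimate the satisfaction rate from human ratings alone."""
    half_width = Z_95 * math.sqrt(variance(ratings) / len(ratings))
    return mean(ratings), half_width


def human_plus_predictions(ratings, preds_rated, preds_unrated):
    """Average the predictions on unrated sessions, then correct them by the
    mean prediction error observed on the rated sessions (assumed estimator)."""
    errors = [y - p for y, p in zip(ratings, preds_rated)]
    estimate = mean(preds_unrated) + mean(errors)
    half_width = Z_95 * math.sqrt(
        variance(preds_unrated) / len(preds_unrated)
        + variance(errors) / len(errors)
    )
    return estimate, half_width


def simulate(n_rated=300, n_unrated=20_000, seed=0):
    """Synthetic sessions only; none of these numbers come from the paper."""
    rng = random.Random(seed)

    def session():
        pred = rng.betavariate(0.5, 0.5)  # hypothetical predicted satisfaction
        rating = 1.0 if rng.random() < pred else 0.0  # simulated thumbs up/down
        return rating, pred

    rated = [session() for _ in range(n_rated)]
    preds_unrated = [session()[1] for _ in range(n_unrated)]
    ratings = [r for r, _ in rated]
    preds_rated = [p for _, p in rated]
    return {
        "human_only": human_only(ratings),
        "combined": human_plus_predictions(ratings, preds_rated, preds_unrated),
    }


if __name__ == "__main__":
    for name, (estimate, half_width) in simulate().items():
        print(f"{name}: {estimate:.3f} +/- {half_width:.3f}")
```

The interval narrows only to the extent that the predictions track the human ratings: the combined variance depends on the prediction errors rather than on the raw ratings, so a poor predictor yields little or no gain.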

For the AI development community, this research undermines the current benchmark-driven optimization paradigm that dominates model selection and agent design. Developers and enterprises may need to reconsider their reliance on published benchmarks when selecting AI agents for production use. The framework provides actionable guidance for future agent evaluation, emphasizing user-centric metrics that matter for real deployment scenarios. As AI agents proliferate across technical domains, rigorous human-centered evaluation becomes essential for identifying genuinely productive solutions rather than benchmark-optimized systems that underperform in practice.

Key Takeaways
  • PULSE framework reduces evaluation confidence intervals by 40% versus traditional A/B testing through human feedback combined with ML predictions
  • Real-world agent performance frequently contradicts published benchmark rankings, revealing significant limitations in current evaluation methods
  • Human-agent collaboration remains central to software engineering use cases, rather than the full automation that benchmarks typically assume
  • Study of 15,000 developers identifies specific agent design decisions that substantially impact developer satisfaction rates
  • Framework demonstrates substantial gaps between in-the-wild results and benchmark performance, requiring reconsideration of model selection criteria
Read Original →via arXiv – CS AI