AI · Neutral · arXiv – CS AI · 7h ago · 6/10
How can we assess human-agent interactions? Case studies in software agent design
Researchers propose PULSE, a framework for evaluating human-agent interactions in software engineering that does not rely solely on automated benchmarks. The framework combines human feedback with machine-learning predictions to assess user satisfaction. Applied across 15,000 users, it reveals significant gaps between benchmark performance and real-world agent effectiveness.
🧠 GPT-5