IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows
Researchers introduce IHBench, a benchmark for evaluating how voice agents recover from user interruptions while executing multi-step workflows in enterprise settings. Testing 27 model configurations reveals that closed-weight models (OpenAI, Google) significantly outperform open-weight alternatives in handling interruptions: they achieve higher task fulfillment, and their performance degrades 3.3x more slowly as conversations lengthen.
IHBench addresses a critical gap in voice agent evaluation by measuring post-interruption recovery rather than just interruption detection. Traditional speech benchmarks focus on barge-in timing and turn-taking, but fail to assess whether agents resume workflows correctly, acknowledge user interjections, or avoid redundant content delivery. This matters because real-world voice deployments in customer service, healthcare, and account management face constant interruptions, making recovery quality essential for user experience and task completion.
The benchmark's methodology is rigorous: it injects six interruption types at controlled utterance points across 10 enterprise domains and scores the results with domain-specific evaluation rubrics. The results reveal stark performance stratification. Closed-weight models demonstrate substantially better recovery across three dimensions: higher task fulfillment rates, slower performance degradation as conversations lengthen, and consistent parity between audio and text input. Open-weight models lose ground on all three dimensions, suggesting architectural or training advantages in proprietary systems that the community has yet to replicate.
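To make the idea of injecting an interruption at a controlled utterance point concrete, here is a minimal Python sketch. It is not IHBench's implementation: the `Turn` type, the `inject_interruption` function, the `cut_fraction` parameter, and the sample dialogue are all hypothetical, and the six interruption types are not named in this summary, so the interruption is passed in as free text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Turn:
    """One utterance in a scripted conversation (hypothetical type)."""
    speaker: str  # "agent" or "user"
    text: str


def inject_interruption(
    turns: List[Turn],
    agent_utterance_index: int,
    interruption_text: str,
    cut_fraction: float = 0.5,
) -> List[Turn]:
    """Cut off the chosen agent utterance and append a user interruption.

    Returns the conversation prefix that an agent under test would have to
    continue from. Assumption: an interruption is modeled by truncating the
    agent's words at `cut_fraction` of the utterance.
    """
    if not 0.0 < cut_fraction < 1.0:
        raise ValueError("cut_fraction must be strictly between 0 and 1")

    prefix: List[Turn] = []
    agent_seen = -1
    for turn in turns:
        if turn.speaker == "agent":
            agent_seen += 1
            if agent_seen == agent_utterance_index:
                words = turn.text.split()
                keep = max(1, int(len(words) * cut_fraction))
                # The agent is cut off mid-utterance...
                prefix.append(Turn("agent", " ".join(words[:keep])))
                # ...and the user's interjection follows immediately.
                prefix.append(Turn("user", interruption_text))
                return prefix
        prefix.append(turn)

    raise IndexError("agent_utterance_index exceeds the number of agent turns")


# Illustrative dialogue, invented for this sketch.
script = [
    Turn("agent", "Hello, I can help you reset your password today."),
    Turn("user", "Okay."),
    Turn("agent", "First, please confirm the email address on your account."),
]
prefix = inject_interruption(script, 1, "Wait, which account do you mean?")
```

The returned prefix ends with the user's interjection, so whatever the agent says next can be scored on the behaviors the benchmark targets: acknowledging the interjection, resuming the workflow at the right step, and not repeating content already delivered.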
For the AI industry, these findings validate concerns about the gap between open and closed models for enterprise applications. The cross-benchmark analysis, which shows recovery quality to be a distinct capability axis, indicates that current benchmarks capture an incomplete picture of performance. This has implications for model selection in production environments where interruption handling directly impacts customer satisfaction and operational efficiency.
Future work should focus on identifying specific architectural features or training methodologies that enable superior recovery in closed-weight models, allowing the open-source community to close this gap. As voice agents proliferate in enterprise workflows, recovery quality will increasingly differentiate production-ready systems from academic demonstrations.
- IHBench measures post-interruption recovery in voice agents, revealing this capability is largely distinct from other speech understanding benchmarks
- Closed-weight models from OpenAI and Google consistently outperform open-weight alternatives across task fulfillment and recovery robustness
- Performance degrades 3.3x more slowly for proprietary models as conversation length increases, suggesting superior context management
- Open-weight models show significant audio-to-text modality gaps while closed-weight models maintain parity, indicating fundamental architectural differences
- Recovery quality is now validated as a critical evaluation axis for production voice agent deployment in enterprise settings