VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

arXiv – CS AI | Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas Hain | March 11, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce VoxEmo, a benchmark for evaluating speech large language models (Speech LLMs) on emotion recognition across 35 emotion corpora and 15 languages. It addresses the evaluation challenges that arise when emotion recognition shifts from closed-set classification to open text generation, and it introduces protocols designed to align better with subjective human emotion perception.
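To make the open-generation challenge concrete: a Speech LLM answers in free text, so an evaluator has to map each response back to a fixed label set before it can be scored. The following is a minimal sketch of one way to do that, not the method used in VoxEmo. The label set, the synonym table, and the function name `parse_emotion` are all hypothetical.

```python
import re

# Hypothetical closed label set and synonym table (assumptions, not from VoxEmo).
LABELS = ("angry", "happy", "sad", "neutral")
SYNONYMS = {
    "anger": "angry", "furious": "angry",
    "happiness": "happy", "joy": "happy", "joyful": "happy",
    "sadness": "sad", "unhappy": "sad",
    "calm": "neutral",
}

def parse_emotion(response, labels=LABELS, synonyms=SYNONYMS):
    """Return the first known label mentioned in a free-text response, or None."""
    for token in re.findall(r"[a-z]+", response.lower()):
        if token in labels:
            return token
        if token in synonyms:
            return synonyms[token]
    return None

print(parse_emotion("The speaker sounds furious."))
print(parse_emotion("Emotion: Happy"))
print(parse_emotion("I cannot tell."))
```

A keyword matcher like this is brittle: it ignores negation ("not happy" would still map to "happy") and silently drops responses outside the synonym table, which is part of why scoring generated text is harder than scoring a classifier.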

Key Takeaways

→VoxEmo benchmark covers 35 emotion corpora across 15 languages for testing Speech LLMs on emotion recognition.
→The benchmark includes standardized toolkits with prompts of varying complexity, from classification to paralinguistic reasoning.
→A distribution-aware soft-label protocol and prompt-ensemble strategy are introduced to emulate human annotator disagreement.
→Zero-shot Speech LLMs show lower hard-label accuracy than supervised baselines but align better with the distribution of human subjective judgments.
→The research addresses evaluation challenges when shifting from closed-set classification to open text generation in emotion recognition.
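
The contrast between hard-label accuracy and distribution alignment can be illustrated with a small sketch. It treats the labels produced under several prompts as votes, turns both the model votes and the human annotator votes into distributions, and compares them. This is an illustration under stated assumptions, not the VoxEmo protocol: the summary does not say which distance the benchmark uses, so Jensen-Shannon divergence is an assumed choice here, and the label set and vote data are invented for the example.

```python
import math
from collections import Counter

LABELS = ("angry", "happy", "sad", "neutral")  # hypothetical label set

def to_distribution(votes, labels=LABELS):
    """Turn a list of label votes into a probability distribution over labels."""
    counts = Counter(v for v in votes if v in labels)
    total = sum(counts.values())
    if total == 0:
        # No usable votes: fall back to a uniform distribution.
        return [1.0 / len(labels)] * len(labels)
    return [counts[label] / total for label in labels]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Invented example: five human annotators, four prompts for the same utterance.
human_votes = ["sad", "sad", "neutral", "sad", "neutral"]
model_votes = ["sad", "neutral", "sad", "sad"]  # one parsed label per prompt

human = to_distribution(human_votes)
model = to_distribution(model_votes)

# Hard-label view: does the model's top label match the human majority label?
hard_correct = model.index(max(model)) == human.index(max(human))
# Soft-label view: how far apart are the two distributions?
divergence = js_divergence(human, model)

print(hard_correct, round(divergence, 4))
```

The hard-label check collapses each side to a single majority label, while the divergence still registers how much of the annotators' disagreement the prompt ensemble reproduces. A model can lose on the first measure and still do well on the second.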