AI · Neutral · arXiv – CS AI · 7h ago · 6/10
VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio
Researchers introduce VocSim, a training-free benchmark that evaluates how well audio embeddings identify content across diverse sound sources, with no parameter updates or labeled data. Tested on 125k clips spanning speech, animal vocalizations, and environmental sounds, frozen Whisper embeddings perform well overall, but significant generalization gaps remain for low-resource and non-English languages, a finding with implications for audio AI model development.