AI · Bullish · arXiv – CS AI · 5h ago · 6/10
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
Researchers introduce JASTIN, an instruction-driven framework that combines frozen audio encoders with fine-tuned LLMs to evaluate the output of generative audio models in a zero-shot setting, guided by natural language instructions. The approach achieves state-of-the-art correlation with human ratings across speech, sound, and music evaluation tasks without task-specific retraining.
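
To make "correlation with human ratings" concrete, here is a minimal sketch of how agreement between an automatic evaluator and human listeners is often measured. The summary does not say which coefficient the paper reports, so Spearman rank correlation is an assumption here, and the ratings and scores below are made-up illustration data, not results from JASTIN.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: mean human ratings and evaluator scores for five clips.
human_ratings = [4.5, 3.0, 2.0, 4.0, 1.5]
model_scores = [4.2, 3.1, 2.5, 3.9, 1.0]

print(spearman(human_ratings, model_scores))
```

Because the evaluator in this toy example orders the five clips exactly as the human raters do, the coefficient is 1.0; any disagreement in ordering would lower it.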