AI · Bullish · arXiv – CS AI · 5d ago · 7/10
🧠Researchers introduce Listen-Write-Speak (LWS), a new paradigm for speech-based large language models that enables simultaneous text output alongside spoken responses. The approach leverages a single autoregressive LLM with a Token Schema to unlock text-native capabilities like code generation and structured analysis in real-time conversational AI without architectural modifications.
AI · Neutral · arXiv – CS AI · Jun 2 · 7/10
🧠Researchers introduce PolySpeech-100, a comprehensive benchmark evaluating speech understanding across 110 languages and dialects, revealing that end-to-end speech-LLMs outperform traditional ASR+LLM systems on dialects but struggle with low-resource languages. The study of 22 state-of-the-art models exposes significant performance gaps and shows that chain-of-thought prompting often degrades speech comprehension, highlighting critical modality alignment issues in current AI architectures.
AI · Bullish · arXiv – CS AI · 5d ago · 6/10
🧠LEAF (Low-rank Exploration with Adaptive Forking) is a tree-based reinforcement learning method for training speech-aware large language models. It improves credit assignment by identifying shared response prefixes and assigning rewards at the span level rather than uniformly across tokens. The approach outperforms existing GRPO-style methods without additional computational overhead, enabling smaller models to match or exceed larger baselines.
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers introduce SURE, a unified experimentation framework that standardizes evaluation metrics and training pipelines for speech understanding models, addressing reproducibility challenges that have hindered fair comparison of speech foundation models and Speech LLMs across different deployment scenarios.
AI · Bullish · arXiv – CS AI · Mar 27 · 6/10
🧠Researchers propose X-OPD, a Cross-Modal On-Policy Distillation framework to improve Speech Large Language Models by aligning them with text-based counterparts. The method uses token-level feedback from teacher models to bridge performance gaps in end-to-end speech systems while preserving inherent capabilities.
AI · Bearish · arXiv – CS AI · Mar 9 · 6/10
🧠Research finds that speech LLMs do not perform significantly better than traditional ASR→LLM pipelines in most deployed scenarios. The study shows that speech LLMs essentially function as expensive cascades and perform worse under noisy conditions, with their advantages reversing by up to 7.6% at 0 dB noise levels.
AI · Neutral · arXiv – CS AI · Mar 11 · 4/10
🧠Researchers introduce VoxEmo, a comprehensive benchmark for evaluating Speech Large Language Models on emotion recognition tasks across 35 emotion corpora and 15 languages. The benchmark addresses evaluation challenges in open text generation and introduces novel protocols that better align with human subjective emotion perception.