CORTIS: Text-Only Adaptation of Spoken Language Models for Task-Oriented Voice Agents
Researchers introduce CORTIS, a framework that enables spoken language models (SLMs) to handle task-oriented voice agent functions using only text-based training data, eliminating the need for expensive paired speech-target annotations. The approach matches or outperforms traditional ASR-LLM cascades while demonstrating superior robustness under acoustic degradation.
CORTIS addresses a critical bottleneck in deploying task-oriented voice agents: the expense and complexity of collecting paired speech-target training data. By enabling SLMs to learn task semantics from text-only supervision and apply them to speech at inference time, the framework reduces development friction for voice AI applications. This approach represents a pragmatic middle ground between cascaded ASR-LLM systems, which suffer from transcription error propagation in noisy environments, and fully speech-supervised models, which demand costly annotation efforts.
The approach rests on the observation that modern language models transfer knowledge across modalities more effectively than previously assumed. Rather than requiring explicit speech-task pairs, CORTIS relies on the speech-text alignment that SLMs already learn during pre-training and fine-tunes them on structured task outputs presented in text form. This fits a broader trend in multimodal AI, where a single model handles multiple input types through unified representations.
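To make the idea of text-only supervision concrete, the sketch below builds one fine-tuning example that pairs a text utterance with a structured task output serialized as text. It is an illustration only: the summary does not describe CORTIS's actual data format, so the `TextOnlyExample` class, the `build_example` helper, the prompt template, and the intent/slot schema are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class TextOnlyExample:
    """One supervised example containing no audio (hypothetical schema)."""
    prompt: str  # text input that stands in for the spoken request during training
    target: str  # structured task output, serialized as text


def build_example(utterance_text: str, intent: str, slots: dict) -> TextOnlyExample:
    # Assumption: the task output is an intent plus slot values encoded as JSON.
    target = json.dumps({"intent": intent, "slots": slots}, sort_keys=True)
    # Assumption: a simple prompt template; the real template is not given in the source.
    prompt = f"User: {utterance_text}\nAgent action:"
    return TextOnlyExample(prompt=prompt, target=target)


# A tiny text-only training set. At inference time the same model would
# receive speech in place of the text utterance.
examples = [
    build_example(
        "set the thermostat to 70 degrees",
        "set_temperature",
        {"value": 70, "unit": "fahrenheit"},
    ),
]

for ex in examples:
    print(ex.prompt)
    print(ex.target)
```

The point of the sketch is that nothing in the training record refers to audio. Any transfer to spoken input depends on the alignment the SLM acquired during pre-training.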
For the voice AI industry, CORTIS potentially lowers barriers to entry for smaller teams and organizations building task-oriented agents. Companies developing voice-controlled systems for customer service, smart home environments, or enterprise applications could significantly reduce training data collection costs. The demonstrated resilience under acoustic degradation is particularly valuable for real-world deployments where background noise, accents, and speech variations remain persistent challenges.
Matching or exceeding ASR-LLM baselines while holding up better under acoustic degradation suggests that SLMs may become the preferred architecture for task-oriented voice work. Future development will likely focus on extending text-only adaptation to more specialized domains and on improving edge cases where preserving the meaning of the spoken request is critical.
- CORTIS enables task-oriented voice agents to be trained with text-only supervision, eliminating the need for expensive paired speech-target annotations.
- The framework matches or outperforms ASR-LLM cascades while offering superior robustness under noisy acoustic conditions.
- Text-only fine-tuning of SLMs demonstrates effective knowledge transfer for task semantics without explicit speech-task pairs.
- The approach could significantly reduce development costs for voice AI applications by removing the bottleneck of collecting paired speech-target data.
- Results suggest spoken language models may be preferable to cascaded ASR-LLM systems for task-oriented voice agent deployment.