Researchers present FormalASR, a family of compact end-to-end models that convert spoken Chinese directly into formal written text, eliminating the need for post-processing with large language models. Built on newly created datasets and fine-tuned versions of Qwen3-ASR, the models reduce character error rate (CER) by up to 37.4% relative to verbatim ASR baselines while enabling lightweight on-device deployment.
FormalASR targets the verbatim transcription problem in automatic speech recognition. Traditional ASR systems capture speech exactly as spoken, including disfluencies, filler words, and informal grammatical structures, which creates friction in downstream applications that require formal text. The conventional workaround, chaining ASR with LLM post-processing, adds computational overhead, latency, and deployment constraints that limit real-world applicability, especially on-device.
The technical approach reflects a broader trend toward specialized models. Rather than relying on expensive multi-stage pipelines, the researchers created two task-specific datasets (WenetSpeech-Formal and Speechio-Formal) through LLM-assisted rewriting and quality filtering, then fine-tuned Qwen3-ASR base models at two scales, 0.6B and 1.7B parameters. This methodology shows how careful dataset construction can let smaller models handle a task that previously required a separate LLM stage. Both model sizes are practical choices under deployment constraints and still deliver meaningful gains: up to 37.4% relative CER reduction, alongside improvements in semantic metrics such as ROUGE-L and BERTScore.
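To make the "relative CER reduction" figure concrete, here is a minimal Python sketch of how character error rate and a relative reduction are typically computed. It assumes the standard definition of CER (character-level edit distance divided by reference length); the function names and the example sentences are hypothetical illustrations and are not taken from the FormalASR datasets or evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline` (0.5 means 50%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical example: the reference is the formal written form.
formal_ref = "我明天去北京"
verbatim_hyp = "嗯我明天去北京吧"  # verbatim output keeps filler words
formal_hyp = "我明天到北京"        # formal output with one wrong character

baseline_cer = cer(formal_ref, verbatim_hyp)
improved_cer = cer(formal_ref, formal_hyp)
print(f"baseline CER: {baseline_cer:.3f}")
print(f"improved CER: {improved_cer:.3f}")
print(f"relative reduction: {relative_reduction(baseline_cer, improved_cer):.1%}")
```

Because the reduction is relative, a figure such as 37.4% describes the share of the baseline's errors that were removed, not an absolute drop in percentage points.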
For the AI development community, this work suggests that specialized end-to-end models can outperform pipelines composed of general-purpose systems while reducing infrastructure requirements. This matters for commercial ASR applications in Chinese markets, where formal text output directly supports legal documentation, business communication, and content generation. The on-device capability addresses privacy and latency concerns that are critical for enterprise adoption.
Future developments will likely focus on expanding this approach to other languages and speech-to-text domains where verbatim output creates friction. The methodological insights around dataset construction and model specialization apply broadly across multimodal AI systems.
- FormalASR achieves up to 37.4% relative character error rate (CER) reduction over verbatim ASR baselines by transforming spoken Chinese into formal written text in a single end-to-end model.
- Two compact models (0.6B and 1.7B parameters) eliminate the need for post-processing LLMs, enabling lightweight on-device deployment with lower latency and memory requirements.
- Custom datasets created through LLM-based rewriting and quality filtering prove effective for training specialized speech-to-formal-text systems.
- Semantic metrics including ROUGE-L and BERTScore improve alongside error metrics, indicating that the models preserve meaning through the spoken-to-formal transformation rather than only improving transcription accuracy.
- The approach demonstrates that smaller specialized models can outperform larger pipelines built from general-purpose components on specific language tasks.
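
For readers unfamiliar with ROUGE-L, one of the semantic metrics mentioned above, the following minimal Python sketch shows the standard definition: an F-measure over the longest common subsequence (LCS) of reference and hypothesis. Treating each Chinese character as a token is an assumption made here for illustration; the source does not state how FormalASR tokenizes text for this metric, and the function names and example strings are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(ref: str, hyp: str, beta: float = 1.0) -> float:
    """Character-level ROUGE-L F-score (beta=1 weights precision and recall equally)."""
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical example: one character differs between reference and hypothesis.
score = rouge_l("我明天去北京", "我明天到北京")
print(f"ROUGE-L: {score:.3f}")
```

Unlike CER, which counts every edit, ROUGE-L rewards hypotheses that keep the reference's content in the same order, which is why it is often reported alongside error rates when the output is a rewrite and not a literal transcript.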