Linguistically Augmented Audio Speech Data (LinguAS)
Researchers introduce LinguAS, a dataset of 800+ audio samples annotated with linguistic features to improve detection of deepfaked and spoofed speech. Models trained on this linguistically augmented data significantly outperform existing deepfake detection baselines, addressing a critical gap in audio forensics.
The proliferation of synthetic audio and deepfake technology has created an urgent need for robust detection mechanisms. Traditional audio forensics relies primarily on frame-level acoustic analysis, which misses linguistic patterns that distinguish genuine human speech from machine-generated audio. LinguAS addresses this gap by introducing Expert-Defined Linguistic Features (EDLFs) that capture natural speech characteristics at larger timescales, giving detection models information that frame-level analysis alone does not provide.
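To make the idea of augmenting frame-level features concrete, here is a minimal Python sketch. It assumes one possible fusion scheme: pool the frame-level acoustic features over time, then append an utterance-level vector of linguistic annotations. The function name `augment_with_edlf`, the pooling choice, and the shape of the `edlf` vector are all illustrative assumptions; the summary above does not say which features LinguAS annotates or how the authors combine them with acoustic input.

```python
import numpy as np


def augment_with_edlf(frame_features: np.ndarray, edlf: np.ndarray) -> np.ndarray:
    """Fuse frame-level acoustic features with utterance-level linguistic features.

    frame_features: (num_frames, feat_dim) array of per-frame acoustic features.
    edlf: (num_edlf,) array of utterance-level linguistic annotations
          (hypothetical encoding, not the dataset's actual schema).
    """
    if frame_features.ndim != 2:
        raise ValueError("frame_features must have shape (num_frames, feat_dim)")
    if edlf.ndim != 1:
        raise ValueError("edlf must be a 1-D vector")

    # Summarize the frame sequence with mean and standard deviation over time.
    pooled = np.concatenate([frame_features.mean(axis=0), frame_features.std(axis=0)])

    # Append the longer-timescale linguistic features to the pooled acoustics.
    return np.concatenate([pooled, edlf])


# Toy input: 4 frames of 3-dimensional features plus 2 linguistic annotations.
frames = np.arange(12, dtype=float).reshape(4, 3)
linguistic = np.array([1.0, 0.0])
fused = augment_with_edlf(frames, linguistic)
```

The fused vector could then be passed to any standard classifier. Simple concatenation is only one option; a real system might instead feed the linguistic features into a separate model branch.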
The dataset represents a meaningful advancement in audio forensics research. With balanced samples across four spoofing attack types and genuine speech, plus granular metadata on speaker gender and audio source, LinguAS provides researchers with a more realistic training environment. The finding that EDLF-augmented models substantially exceed ASVspoof 2021 baselines and self-supervised learning approaches like HuBERT and XLSR demonstrates the practical value of linguistic information in synthetic speech detection.
This work carries implications across multiple sectors. For cybersecurity professionals and platform operators, improved deepfake detection directly reduces risks from audio-based fraud, impersonation, and disinformation campaigns. For AI researchers, the dataset establishes linguistic features as a viable augmentation strategy, potentially influencing how future detection systems combine acoustic and linguistic modalities. The public release of data and code accelerates community progress on a rapidly evolving threat.
Looking forward, the critical challenge involves scaling these linguistic approaches to languages beyond English and maintaining detection performance as adversarial techniques evolve. Research should explore how linguistic features perform against emerging attack methods and whether the EDLF methodology generalizes across diverse speech patterns and accents.
- LinguAS dataset combines 800+ audio samples with linguistic feature annotations to improve deepfake detection performance.
- Models trained on linguistic features significantly outperform existing ASVspoof 2021 and self-supervised learning baselines.
- The dataset includes balanced spoofing attack types and granular metadata on speaker gender and audio source.
- Expert-Defined Linguistic Features capture natural human speech patterns that frame-level acoustic features alone cannot detect.
- Public availability of data and code enables broader research community engagement with linguistic-based audio forensics.