Exploiting Neural Audio Codec Latents for Adversarial Audio Attacks
Researchers demonstrate an adversarial attack on audio classification systems that operates in the latent space of neural audio codecs, achieving 99% targeted attack success rates with inference latency under 7 ms. The approach outperforms existing generative and optimization-based attacks, exposing vulnerabilities in real-time audio security systems such as speaker verification.
This research exposes a fundamental security gap in deep learning-based audio systems by introducing a computationally efficient method to generate adversarial attacks. The work addresses a persistent challenge in adversarial machine learning: existing optimization-based attacks like PGD and Carlini-Wagner require iterative updates in high-dimensional waveform space, making real-time threats impractical. Generative alternatives struggle with latency and perceptible artifacts that compromise attack realism.
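To make the cost of iterative waveform-space attacks concrete, here is a minimal sketch of L-infinity PGD against a toy linear scorer. It is not the setup from the research: the linear model `w`, the 16 kHz one-second input, and the values of `eps`, `alpha`, and `steps` are all assumptions chosen for illustration. The point is only that every step touches all 16,000 samples, and a real attack would also need a full forward and backward pass through the classifier at each step.

```python
import random

def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)

def score(w, x):
    # Toy linear "classifier" score; a stand-in for a real model's target logit.
    return sum(wi * xi for wi, xi in zip(w, x))

def pgd_attack(w, x, eps=0.01, alpha=0.002, steps=10):
    """L-infinity PGD that raises score(w, x) for a toy linear model."""
    x_adv = list(x)
    for _ in range(steps):
        # For a linear score the input gradient is just w; a real model
        # would need a backward pass here on every iteration.
        grad = w
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back into the eps-ball around the clean waveform.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

random.seed(0)
n = 16000  # assumed: one second of audio at 16 kHz
w = [random.gauss(0, 1) for _ in range(n)]
x = [random.uniform(-0.1, 0.1) for _ in range(n)]
x_adv = pgd_attack(w, x)
print(score(w, x), score(w, x_adv))
```

The loop performs `steps` updates over `n` values each, which is the per-sample, per-iteration cost that makes such attacks hard to run in real time.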
The breakthrough leverages neural audio codec latent spaces, which compress audio information into lower-dimensional representations. By synthesizing adversarial perturbations in this compressed space and decoding them back to waveforms, the method achieves both speed and quality. The 24x latency reduction compared to competing generative approaches, combined with 99% targeted attack success rates, demonstrates practical viability for real-time threat scenarios.
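The idea of perturbing a compressed latent and decoding once can be sketched with a toy example. Everything here is hypothetical: the random linear `dec` matrix stands in for a neural codec decoder, the linear scorer `w` stands in for the victim classifier, and a single signed-gradient step in latent space stands in for whatever trained generator the researchers use, which the summary does not describe. The sketch only shows the shape of the computation: a handful of latent values are changed, and one decode produces the full waveform.

```python
import random

def score(w, x):
    # Toy linear "classifier" score on the waveform (assumed stand-in).
    return sum(wi * xi for wi, xi in zip(w, x))

def decode(dec, z):
    # Toy linear decoder: latent (k values) -> waveform (n samples).
    return [sum(d * zj for d, zj in zip(row, z)) for row in dec]

def latent_attack(dec, w, z, step=0.05):
    """Shift the latent once, then decode once (no iterative loop)."""
    k = len(z)
    # With x = dec @ z, the gradient of w.x with respect to z is dec^T w.
    grad_z = [sum(dec[i][j] * w[i] for i in range(len(w))) for j in range(k)]
    # Signed step in latent space; a stand-in for a learned generator's output.
    z_adv = [zj + step * ((g > 0) - (g < 0)) for zj, g in zip(z, grad_z)]
    return z_adv, decode(dec, z_adv)

random.seed(0)
n, k = 320, 8  # assumed sizes: 320 waveform samples per 8 latent values
dec = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
w = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(k)]

x_clean = decode(dec, z)
z_adv, x_adv = latent_attack(dec, w, z)
print(score(w, x_clean), score(w, x_adv))
```

In this toy, only `k` values are modified instead of `n`, and the waveform is produced by a single call to `decode`; that is the structural reason a latent-space attack can be cheaper than per-sample iterative optimization.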
This development carries significant implications for security-critical audio applications. Automatic speaker verification, a form of biometric authentication, protects access to financial accounts and secure communications. The attack's speed could enable new attack vectors against these systems in production environments. The research suggests that security assumptions around real-time audio processing, in particular that adversarial audio is too slow to generate live, may no longer hold.
Looking forward, this work will likely trigger defensive research into audio codec robustness and adversarial detection mechanisms. Organizations deploying audio-based authentication systems should reassess their threat models and consider implementing detection systems for adversarial perturbations. The research also highlights the need for codec designs that inherently resist such attacks, suggesting a new arms race in audio security.
- Neural audio codec latent spaces enable efficient single-pass generation of adversarial audio attacks with sub-7ms inference latency.
- The method achieves 99% targeted attack success rates against speaker verification systems, outperforming optimization and generative baselines.
- Operating in compressed latent space reduces latency by 24x compared to competing generative attack methods.
- Audio classification systems including speaker verification face practical real-time adversarial threats previously thought computationally infeasible.
- Security-critical audio authentication systems require defensive updates to detect and mitigate latent-space adversarial perturbations.