Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing
Researchers propose a dual-branch gated fusion framework to identify the source of synthetic audio deepfakes, combining XLSR-53 with CORES descriptors to achieve 97.6% accuracy on in-domain tests and superior generalization to unseen synthesizers. The approach addresses a critical security gap where existing closed-set models fail to reject unknown audio generation systems.
Audio deepfake attribution is a growing security challenge as synthetic speech generation becomes more accessible and convincing. Systems trained on a fixed set of synthesizers fail badly when they encounter a novel generation method: they attribute the audio to one of the known sources instead of rejecting it, and those false positives undermine trust in audio authentication. This research tackles the open-set problem (attributing a sample to a known source or flagging it as coming from an unknown one), which has direct implications for forensic analysis, misinformation detection, and authentication systems.
The technical innovation rests on the observation that different feature representations capture complementary information. XLSR-53, a self-supervised speech model, excels at distinguishing known synthesizers but overfits to the training data. CORES, a hand-crafted descriptor spanning multiple acoustic dimensions, is more robust to distribution shift but lacks the discriminative power of learned features. Rather than simply concatenating the two feature sets, the authors use adaptive gating that learns to weight each branch according to the input, avoiding the representational conflicts that plagued earlier fusion attempts.
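The gating idea can be sketched in a few lines. This is a minimal illustration, not the authors' architecture: it assumes both branches have already been projected to embeddings of the same length, and it uses a per-dimension sigmoid gate with a convex combination, which is one common form of gated fusion. The names `gated_fusion`, `h_ssl`, `h_hand`, `gate_w`, and `gate_b` are hypothetical, and the weights are random rather than trained.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(weights, bias, x):
    """Dense layer: `weights` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def gated_fusion(h_ssl, h_hand, gate_w, gate_b):
    """Fuse two equal-length embeddings with an input-dependent gate.

    The gate is computed from both branches, so the mixing weights change
    from sample to sample instead of being fixed as in concatenation.
    """
    gate = [sigmoid(z) for z in linear(gate_w, gate_b, h_ssl + h_hand)]
    fused = [g * a + (1.0 - g) * b for g, a, b in zip(gate, h_ssl, h_hand)]
    return fused, gate

# Toy inputs standing in for the two branch embeddings (assumption: dim 4).
rng = random.Random(0)
dim = 4
h_ssl = [rng.uniform(-1, 1) for _ in range(dim)]    # learned-feature branch
h_hand = [rng.uniform(-1, 1) for _ in range(dim)]   # hand-crafted branch
gate_w = [[rng.gauss(0, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
gate_b = [0.0] * dim

fused, gate = gated_fusion(h_ssl, h_hand, gate_w, gate_b)
```

Each gate value lies strictly between 0 and 1, so every fused dimension is an interpolation between the two branches. A gate near 1 trusts the learned features; a gate near 0 falls back on the hand-crafted descriptor.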
For the AI security and media forensics industries, this work advances authentication capabilities needed to detect manipulated voice content in high-stakes contexts such as financial fraud, political deepfakes, and identity verification. The 83.5% relative reduction in false positive rate at the 95% detection threshold (that is, with the threshold set so that 95% of known-source samples are accepted) directly improves the viability of practical deployment. Organizations building content verification systems gain a more generalizable way to flag novel synthesis methods as they emerge.
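To make the metric concrete, the sketch below computes a false positive rate at a fixed true positive rate and the relative reduction between two systems. It assumes the "95% detection threshold" means the threshold that accepts 95% of known-source samples, and that a higher score means "more likely known". The function names are hypothetical, and the two rates at the bottom are invented for the arithmetic only; the article does not report absolute rates.

```python
import math

def fpr_at_tpr(known_scores, unknown_scores, target_tpr=0.95):
    """Fraction of unknown samples accepted at the threshold that
    accepts `target_tpr` of the known samples (higher score = known)."""
    ranked = sorted(known_scores, reverse=True)
    k = max(1, math.ceil(target_tpr * len(ranked)))  # known samples to accept
    threshold = ranked[k - 1]
    false_pos = sum(1 for s in unknown_scores if s >= threshold)
    return false_pos / len(unknown_scores)

def relative_reduction(baseline, improved):
    """Relative (not absolute) drop from `baseline` to `improved`."""
    return (baseline - improved) / baseline

# Hypothetical rates, chosen only to show what an 83.5% relative cut looks like.
baseline_fpr = 0.40
improved_fpr = 0.066
reduction = relative_reduction(baseline_fpr, improved_fpr)
```

A relative figure says nothing about the starting point: the same 83.5% reduction could describe a drop from 40% to 6.6% or from 4% to 0.66%, which matter very differently in deployment.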
Future development will likely focus on real-world robustness against compression, background noise, and adversarial attacks designed to evade detection. Integration with multimodal systems that combine audio and video deepfake detection remains an open frontier.
- Dual-branch architecture with adaptive gating outperforms naive feature concatenation for audio source attribution
- System achieves 97.6% accuracy on known synthesizers while maintaining strong generalization to unseen audio generation methods
- CORES descriptor's multi-dimensional approach (cepstral, spectral, rhythmic) captures synthesis artifacts missed by filter-bank-only methods
- Energy margin loss and gate diversity regularization improve separation between in-distribution and out-of-distribution audio samples
- 83.5% relative reduction in false positive rate at the 95% detection threshold enables practical forensic authentication applications
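The two training terms named in the takeaways can be illustrated as follows. The summary does not give the authors' formulas, so this is a sketch under stated assumptions: the energy term follows the squared-hinge form widely used in energy-based out-of-distribution detection, with hypothetical margins `m_in` and `m_out`, and the gate term is one plausible reading of "gate diversity" that discourages the batch-mean gate from collapsing onto a single branch. It also assumes some held-out "unknown" audio is available during training.

```python
import math

def energy(logits, temperature=1.0):
    """Free energy -T * logsumexp(logits / T); lower means more confident."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))

def energy_margin_loss(known_logits, unknown_logits, m_in=-5.0, m_out=-1.0):
    """Squared hinge: push known energies below m_in, unknown above m_out.

    The margins are illustrative defaults, not values from the paper.
    """
    pen_in = [max(0.0, energy(z) - m_in) ** 2 for z in known_logits]
    pen_out = [max(0.0, m_out - energy(z)) ** 2 for z in unknown_logits]
    return sum(pen_in) / len(pen_in) + sum(pen_out) / len(pen_out)

def gate_diversity_penalty(gates, eps=1e-8):
    """Negative binary entropy of the batch-mean gate value.

    Smallest (-ln 2) when both branches are used equally on average,
    and close to 0 when the gate always picks the same branch.
    """
    flat = [g for row in gates for g in row]
    p = min(max(sum(flat) / len(flat), eps), 1.0 - eps)
    return p * math.log(p) + (1.0 - p) * math.log(1.0 - p)

# Confident logits for a known source, flat logits for an unknown one.
well_separated = energy_margin_loss([[10.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]])
badly_separated = energy_margin_loss([[0.0, 0.0, 0.0]], [[10.0, 0.0, 0.0]])
```

At test time the same energy value can serve as the rejection score: samples whose energy falls above a chosen threshold are flagged as coming from an unknown source.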