Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception
A listening study of 1,768 participants finds that humans remain about as accurate as before at detecting fake audio (71.2%), but their trust in authentic speech has eroded significantly: the share of real samples correctly identified as real fell from 72.7% to 64.1% relative to a 2021 baseline. Modern commercial and language-model-generated deepfakes pose the greatest challenge to human perception, while ML detectors maintain accuracy above 94.5% across all conditions.
This research addresses a critical blind spot in deepfake discourse: the psychological impact of synthetic media on human trust rather than technical deception capability alone. The study's scale—35,532 judgments across 138 systems—provides robust statistical evidence that deepfake proliferation fundamentally alters how people evaluate genuine content, not just their ability to spot fakes.
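The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this summary; the variable names are illustrative, and the derived values (relative decline, judgments per participant) are computed here rather than reported by the study.

```python
# Figures quoted in this summary (not re-derived from the study's raw data).
participants = 1768
judgments = 35532
real_acc_2021 = 72.7  # % of real samples judged real, 2021 baseline
real_acc_now = 64.1   # % of real samples judged real, this study

# Absolute drop in percentage points.
drop_points = round(real_acc_2021 - real_acc_now, 1)

# The same drop expressed relative to the 2021 baseline, in percent.
relative_drop = round(100 * drop_points / real_acc_2021, 1)

# Average number of judgments contributed by each participant.
judgments_per_participant = round(judgments / participants, 1)

print(f"Drop: {drop_points} points ({relative_drop}% relative)")
print(f"Judgments per participant: {judgments_per_participant}")
```

The distinction matters when quoting the result: the 8.6-point drop is an absolute change in accuracy, which corresponds to roughly a 12% decline relative to the 2021 figure.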
The skepticism shift represents a significant departure from traditional threat models. In those models, deepfakes succeed through technical sophistication, by fooling the listener; here they succeed by poisoning the informational commons. When people cannot confidently accept legitimate speech as genuine, even accurate fake-detection skills lose much of their practical value. This mirrors documented patterns in misinformation research where uncertainty itself becomes weaponized.
Human detection accuracy varies by model architecture, which carries important implications for content authentication. Listeners correctly flag audio from commercial and autoregressive systems only 61.3-65.9% of the time, compared with 75.4-76.8% for traditional seq2seq and flow-matching models. This suggests that deployment decisions at scale, such as favoring commercial providers, directly affect how trustworthy audio appears. The 94.5%+ accuracy of ML detectors indicates that algorithmic solutions can compensate for human limitations, but only if they are implemented and trusted by users.
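To make the human-versus-detector gap concrete, the following sketch subtracts the human accuracy ranges from the detector figure. It assumes the 61.3% lower bound given in the takeaways list and treats 94.5% as a floor for detector accuracy, so each computed gap is a lower bound; the dictionary keys are illustrative labels, not the study's category names.

```python
# Human detection accuracy ranges (%) quoted in this summary.
human_acc = {
    "commercial / language-model": (61.3, 65.9),
    "seq2seq / flow-matching": (75.4, 76.8),
}

# Detectors are reported at 94.5% or better, so this is a floor.
detector_floor = 94.5

# For each family, the smallest and largest gap (in percentage points)
# between the detector floor and the human accuracy range.
gaps = {
    family: (round(detector_floor - high, 1), round(detector_floor - low, 1))
    for family, (low, high) in human_acc.items()
}

for family, (gap_min, gap_max) in gaps.items():
    print(f"{family}: detectors lead humans by at least {gap_min}-{gap_max} points")
```

On these figures, the detector advantage is at least about 29 points for the hardest systems and about 18 points for the easiest, which is the gap the analysis above refers to.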
For stakeholders in authentication, platform governance, and AI deployment, this research suggests that transparent detection systems and provenance verification mechanisms should be an immediate priority. The gap between human and machine accuracy will likely drive demand for audio watermarking, blockchain-based authentication, and standardized verification protocols. Organizations managing sensitive audio content face escalating liability if they cannot credibly verify speaker authenticity, even when the content is genuine.
- Human accuracy on real speech dropped 8.6 percentage points relative to the 2021 baseline (72.7% to 64.1%) while fake detection remained stable, indicating eroded trust rather than improved synthesis.
- Commercial and language-model-based deepfakes are detected only 61.3-65.9% of the time, versus 75.4-76.8% for traditional architectures, suggesting deployment choices directly affect authentication difficulty.
- ML detectors exceed 94.5% accuracy, well above human performance, so algorithmic tools could compensate for human perceptual limitations if they are properly implemented and trusted.
- Deepfakes threaten less through direct deception than through epistemic poisoning: they make people distrust genuine content regardless of detection capability.
- Content platforms and organizations handling sensitive audio need authentication infrastructure beyond human verification to maintain credibility in deepfake-saturated environments.