Why Speech Deepfake Detectors Won't Generalize: The Limits of Detection in an Open World

Authors: Visar Berisha, Prad Kadambi, Isabella Lenz

Published: 2025-09-23 20:27:04+00:00

AI Summary

Speech deepfake detectors fail to generalize in real-world conditions because of a combinatorial challenge termed 'coverage debt,' in which the data required to cover all deployment conditions grows faster than it can be collected. Analyzing cross-testing results, the authors show that detection performance drops significantly on newer synthesizers and in conversational speech domains. The study concludes that detection alone is insufficient for high-stakes decisions and must be integrated into layered defense strategies.

Abstract

Speech deepfake detectors are often evaluated on clean, benchmark-style conditions, but deployment occurs in an open world of shifting devices, sampling rates, codecs, environments, and attack families. This creates a "coverage debt" for AI-based detectors: every new condition multiplies with existing ones, producing data blind spots that grow faster than data can be collected. Because attackers can target these uncovered regions, worst-case performance (not average benchmark scores) determines security. To demonstrate the impact of the coverage debt problem, we analyze results from a recent cross-testing framework. Grouping performance by bona fide domain and spoof release year, two patterns emerge: newer synthesizers erase the legacy artifacts detectors rely on, and conversational speech domains (teleconferencing, interviews, social media) are consistently the hardest to secure. These findings show that detection alone should not be relied upon for high-stakes decisions. Detectors should be treated as auxiliary signals within layered defenses that include provenance, personhood credentials, and policy safeguards.
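
The multiplicative growth described in the abstract can be illustrated with simple arithmetic. The sketch below is not from the paper: the axis names follow the abstract, but the per-axis counts and the helper names (`total_conditions`, `new_cells_when_adding`) are hypothetical, chosen only to show how one new value on one axis adds a whole slice of uncovered combinations.

```python
from math import prod

# Hypothetical counts per deployment axis (illustrative only, not from the paper).
axes = {
    "devices": 5,
    "sampling_rates": 3,
    "codecs": 4,
    "environments": 6,
    "attack_families": 10,
}

def total_conditions(axes):
    """Number of distinct condition combinations in the full grid."""
    return prod(axes.values())

def new_cells_when_adding(axes, axis):
    """Combinations added to the grid when `axis` gains one more value."""
    return prod(n for name, n in axes.items() if name != axis)

total = total_conditions(axes)                           # 5 * 3 * 4 * 6 * 10
added = new_cells_when_adding(axes, "attack_families")   # 5 * 3 * 4 * 6
print(f"{total} combinations; one new attack family adds {added} more")
```

Under these assumed counts, a single new attack family creates hundreds of new condition combinations, each of which would need bona fide and spoofed samples to be covered.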


Key findings
Detection errors increase sharply for spoof families released post-2022/2024, demonstrating that newer synthesizers successfully erase artifacts older detectors rely on. Conversational speech settings (teleconferencing, interviews, social media) exhibit the largest degradation and are consistently the hardest domains to secure. These limitations suggest that robust defense requires moving beyond detector-only solutions.
Approach
The authors re-analyze results from a bona fide cross-testing framework that computed Equal Error Rate (EER) across multiple detector/dataset pairings. They regrouped these results based on two axes: the application-focused domain of the bona fide speech (e.g., conversational vs. read) and the release year of the spoof synthesizer. This analysis quantified the generalization gap under shifting deployment conditions.
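
As a rough sketch of this kind of regrouping, the code below computes an approximate EER from two score lists and then averages per-pairing EERs within each (domain, spoof year) cell. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the record fields (`domain`, `spoof_year`, `eer`), the score convention (higher means more likely bona fide), and the use of a simple mean per cell are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def equal_error_rate(bona_scores, spoof_scores):
    """Approximate EER, assuming higher scores mean 'more likely bona fide'."""
    best_gap, eer = None, None
    # Sweep every observed score as a candidate decision threshold.
    for t in sorted(set(bona_scores) | set(spoof_scores)):
        # False rejection: bona fide speech scored below the threshold.
        frr = sum(s < t for s in bona_scores) / len(bona_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

def regroup(results):
    """Average EER per (bona fide domain, spoof release year) cell."""
    cells = defaultdict(list)
    for r in results:
        cells[(r["domain"], r["spoof_year"])].append(r["eer"])
    return {key: mean(vals) for key, vals in cells.items()}

def worst_cell(grouped):
    """Return the (domain, year) cell with the highest average EER."""
    return max(grouped, key=grouped.get)
```

Reporting `worst_cell` alongside the per-cell averages reflects the abstract's point that worst-case performance, not the average benchmark score, determines security.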
Datasets
AMI IHM, AMI SDM, LibriSpeech (test-clean/test-other), VCTK 0.92, FakeAVCeleb-v1.2, In-The-Wild, EmoFake-EN, AV-Deepfake-1M. Spoof sets included ASVspoof 2019/2021, CodecFake, MLAAD-v3-EN, and LlamaPartialSpoof (covering >160 subsets overall).
Model(s)
Wav2Vec-SCL (using XLS-R as feature extractor). Wav2Vec-Conformer and Wav2Vec-TCM were also mentioned as detectors evaluated in the underlying framework.
Author countries
UNKNOWN