Why Speech Deepfake Detectors Won't Generalize: The Limits of Detection in an Open World

Authors: Visar Berisha, Prad Kadambi, Isabella Lenz

Published: 2025-09-23 20:27:04+00:00

AI Summary

This paper argues that speech deepfake detectors fail to generalize in real-world, open-world conditions due to a 'coverage debt' caused by multiplicatively growing factors like devices, codecs, and attack families. Through an analysis of a cross-testing framework, the authors demonstrate that detectors struggle significantly with newer synthesizers and conversational speech domains. They conclude that detection alone is insufficient for high-stakes decisions and advocate for layered defenses.

Abstract

Speech deepfake detectors are often evaluated on clean, benchmark-style conditions, but deployment occurs in an open world of shifting devices, sampling rates, codecs, environments, and attack families. This creates a "coverage debt" for AI-based detectors: every new condition multiplies with existing ones, producing data blind spots that grow faster than data can be collected. Because attackers can target these uncovered regions, worst-case performance (not average benchmark scores) determines security. To demonstrate the impact of the coverage debt problem, we analyze results from a recent cross-testing framework. Grouping performance by bona fide domain and spoof release year, two patterns emerge: newer synthesizers erase the legacy artifacts detectors rely on, and conversational speech domains (teleconferencing, interviews, social media) are consistently the hardest to secure. These findings show that detection alone should not be relied upon for high-stakes decisions. Detectors should be treated as auxiliary signals within layered defenses that include provenance, personhood credentials, and policy safeguards.


Key findings
Detector errors sharply increase for post-2022 synthesizers, as newer systems erase legacy artifacts that detectors rely on. Conversational speech domains (teleconferencing, interviews, social media) consistently exhibit the largest performance degradation compared to read speech. The study highlights that detection alone is insufficient for reliable deepfake mitigation in open-world scenarios due to both epistemic (coverage debt) and aleatoric (irreducible overlap) uncertainties.
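The "coverage debt" argument rests on multiplicative growth: each new deployment axis (device, codec, environment, attack family) multiplies the number of conditions a training set would need to cover. A minimal sketch of this arithmetic, with purely illustrative axis counts that are not taken from the paper:

```python
# Illustrative deployment axes and hypothetical sizes (not the paper's figures).
axes = {
    "device": 5,
    "codec": 6,
    "sampling_rate": 3,
    "environment": 4,
    "attack_family": 10,
}

def n_conditions(axes: dict) -> int:
    """Total distinct deployment conditions: the product of the axis sizes."""
    total = 1
    for size in axes.values():
        total *= size
    return total

# Adding one more value to any single axis multiplies, not adds, the total:
base = n_conditions(axes)                              # 5*6*3*4*10 = 3600
axes_plus = dict(axes, attack_family=axes["attack_family"] + 1)
grown = n_conditions(axes_plus)                        # 5*6*3*4*11 = 3960
print(base, grown)
```

Even under these small counts, a dataset sampling a fixed budget per condition falls further behind with every axis the open world adds, which is the blind-spot dynamic the authors describe.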
Approach
The authors analyze cross-testing results from an existing evaluation framework, focusing on the highest-performing detector (Wav2Vec-SCL). They regroup equal error rate (EER) results by bona fide speech domain (e.g., teleconferencing, audiobooks) and by spoof release year (2019-2024) to assess generalization under evolving real-world conditions.
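The regrouped metric is the equal error rate (EER), the operating point where the false acceptance rate on spoofs equals the false rejection rate on bona fide speech. A minimal sketch of computing EER from detector scores (a standard definition; the function below is illustrative, not the paper's code):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the point where false accept rate (spoofs passed) equals
    false reject rate (bona fide blocked).

    scores: detector outputs, higher = more likely bona fide.
    labels: 1 = bona fide, 0 = spoof.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # spoofs accepted at threshold t
        frr = np.mean(scores[labels == 1] < t)   # bona fide rejected at threshold t
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Cleanly separated scores give EER = 0; heavy overlap pushes it toward 0.5.
print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

Computing this per (bona fide domain, spoof release year) cell, as the authors do, exposes worst-case cells that an aggregate benchmark score would average away.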
Datasets
AMI IHM, AMI SDM, LibriSpeech test-clean, LibriSpeech test-other, VCTK 0.92, FakeAVCeleb-v1.2, In-The-Wild, EmoFake-EN, AV-Deepfake-1M, ASVspoof 2019 LA, ASVspoof 2021 DF, CodecFake, MLAAD-v3-EN, LlamaPartialSpoof
Model(s)
Wav2Vec-SCL (using XLS-R as a feature extractor)
Author countries
UNKNOWN