TwinShift: Benchmarking Audio Deepfake Detection across Synthesizer and Speaker Shifts
Authors: Jiyoung Hong, Yoonseo Chung, Seungyeon Oh, Juntae Kim, Jiyoung Lee, Sookyung Kim, Hyunsoo Cho
Published: 2025-10-27 08:06:07+00:00
Comment: Submitted to ICASSP 2026
AI Summary
This paper introduces TWINSHIFT, a novel benchmark designed to evaluate the robustness and generalization capabilities of audio deepfake detection (ADD) systems under strictly unseen conditions. It is constructed from six different synthesis systems, each paired with disjoint sets of speakers, allowing for rigorous assessment of detector performance when both the generative model and speaker identity change. TWINSHIFT reveals significant robustness gaps in current ADD systems and provides guidance for developing more resilient detectors.
Abstract
Audio deepfakes pose a growing threat and are already being exploited for fraud and misinformation. A key challenge is ensuring that detectors remain robust to unseen synthesis methods and diverse speakers, since generation techniques evolve quickly. Despite strong results on existing benchmarks, current systems struggle to generalize to new conditions, which limits their real-world reliability. To address this, we introduce TWINSHIFT, a benchmark explicitly designed to evaluate detection robustness under strictly unseen conditions. Our benchmark is constructed from six different synthesis systems, each paired with a disjoint set of speakers, allowing a rigorous assessment of how well detectors generalize when both the generative model and the speaker identity change. Through extensive experiments, we show that TWINSHIFT reveals important robustness gaps, uncovers overlooked limitations, and provides principled guidance for developing audio deepfake detection (ADD) systems. The TWINSHIFT benchmark can be accessed at https://github.com/intheMeantime/TWINSHIFT.
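To make the "strictly unseen" condition concrete, the Python sketch below enumerates train/test pairings in which both the synthesizer and the speakers change. It is a minimal illustration under stated assumptions: the subset names (synth_0 to synth_5), the speaker IDs, and the one-subset-train, one-subset-test pairing are hypothetical placeholders, not the actual TWINSHIFT file layout or split definitions, which are documented in the repository. The only property taken from the abstract is that each of the six synthesis systems has its own disjoint speaker set.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

# Hypothetical layout: six synthesizer subsets, each paired with its own
# placeholder speaker set. Names are illustrative, not TWINSHIFT's real IDs.
SUBSETS: Dict[str, Set[str]] = {
    f"synth_{i}": {f"spk_{i}_{j}" for j in range(3)} for i in range(6)
}


def check_disjoint(subsets: Dict[str, Set[str]]) -> None:
    """Raise ValueError if any speaker appears under two synthesizers."""
    seen: Dict[str, str] = {}
    for synth, speakers in subsets.items():
        for spk in speakers:
            if spk in seen:
                raise ValueError(f"speaker {spk} in both {seen[spk]} and {synth}")
            seen[spk] = synth


def is_strictly_unseen(train: str, test: str,
                       subsets: Dict[str, Set[str]]) -> bool:
    """True when the test synthesizer differs from the training one and
    none of the test speakers were seen during training."""
    return train != test and subsets[train].isdisjoint(subsets[test])


def cross_shift_pairs(subsets: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """All ordered (train, test) pairs of distinct subsets. Because speaker
    sets are disjoint, every such pair shifts both synthesizer and speaker."""
    check_disjoint(subsets)
    return list(permutations(subsets, 2))


if __name__ == "__main__":
    pairs = cross_shift_pairs(SUBSETS)
    print(len(pairs), all(is_strictly_unseen(a, b, SUBSETS) for a, b in pairs))
```

With six disjoint subsets this yields 6 × 5 = 30 ordered pairings, each of which changes both the generative model and the speaker identity. Enforcing speaker disjointness up front is what separates this setting from a synthesizer-only shift, where a detector could still exploit familiar voices.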