TwinShift: Benchmarking Audio Deepfake Detection across Synthesizer and Speaker Shifts

Authors: Jiyoung Hong, Yoonseo Chung, Seungyeon Oh, Juntae Kim, Jiyoung Lee, Sookyung Kim, Hyunsoo Cho

Published: 2025-10-27 08:06:07+00:00

Comment: Submitted to ICASSP 2026

AI Summary

This paper introduces TWINSHIFT, a novel benchmark designed to evaluate the robustness and generalization capabilities of audio deepfake detection (ADD) systems under strictly unseen conditions. It is constructed from six different synthesis systems, each paired with disjoint sets of speakers, allowing for rigorous assessment of detector performance when both the generative model and speaker identity change. TWINSHIFT reveals significant robustness gaps in current ADD systems and provides guidance for developing more resilient detectors.

Abstract

Audio deepfakes pose a growing threat, already exploited in fraud and misinformation. A key challenge is ensuring detectors remain robust to unseen synthesis methods and diverse speakers, since generation techniques evolve quickly. Despite strong benchmark results, current systems struggle to generalize to new conditions, limiting real-world reliability. To address this, we introduce TWINSHIFT, a benchmark explicitly designed to evaluate detection robustness under strictly unseen conditions. Our benchmark is constructed from six different synthesis systems, each paired with disjoint sets of speakers, allowing for a rigorous assessment of how well detectors generalize when both the generative model and the speaker identity change. Through extensive experiments, we show that TWINSHIFT reveals important robustness gaps, uncovers overlooked limitations, and provides principled guidance for developing ADD systems. The TWINSHIFT benchmark can be accessed at https://github.com/intheMeantime/TWINSHIFT.


Key findings
The benchmark reveals that while detectors achieve near-perfect accuracy within a single, seen environment, their performance collapses when evaluated on unseen environments, i.e., under synthesizer and speaker shifts. No single detector architecture or training dataset consistently provides resilience against these shifts, suggesting that robustness requires advances in both model design and data strategy. Transferability between environments is also highly non-commutative (training on environment A and testing on B can yield very different results from the reverse), and training on high-fidelity spoofs does not guarantee broad generalization.
Approach
The authors introduce TWINSHIFT, a benchmark that systematically evaluates audio deepfake detection systems by simulating real-world unseen conditions along two orthogonal axes: the synthesis model and the speaker identity. It comprises six mutually disjoint environments, each pairing a dedicated bonafide dataset with a specific spoofing system, so that no speakers or synthesizers overlap between training and evaluation splits. This setup allows measuring both within-environment performance and cross-environment transferability to diagnose generalization failures.
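
A minimal sketch of this cross-environment protocol follows, assuming the equal error rate (EER) metric standard in ADD evaluation. The names train_fn, score_fn, and the environment dictionary are placeholders for whichever detector (e.g., AASIST) and data loaders one plugs in; they are not part of the released benchmark.

    import numpy as np

    def eer(bonafide_scores, spoof_scores):
        # Equal error rate; convention: higher score = more bonafide-like.
        thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
        far = np.array([(spoof_scores >= t).mean() for t in thresholds])    # spoof accepted
        frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # bonafide rejected
        i = np.argmin(np.abs(far - frr))
        return float((far[i] + frr[i]) / 2)

    def cross_environment_grid(envs, train_fn, score_fn):
        # envs: {name: (train_split, eval_bonafide, eval_spoof)} -- hypothetical layout.
        # Returns an NxN EER matrix: rows = training env, cols = evaluation env.
        names = list(envs)
        grid = np.zeros((len(names), len(names)))
        for i, src in enumerate(names):
            detector = train_fn(envs[src][0])  # fit the detector on environment `src` only
            for j, tgt in enumerate(names):
                _, bona, spoof = envs[tgt]
                grid[i, j] = eer(score_fn(detector, bona), score_fn(detector, spoof))
        return grid  # diagonal: within-environment; off-diagonal: synthesizer + speaker shift

Comparing grid[i, j] with grid[j, i] makes the non-commutative transferability noted in the key findings directly visible.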
Datasets
Bonafide sources: ASVspoof’19 LA train, In-the-Wild, Expresso, Emilia, LibriTTS train-clean-100. Spoof generation: MeloTTS, HierSpeech++, ParlerTTS, F5-TTS, OZSpeech, ElevenLabs API. The resulting benchmark is released as TWINSHIFT.
Model(s)
SE-Res2Net, RawNet2, AASIST, RawBMamba
Author countries
Republic of Korea