TwinShift: Benchmarking Audio Deepfake Detection across Synthesizer and Speaker Shifts

Authors: Jiyoung Hong, Yoonseo Chung, Seungyeon Oh, Juntae Kim, Jiyoung Lee, Sookyung Kim, Hyunsoo Cho

Published: 2025-10-27 08:06:07+00:00

AI Summary

This paper introduces TWINSHIFT, a new benchmark designed to rigorously evaluate the robustness and generalization ability of Audio Deepfake Detection (ADD) systems under strictly unseen conditions. TWINSHIFT evaluates detectors under simultaneous shifts in both the speech synthesizer and the speaker identity, using six different synthesis systems paired with disjoint sets of speakers. Experiments reveal significant robustness gaps, confirming that current state-of-the-art (SOTA) detectors fail when confronted with truly novel deepfakes.

Abstract

Audio deepfakes pose a growing threat and are already exploited in fraud and misinformation. A key challenge is ensuring detectors remain robust to unseen synthesis methods and diverse speakers, since generation techniques evolve quickly. Despite strong benchmark results, current systems struggle to generalize to new conditions, limiting real-world reliability. To address this, we introduce TWINSHIFT, a benchmark explicitly designed to evaluate detection robustness under strictly unseen conditions. Our benchmark is constructed from six different synthesis systems, each paired with disjoint sets of speakers, allowing for a rigorous assessment of how well detectors generalize when both the generative model and the speaker identity change. Through extensive experiments, we show that TWINSHIFT reveals important robustness gaps, uncovers overlooked limitations, and provides principled guidance for developing ADD systems. The TWINSHIFT benchmark can be accessed at https://github.com/intheMeantime/TWINSHIFT.


Key findings
Detector performance collapses sharply when evaluated under combined synthesizer and speaker shifts, even though performance is near-perfect on in-domain data. Generator mismatch is confirmed as the dominant source of degradation, and cross-environment transfer is found to be highly non-commutative: a detector trained on environment A and tested on environment B can behave very differently from one trained on B and tested on A. Training on high-fidelity generators does not guarantee broader robustness, suggesting that models overfit to narrow, generator-specific artifacts rather than learning generalizable cues.
Approach
The approach centers on creating and analyzing the TWINSHIFT benchmark, which measures generalization across simultaneous shifts in synthesizer and speaker identity. Existing state-of-the-art detectors are trained on one environment (a synthesizer paired with its own speaker set) and then rigorously evaluated on the five remaining disjoint environments to assess cross-environment transferability, as sketched below. The benchmark uses six diverse TTS/VC systems to generate challenging out-of-distribution deepfakes.
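
A minimal sketch of this train-on-one, test-on-the-rest protocol, assuming equal error rate (EER) as the metric (standard in ADD evaluation, though not stated above). The environment names reuse the paper's six synthesizers, but train_detector and score_detector are hypothetical placeholders that simulate detection scores rather than running real models.

# Hypothetical harness for the TWINSHIFT-style cross-environment protocol.
import numpy as np
from sklearn.metrics import roc_curve

ENVIRONMENTS = ["MeloTTS", "ParlerTTS", "ElevenLabs",
                "HierSpeech++", "F5-TTS", "OZSpeech"]

def eer(labels, scores):
    # Equal error rate: the operating point where the false positive
    # rate and the false negative rate (1 - TPR) cross.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

rng = np.random.default_rng(0)

def train_detector(train_env):
    # Placeholder: a real harness would fit a detector (e.g. AASIST)
    # on the bona fide and spoofed audio of this environment.
    return train_env

def score_detector(model, test_env):
    # Placeholder: score the test environment's eval set.
    # Labels: 1 = bona fide, 0 = spoof; higher score = more bona fide.
    labels = rng.integers(0, 2, size=200)
    scores = labels + rng.normal(0.0, 1.2, size=200)
    return labels, scores

# Train on each environment, evaluate on all six (the diagonal is the
# in-domain reference; off-diagonal cells are the unseen conditions).
n = len(ENVIRONMENTS)
eer_matrix = np.zeros((n, n))
for i, train_env in enumerate(ENVIRONMENTS):
    model = train_detector(train_env)
    for j, test_env in enumerate(ENVIRONMENTS):
        labels, scores = score_detector(model, test_env)
        eer_matrix[i, j] = eer(labels, scores)

# Non-commutative transfer: EER(train A -> test B) need not equal
# EER(train B -> test A), so the matrix is generally asymmetric.
print(np.round(eer_matrix, 3))
print("max transfer asymmetry:",
      round(float(np.abs(eer_matrix - eer_matrix.T).max()), 3))

In the paper's terms, the off-diagonal cells of this matrix are where the reported performance collapse appears, and the matrix's asymmetry is the non-commutativity noted in the key findings.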
Datasets
TWINSHIFT (composed of ASVspoof 2019 LA train, In-the-Wild, Expresso, Emilia, and LibriTTS train-clean-100, together with spoofs generated by MeloTTS, ParlerTTS, ElevenLabs, HierSpeech++, F5-TTS, and OZSpeech).
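
The defining constraint is that each synthesizer comes with its own speaker pool and the pools never overlap, so every cross-environment evaluation shifts synthesizer and speakers at once. A tiny illustrative check of that invariant (speaker IDs here are invented; the actual corpus-to-synthesizer pairing is defined by the benchmark release, not by this sketch):

# Hypothetical illustration of TWINSHIFT's disjoint-speaker design:
# every pair of environments must share zero speakers.
from itertools import combinations

speaker_pools = {  # speaker IDs invented for illustration
    "MeloTTS": {"spk001", "spk002"},
    "ParlerTTS": {"spk101", "spk102"},
    "ElevenLabs": {"spk201", "spk202"},
    "HierSpeech++": {"spk301", "spk302"},
    "F5-TTS": {"spk401", "spk402"},
    "OZSpeech": {"spk501", "spk502"},
}

for (a, pool_a), (b, pool_b) in combinations(speaker_pools.items(), 2):
    shared = pool_a & pool_b
    assert not shared, f"{a} and {b} share speakers: {shared}"
print("all six speaker pools are pairwise disjoint")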
Model(s)
SE-Res2Net, RawNet2, AASIST, RawBMamba
Author countries
Republic of Korea