On Deepfake Voice Detection -- It's All in the Presentation

Authors: Héctor Delgado, Giorgio Ramondetti, Emanuele Dalmasso, Gennady Karvitsky, Daniele Colibro, Haydar Talib

Published: 2025-09-30 16:19:51+00:00

AI Summary

Current audio deepfake detection systems fail to generalize to real-world scenarios because existing datasets ignore the effects of presentation through communication channels (e.g., phone calls). The authors propose a new framework and research methodology incorporating these realistic presentation distortions into training data creation. This methodology significantly improved deepfake detection accuracy by 39% in robust lab setups and by 57% on a real-world benchmark, demonstrating the critical role of data realism over model size.

Abstract

While the technologies empowering malicious audio deepfakes have evolved dramatically in recent years thanks to advances in generative AI, the same cannot be said of global research into spoofing (deepfake) countermeasures. This paper highlights how current deepfake datasets and research methodologies have led to systems that fail to generalize to real-world applications. The main reason is the difference between raw deepfake audio and deepfake audio that has been presented through a communication channel, e.g., by phone. We propose a new framework for data creation and a research methodology that allow for the development of spoofing countermeasures that are more effective in real-world scenarios. By following the guidelines outlined here, we improved deepfake detection accuracy by 39% in more robust and realistic lab setups, and by 57% on a real-world benchmark. We also demonstrate that improvements in datasets would have a bigger impact on deepfake detection accuracy than the choice of larger SOTA models over smaller ones; that is, it would be more important for the scientific community to invest in comprehensive data collection programs than to simply train larger models with higher computational demands.


Key findings

Incorporating presentation realism into the training data improved deepfake detection accuracy more than using larger models or standard augmentation techniques did. The improved methodology yielded accuracy gains of 39% (lab setup) and 57% (real-world benchmark) over training with conventional datasets. Furthermore, the lightweight logmel-ResNet-CoT system remained highly competitive with the much larger WavLM-based systems when trained on the fully augmented, realistic data.

Approach

The core approach is a new data-creation framework that simulates the full deepfake attack sequence, in particular the "presentation phase", in which raw deepfake audio is processed through telephony networks via direct injection or loudspeaker playback. The authors train and evaluate SOTA deepfake detection systems (logmel-ResNet-CoT and WavLM-based models) on different combinations of realistic (presented) and non-realistic (Base, Augmented) training data to test the hypothesis that data realism improves generalization.
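The presentation phase the paper describes can be approximated offline as a data-augmentation step. The following is a minimal sketch (not the authors' actual pipeline, which involves real telephony networks and loudspeaker playback): it mimics a narrowband phone channel with a crude anti-aliasing low-pass, decimation to 8 kHz, and G.711-style 8-bit mu-law quantization. All function names here are illustrative, and only numpy is assumed.

```python
import numpy as np

def mu_law(x, mu=255):
    # mu-law companding curve, as used in G.711 telephony codecs
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_inv(y, mu=255):
    # inverse of the companding curve
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

def simulate_phone_channel(audio, sr=16000, target_sr=8000, levels=256):
    """Crude narrowband telephony simulation: low-pass + decimate
    to 8 kHz, then 8-bit mu-law quantize/dequantize."""
    k = sr // target_sr
    # simple moving-average low-pass as a stand-in anti-aliasing filter
    lowpassed = np.convolve(audio, np.ones(k) / k, mode="same")
    narrowband = lowpassed[::k]
    # quantize the companded signal to `levels` steps, then invert
    companded = mu_law(narrowband)
    quantized = np.round((companded + 1) / 2 * (levels - 1))
    dequantized = quantized / (levels - 1) * 2 - 1
    return mu_law_inv(dequantized)

# example: pass a 1 s, 440 Hz tone through the simulated channel
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
phone = simulate_phone_channel(clean, sr)
```

Applying such a transform to raw deepfake audio before training is one way to expose a detector to "presented" rather than "raw" spoofed speech, which is the distinction the paper argues current datasets miss.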

Datasets

ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 LA Hidden, ASVspoof 2021 DF, ASVspoof 5 w/o Encodec, In-the-wild, SpoofCeleb, Switchboard (SWB), Multilingual LibriSpeech (MLS), Fraud Academy (Realworld/private dataset), VoxCeleb 1&2, Fisher Spanish, Mixer6, Mixed Phone, RSR2015, SITW, SRE 16/18/19.

Model(s)

logmel-ResNet-CoT (Residual Network with Contextual Transformers), WavLM-LLGF (WavLM frontend + LCNN/bi-LSTM backend), WavLM-Nes2Net (WavLM frontend + Nested Res2Net backend), WavLM Large (pre-trained SSL model).
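The "logmel" frontend named above refers to log mel-filterbank features. As context for why that system is lightweight compared with a WavLM encoder, here is a minimal numpy-only sketch of such a frontend (parameter values like 40 mel bands and a 10 ms hop are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def logmel(audio, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Log mel-filterbank features: framed Hann-windowed power
    spectrogram projected onto triangular mel filters."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # triangular mel filterbank over the FFT bins
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        if c > l:
            fb[m - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[m - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return np.log(power @ fb.T + 1e-10)

# example: features for a 1 s, 440 Hz tone at 16 kHz
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = logmel(tone)
```

These fixed, handcrafted features feed a comparatively small ResNet, whereas the WavLM-based systems learn their representations with a large pre-trained SSL encoder, which is the size contrast underlying the paper's data-versus-model-size argument.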

Author countries

UNKNOWN