On Deepfake Voice Detection -- It's All in the Presentation

Authors: Héctor Delgado, Giorgio Ramondetti, Emanuele Dalmasso, Gennady Karvitsky, Daniele Colibro, Haydar Talib

Published: 2025-09-30 16:19:51+00:00

Comment: Submitted to IEEE ICASSP 2026. Paper resources available at https://github.com/CavoloFrattale/deepfake-detection-test-protocol

AI Summary

This paper highlights how current deepfake datasets and research methodologies lead to systems that fail to generalize to real-world applications because training data lacks realistic presentation effects. The authors propose a new framework for data creation and research methodology that incorporates the effects of deepfake audio being presented through communication channels. By following these guidelines, they significantly improved deepfake detection accuracy both in more robust lab setups and on real-world benchmarks, demonstrating that dataset quality matters more than model size.

Abstract

While the technologies empowering malicious audio deepfakes have evolved dramatically in recent years due to advances in generative AI, the same cannot be said of global research into spoofing (deepfake) countermeasures. This paper highlights how current deepfake datasets and research methodologies have led to systems that fail to generalize to real-world applications. The main reason is the difference between raw deepfake audio and deepfake audio that has been presented through a communication channel, e.g., over the phone. We propose a new framework for data creation and research methodology, allowing for the development of spoofing countermeasures that would be more effective in real-world scenarios. By following the guidelines outlined here, we improved deepfake detection accuracy by 39% in more robust and realistic lab setups, and by 57% on a real-world benchmark. We also demonstrate that improvements in datasets would have a bigger impact on deepfake detection accuracy than choosing larger SOTA models over smaller ones; that is, it would be more important for the scientific community to invest in comprehensive data collection programs than to simply train larger models with higher computational demands.


Key findings
The study found that incorporating realistic data, particularly by simulating how deepfakes are presented through communication channels, drastically improves deepfake detection systems' generalization to real-world applications (a 39% accuracy gain in robust lab setups and 57% on a real-world benchmark). This improvement from dataset realism was more impactful than using larger, more computationally expensive SOTA models. The lightweight logmel-ResNet-CoT model, when fully augmented with realistic and presented data, remained competitive with much larger WavLM-based systems.
Approach
The authors propose a new framework for creating more realistic deepfake detection datasets by simulating real-world attack scenarios, specifically incorporating 'presentation' effects like direct audio injection or loudspeaker playback into phone calls. They built and used a 'Fraud Academy' dataset replicating the full fraud attack sequence and augmented training data with vocoder-synthesized speech. This data is then used to train and evaluate various deepfake detection models, emphasizing dataset realism over brute-force model scaling.
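The presentation effects described above can be approximated in code. Below is a minimal, illustrative sketch of "presenting" raw deepfake audio through a narrowband phone channel: band-limiting to the telephone passband and downsampling to 8 kHz. The function name and parameters are hypothetical; the paper's actual protocol uses real injection and loudspeaker playback into phone calls, which this crude simulation does not replace.

```python
import numpy as np

def present_through_phone(audio, sr, target_sr=8000, low_hz=300.0, high_hz=3400.0):
    """Crude 'presentation' simulation (illustrative, not the paper's protocol):
    band-limit to the ~300-3400 Hz telephone passband, then downsample to
    narrowband. Real channels also add codec and playback artifacts."""
    # Band-limit via FFT masking: zero out bins outside the passband.
    spec = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    spec[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    band_limited = np.fft.irfft(spec, n=len(audio))
    # Naive downsampling by linear interpolation to the target rate.
    n_out = int(len(audio) * target_sr / sr)
    t_in = np.arange(len(audio)) / sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, band_limited)

# Example: one second of 16 kHz audio "presented" at 8 kHz narrowband.
sr = 16000
x = np.random.default_rng(0).standard_normal(sr).astype(np.float32)
y = present_through_phone(x, sr)
```

Augmenting training data with such channel-distorted copies, alongside raw audio, is the kind of "presented" training condition the framework argues for.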
Datasets
ASVspoof 2019 LA, ASVspoof 5, SWB-Synth/raw, MLS-Synth/raw, SWB-Synth/inj., SWB-Synth/play., MLS-Synth/inj., MLS-Synth/play., VoxCeleb 1&2, Fisher Spanish, Mixer6, Mixed Phone, RSR2015, SITW, SRE 16/18/19, Switchboard, ASVspoof 2021 LA, ASVspoof 2021 LA Hidden, ASVspoof 2021 DF, ASVspoof 5 w/o Encodec, In-the-wild, SpoofCeleb, Pool, Fraud Academy (Injected TTS, Playback TTS)
Model(s)
logmel-ResNet-CoT, WavLM-LLGF, WavLM-Nes2Net
Author countries
UNKNOWN