EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

Authors: Tong Zhang, Yihuan Huang, Yanzhen Ren

Published: 2025-10-22 09:34:31+00:00

AI Summary

This paper introduces EchoFake, a novel dataset designed to address the vulnerability of speech deepfake detection systems to physical replay attacks, which often bypass models trained on lab-generated synthetic speech. EchoFake provides over 120 hours of audio, including cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings captured under varied real-world conditions. Experiments show that models trained on EchoFake achieve lower average Equal Error Rates (EERs) and better generalization across datasets, highlighting its value for advancing robust anti-spoofing methods.

Abstract

The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.


Key findings
Existing anti-spoofing models trained on conventional datasets suffer severe performance degradation when evaluated on replayed audio, with average accuracy dropping to 59.6% and high EERs, especially in open-set conditions. Models trained on the EchoFake dataset demonstrate better cross-dataset generalization, achieving lower average EERs than models trained on other benchmarks. Incorporating replay data during training significantly enhances robustness against replay-based attacks without degrading performance on conventional benchmarks.
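The findings above are reported in terms of the Equal Error Rate, the operating point where the false-accept rate (spoof accepted as bona fide) equals the false-reject rate (bona fide rejected). As a minimal sketch of how that metric is computed from detector scores (illustrative only; the paper does not specify its scoring code, and `compute_eer` is a hypothetical helper):

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """Equal Error Rate: sweep thresholds over all observed scores
    (higher score = more bona-fide-like) and return the rate at the
    point where false-accept and false-reject rates are closest."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])  # false accept
    frr = np.array([(bona_scores < t).mean() for t in thresholds])    # false reject
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy example: well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
bona = rng.normal(2.0, 1.0, 1000)    # bona fide scores, higher on average
spoof = rng.normal(-2.0, 1.0, 1000)  # spoof scores
print(f"EER = {compute_eer(bona, spoof):.3f}")
```

A lower EER means the detector separates bona fide from spoofed audio more cleanly; an EER near 0.5 means the scores are uninformative.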
Approach
The authors construct EchoFake, a new dataset integrating both zero-shot TTS deepfakes and diverse physical replay recordings. The dataset reflects realistic attack scenarios by varying playback/recording devices, environmental conditions, and microphone-speaker distances, yielding four sample categories: bona fide, replayed bona fide, fake, and replayed fake. They then evaluate three baseline detection models on this dataset to demonstrate its practical challenges and benefits for generalization.
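The replayed categories are produced with real loudspeakers, microphones, and rooms, not simulation. Purely to illustrate what a replay channel does to a waveform (convolution with a room/device impulse response plus ambient noise), here is a crude hypothetical sketch; `simulate_replay` and its parameters are assumptions for illustration, not the authors' collection pipeline:

```python
import numpy as np

def simulate_replay(wave, sr=16000, rng=None):
    """Toy replay-channel model: convolve the signal with a short
    synthetic impulse response (direct path + decaying reflections)
    and add low-level ambient noise, then renormalize."""
    rng = rng or np.random.default_rng()
    ir = np.zeros(int(0.05 * sr))              # 50 ms impulse response
    ir[0] = 1.0                                # direct path
    taps = rng.integers(1, len(ir), size=8)    # reflection delays
    ir[taps] += rng.uniform(0.05, 0.3, size=8) * np.exp(-taps / (0.02 * sr))
    out = np.convolve(wave, ir)[: len(wave)]   # apply the channel
    out += rng.normal(0.0, 0.005, size=len(out))  # ambient noise
    return out / (np.abs(out).max() + 1e-9)

# Example: "replay" a dummy 1-second 440 Hz tone.
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
y = simulate_replay(x, rng=np.random.default_rng(0))
```

The point of collecting real replays, as EchoFake does, is that actual device and room responses are far more varied than any such hand-built model, which is why models trained only on clean synthetic speech degrade on replayed audio.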
Datasets
EchoFake, CommonVoice 17.0 (for bona fide samples), ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, In-the-Wild, WaveFake.
Model(s)
RawNet2, AASIST, Wav2Vec2
Author countries
China