Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform

Authors: Yuankun Xie, Ruibo Fu, Xiaopeng Wang, Zhiyong Wang, Ya Li, Zhengqi Wen, Haonnan Cheng, Long Ye

Published: 2025-08-14 11:56:30+00:00

AI Summary

This paper addresses the significant performance degradation of deepfake audio countermeasures (CMs) in cross-domain scenarios, particularly on social media. It introduces the Fake Speech Wild (FSW) dataset, comprising 254 hours of real and deepfake audio from four different media platforms. By establishing a benchmark with self-supervised learning (SSL)-based CMs and employing data augmentation strategies with joint training on public and FSW datasets, the research achieves an average equal error rate (EER) of 3.54% for real-world deepfake audio detection.

Abstract

The rapid advancement of speech generation technology has led to the widespread proliferation of deepfake speech across social media platforms. While deepfake audio countermeasures (CMs) achieve promising results on public datasets, their performance degrades significantly in cross-domain scenarios. To advance CMs for real-world deepfake detection, we first propose the Fake Speech Wild (FSW) dataset, which includes 254 hours of real and deepfake audio from four different media platforms, focusing on social media. As CMs, we establish a benchmark using public datasets and advanced self-supervised learning (SSL)-based CMs to evaluate current CMs in real-world scenarios. We also assess the effectiveness of data augmentation strategies in enhancing CM robustness for detecting deepfake speech on social media. Finally, by augmenting public datasets and incorporating the FSW training set, we significantly advanced real-world deepfake audio detection performance, achieving an average equal error rate (EER) of 3.54% across all evaluation sets.


Key findings
CMs trained only on public datasets degrade significantly on in-the-wild datasets such as ITW and FSW, which highlights the generalization problem in real-world scenarios. Data augmentation, particularly MUSAN & RIR (MR), significantly improves CM robustness on cross-domain datasets. The most effective approach, jointly training XLSR-AASIST on MR-augmented public datasets and the FSW training set, achieves an average EER of 3.54% across all evaluation sets.
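The EER figures above refer to the operating point at which the false acceptance rate (spoofed audio accepted as real) equals the false rejection rate (real audio rejected). A minimal sketch of that computation follows. It assumes higher scores mean "more likely real", and it assumes the reported average is a plain mean of per-set EERs; the function names `compute_eer` and `average_eer` are illustrative and are not taken from the paper.

```python
from statistics import mean


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for one evaluation set.

    Assumption: a higher score means the CM considers the audio more
    likely to be real (bona fide).
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: real audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the first threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def average_eer(per_set_scores):
    """Mean EER over evaluation sets given {name: (bonafide, spoof)}."""
    return mean(compute_eer(b, s)[0] for b, s in per_set_scores.values())


if __name__ == "__main__":
    # Toy scores, not data from the paper.
    real = [0.9, 0.8, 0.2, 0.7]
    fake = [0.1, 0.3, 0.85, 0.4]
    eer, thr = compute_eer(real, fake)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

Plain lists are used here so the sketch runs without dependencies; with large score files, a vectorized version over sorted scores would be the usual choice.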
Approach
The approach constructs the Fake Speech Wild (FSW) dataset from social media platforms to address real-world deepfake detection. A benchmark is established using advanced SSL-based countermeasures, and the effectiveness of noise-based data augmentation strategies (MUSAN & RIR, RawBoost) is assessed. The best performance is obtained by jointly training on augmented public datasets and the FSW training set.
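As a rough illustration of what MUSAN & RIR style augmentation does to a training utterance, the sketch below mixes a noise clip into speech at a target signal-to-noise ratio and then convolves the result with a room impulse response. It is a generic sketch of the technique, not the authors' pipeline: the function names, the SNR value, and the toy signals are all assumptions, and the inputs are assumed to be non-empty with non-zero energy.

```python
import math


def mean_power(x):
    """Average power of a sample sequence."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech so the result has the requested SNR in dB."""
    # Loop the noise clip so it covers the whole utterance.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    # Scale the noise so that P_speech / P_scaled_noise = 10^(snr_db / 10).
    scale = math.sqrt(
        mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10))
    )
    return [s + scale * n for s, n in zip(speech, noise)]


def apply_rir(speech, rir):
    """Convolve speech with an energy-normalized room impulse response.

    The output is truncated to the input length.
    """
    energy = math.sqrt(sum(h * h for h in rir))
    rir = [h / energy for h in rir]
    out = []
    for n in range(len(speech)):
        acc = 0.0
        for k in range(min(len(rir), n + 1)):
            acc += rir[k] * speech[n - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    # Toy signals standing in for a speech clip, a noise clip, and an RIR.
    speech = [math.sin(0.1 * i) for i in range(400)]
    noise = [((i * 37) % 11 - 5) / 5 for i in range(97)]
    rir = [1.0, 0.5, 0.25]
    augmented = apply_rir(add_noise_at_snr(speech, noise, snr_db=10), rir)
    print(len(augmented))
```

In a real training setup the same two steps would typically run on arrays with an FFT-based convolution, with the noise clip, impulse response, and SNR drawn at random per utterance.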
Datasets
Fake Speech Wild (FSW), ASVspoof2019LA (19LA), Codecfake, CFAD, In the Wild (ITW), MUSAN, RIR
Model(s)
AASIST, WavLM-AASIST (using WavLM-Large), XLSR-AASIST (using wav2vec 2.0 XLS-R)
Author countries
China