A Data-Centric Approach to Generalizable Speech Deepfake Detection

Authors: Wen Huang, Yuchen Mao, Yanmin Qian

Published: 2025-12-20 04:28:33+00:00

AI Summary

The paper proposes a data-centric approach to improving generalizable Speech Deepfake Detection (SDD), demonstrating that diversity in source speakers and synthetic generators is more critical than raw data volume. The authors introduce the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data via pruning (DOSS-Select) or re-weighting (DOSS-Weight). The optimal DOSS-Weight strategy achieves state-of-the-art generalization performance, with high data and model efficiency, on public benchmarks and on a new challenge set built from commercial TTS APIs.

Abstract

Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model-centric and algorithm-centric solutions, the impact of data composition is often underexplored. This paper proposes a data-centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. To address the first perspective, we conduct a large-scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. To address the second, we propose the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting). Our experiments show that DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k-hour curated data pool using the optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of various commercial APIs.


Key findings
Diversity is the primary driver of generalization and follows predictable power laws: investing in generator diversity mainly improves identification, while source diversity mainly improves discrimination. DOSS-Select achieved high data efficiency, outperforming naive aggregation while using only 3% of the available data. The final DOSS-Weight-trained model (XLS-R-1B, 12k hours) surpassed a large-scale baseline (XLS-R-2B, 74k hours) on public benchmarks and demonstrated superior robustness on commercial APIs.
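The exact fitted scaling curves are not reproduced in this summary; as an illustration only, findings of this kind are typically described by a saturating power law in the diversity variable, for example

$$\mathrm{EER}(D) \;\approx\; a\,D^{-b} + c, \qquad a,\, b > 0,$$

where D is the number of distinct sources or generators and c is an irreducible error floor; the constants here are placeholders, not values from the paper.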
Approach
They first characterized data scaling laws, finding that source and generator diversity are the primary drivers of generalization. They then proposed the Diversity-Optimized Sampling Strategy (DOSS) to manage heterogeneous training pools by enforcing a near-uniform distribution across fine-grained domains (source and generator combinations). DOSS is implemented as DOSS-Select (pruning) and DOSS-Weight (sampling re-weighting).
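Illustrative sketch (not the authors' released code): assuming each training utterance is tagged with its fine-grained (source, generator) domain, DOSS-Weight can be read as flattening the sampling distribution toward uniform over domains, and DOSS-Select as capping each domain at a fixed budget; the function names and the budget parameter below are hypothetical.

```python
from collections import Counter

def doss_weight_probs(domain_labels):
    """Hypothetical DOSS-Weight sketch: per-utterance sampling probabilities
    that make the expected draw near-uniform over (source, generator) domains.

    domain_labels: list of (source, generator) tuples, one per utterance.
    Returns probabilities summing to 1; utterances from over-represented
    domains are down-weighted, rare domains are up-weighted.
    """
    counts = Counter(domain_labels)      # utterances per domain
    num_domains = len(counts)
    return [1.0 / (num_domains * counts[d]) for d in domain_labels]

def doss_select(domain_labels, per_domain_budget=500):
    """Hypothetical DOSS-Select sketch: prune the pool by keeping at most
    `per_domain_budget` utterances per domain (first-come here; a random
    subset per domain would serve the same purpose)."""
    kept, seen = [], Counter()
    for idx, dom in enumerate(domain_labels):
        if seen[dom] < per_domain_budget:
            kept.append(idx)
            seen[dom] += 1
    return kept
```

The per-utterance probabilities could then drive a weighted sampler (e.g., torch.utils.data.WeightedRandomSampler) so that mini-batches approximate the flattened domain distribution.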
Datasets
A curated 12k-hour data pool aggregating 17 public SDD datasets (e.g., ASVspoof, ADD, SpeechFake) and self-generated data. Evaluation uses public benchmarks and a new challenge set generated from 9 commercial TTS APIs (e.g., Google, OpenAI, Qwen3).
Model(s)
An XLS-R self-supervised backbone (300M and 1B parameter variants) fine-tuned with a temporal average pooling layer and an MLP classifier head.
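A minimal sketch of the described architecture, assuming a Hugging Face XLS-R checkpoint (facebook/wav2vec2-xls-r-300m is used here for illustration); the MLP hidden size and two-class output are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SDDClassifier(nn.Module):
    """XLS-R backbone -> temporal average pooling -> MLP classifier head."""

    def __init__(self, backbone_name="facebook/wav2vec2-xls-r-300m",
                 mlp_hidden=256, num_classes=2):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(backbone_name)
        feat_dim = self.backbone.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(feat_dim, mlp_hidden),
            nn.ReLU(),
            nn.Linear(mlp_hidden, num_classes),
        )

    def forward(self, waveform):
        # waveform: (batch, samples) of raw 16 kHz audio
        frames = self.backbone(waveform).last_hidden_state  # (B, T, D)
        pooled = frames.mean(dim=1)                         # temporal average pooling
        return self.head(pooled)                            # (B, num_classes) logits

# Example: logits = SDDClassifier()(torch.randn(2, 16000))  # two one-second clips
```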
Author countries
China