Wav2DF-TSL: Two-stage Learning with Efficient Pre-training and Hierarchical Experts Fusion for Robust Audio Deepfake Detection

Authors: Yunqi Hao, Yihao Chen, Minqiang Xu, Jianbo Zhan, Liang He, Lei Fang, Sian Fang, Lin Liu

Published: 2025-09-04 12:35:10+00:00

AI Summary

This paper introduces Wav2DF-TSL, a two-stage learning strategy for robust audio deepfake detection. It uses efficient pre-training with adapters to learn spoofed speech artifacts and a hierarchical adaptive mixture of experts (HA-MoE) for multi-level spoofing cue fusion, significantly outperforming state-of-the-art methods.

Abstract

In recent years, self-supervised learning (SSL) models have made significant progress in audio deepfake detection (ADD) tasks. However, existing SSL models mainly rely on large-scale real speech for pre-training and lack the learning of spoofed samples, which leads to susceptibility to domain bias during the fine-tuning process of the ADD task. To this end, we propose a two-stage learning strategy (Wav2DF-TSL) based on pre-training and hierarchical expert fusion for robust audio deepfake detection. In the pre-training stage, we use adapters to efficiently learn artifacts from 3000 hours of unlabelled spoofed speech, improving the adaptability of front-end features while mitigating catastrophic forgetting. In the fine-tuning stage, we propose the hierarchical adaptive mixture of experts (HA-MoE) method to dynamically fuse multi-level spoofing cues through multi-expert collaboration with gated routing. Experimental results show that the proposed method significantly outperforms the baseline system on all four benchmark datasets, especially on the cross-domain In-the-wild dataset, achieving a 27.5% relative improvement in equal error rate (EER), outperforming the existing state-of-the-art systems.

Index Terms: audio deepfake detection, self-supervised learning, parameter-efficient fine-tuning, mixture of experts


Key findings
Wav2DF-TSL significantly outperforms both the baseline and state-of-the-art methods on all four benchmark datasets, most notably on the cross-domain In-the-wild dataset, where it achieves a 27.5% relative improvement in EER over the baseline. Ablations show that both the two-stage learning strategy and HA-MoE contribute to the gains in robustness and generalization.
Approach
Wav2DF-TSL employs a two-stage approach. The first stage uses adapters for efficient pre-training on unlabeled spoofed speech to improve feature adaptability. The second stage utilizes a hierarchical adaptive mixture of experts (HA-MoE) to fuse multi-level spoofing cues from the pre-trained model.
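The gated-routing fusion in the second stage can be illustrated with a small numpy sketch. This is not the paper's implementation: the layer count, feature dimension, expert form (plain linear maps), and mean-pooled router input are all illustrative assumptions; the idea shown is only that a router derives gate weights which blend several experts' views of multi-layer SSL features.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedExpertFusion:
    """Toy gated mixture-of-experts over multi-layer SSL features (illustrative,
    not the paper's HA-MoE). Each expert is a linear map; a router turns a
    pooled feature summary into gate weights; the output is the gate-weighted
    sum of expert outputs."""

    def __init__(self, dim, n_experts):
        self.experts = [rng.standard_normal((dim, dim)) * 0.02
                        for _ in range(n_experts)]
        self.router = rng.standard_normal((dim, n_experts)) * 0.02

    def __call__(self, layer_feats):
        # layer_feats: (n_layers, time, dim) hidden states from the SSL front-end
        pooled = layer_feats.mean(axis=(0, 1))     # (dim,) summary used for routing
        gates = softmax(pooled @ self.router)      # (n_experts,) gate weights, sum to 1
        merged = layer_feats.mean(axis=0)          # (time, dim) layer-averaged features
        out = sum(g * (merged @ W) for g, W in zip(gates, self.experts))
        return out, gates

fusion = GatedExpertFusion(dim=16, n_experts=4)
feats = rng.standard_normal((24, 50, 16))          # e.g. 24 layers, 50 frames, 16-dim
out, gates = fusion(feats)
```

In the actual system the fused representation would feed the downstream classifier (AASIST in this paper), and the router would be trained jointly with the experts.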
Datasets
ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, In-the-wild, and AudioFake (a custom dataset of 3000 hours of unlabeled spoofed speech)
Model(s)
XLSR-0.3B (a wav2vec 2.0-based model), hierarchical adaptive mixture of experts (HA-MoE), AASIST classifier, LoRA, and convolutional adapters.
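The LoRA-style adapter tuning used in the pre-training stage can be sketched as follows. Shapes, rank, and scaling here are illustrative assumptions, not the paper's settings; the point is that only the low-rank factors A and B are trainable while the pre-trained weight W stays frozen, which is what limits catastrophic forgetting.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_forward(x, W, A, B, alpha):
    """Frozen weight W plus a low-rank LoRA update (alpha/r) * B @ A.
    Only A and B would receive gradients during adaptation."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

d_in, d_out, r = 16, 16, 4
W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.02       # trainable down-projection
B = np.zeros((d_out, r))                        # trainable up-projection, zero-init
x = rng.standard_normal((8, d_in))              # batch of 8 feature vectors

y = lora_forward(x, W, A, B, alpha=8)
# With B zero-initialised, the LoRA branch contributes nothing at the start,
# so the adapted layer initially reproduces the frozen layer exactly.
```

Zero-initialising B is the standard LoRA trick that makes adaptation start from the pre-trained model's behaviour; convolutional adapters follow the same frozen-backbone principle with small bottleneck convolutions instead of low-rank factors.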
Author countries
China