Diffusion Reconstruction towards Generalizable Audio Deepfake Detection

Authors: Bo Cheng, Songjun Cao, Xiaoming Zhang, Jie Chen, Long Ma, Fei Chen

Published: 2026-04-29 09:21:26+00:00

Comment: 5 pages, this paper was submitted to Interspeech 2026 for review

AI Summary

This paper addresses the challenge of robust generalization in Audio Deepfake Detection (ADD) by proposing a framework centered on hard sample classification. It uses diffusion-based reconstruction to generate challenging samples and improves generalizability through multi-layer feature aggregation and a Regularization-Assisted Contrastive Learning (RACL) objective. Experiments show that the full framework generalizes better to unseen attacks, reducing the average Equal Error Rate (EER) by 22.604% relative to the baseline.

Abstract

Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample classification. The core idea is that a model capable of distinguishing challenging hard samples is inherently equipped to handle simpler cases effectively. We investigate multiple reconstruction paradigms, identifying the diffusion-based method as optimal for generating hard samples. Furthermore, we leverage multi-layer feature aggregation and introduce a Regularization-Assisted Contrastive Learning (RACL) objective to enhance generalizability. Experiments demonstrate the superior generalization of our approach, with our best model achieving a significant reduction in the average Equal Error Rate (EER) compared to the baseline.


Key findings
The diffusion-based reconstruction method proved optimal for generating hard samples, leading to superior generalization across diverse unseen attacks. The full framework, integrating multi-layer aggregation and Regularization-Assisted Contrastive Learning (RACL) with diffusion reconstruction, achieved a significant 22.604% relative reduction in average EER compared to the baseline. Ablation studies confirmed that RACL enhances feature compactness and better separates hard samples, improving the model's generalizability.
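The two metrics quoted above can be made concrete with a small sketch. It assumes that higher scores mean "more likely bona fide" and that the EER is read off at the threshold where the false acceptance and false rejection rates are closest; the function names `compute_eer` and `relative_reduction` and all numbers are hypothetical illustrations, not the authors' evaluation code or results.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Approximate Equal Error Rate from two score lists.

    Assumption: a higher score means 'more likely bona fide'.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Candidate thresholds: every distinct observed score, ascending.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide utterances scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed utterances scored at or above it.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# Toy scores (made up for illustration only).
eer = compute_eer([0.9, 0.4], [0.6, 0.1])
print(f"EER: {eer:.3f}")
print(f"Relative reduction: {relative_reduction(10.0, 8.0):.1f}%")
```

A relative reduction is measured against the baseline's own EER, so the same percentage corresponds to different absolute gains depending on how high the baseline error was.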
Approach
The approach centers on hard sample classification: challenging audio deepfake samples are generated with diffusion-based reconstruction (specifically SemantiCodec). For detection, features are extracted with a frozen XLS-R 300M model and passed to an AASIST classifier. Multi-layer feature aggregation and a Regularization-Assisted Contrastive Learning (RACL) objective are added to enhance generalizability.
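One common way to aggregate features from several layers of a frozen encoder is a softmax-weighted sum over the per-layer hidden states. The summary does not say which aggregation scheme the paper uses, so the sketch below is an assumption for illustration only: `aggregate_layers`, the learnable `layer_logits`, and the toy shapes are all hypothetical.

```python
import numpy as np

def aggregate_layers(hidden_states, layer_logits):
    """Softmax-weighted sum over encoder layers (illustrative assumption).

    hidden_states: array of shape (num_layers, num_frames, feature_dim)
    layer_logits:  array of shape (num_layers,), one learnable scalar per layer
    Returns an array of shape (num_frames, feature_dim).
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    logits = np.asarray(layer_logits, dtype=float)
    # Numerically stable softmax so the layer weights sum to 1.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Contract the layer axis: sum_l weights[l] * hidden_states[l].
    return np.tensordot(weights, hidden_states, axes=(0, 0))

# Toy example: 4 layers, 5 frames, 8-dimensional features (made-up sizes).
rng = np.random.default_rng(0)
states = rng.normal(size=(4, 5, 8))
fused = aggregate_layers(states, np.zeros(4))  # equal logits give a plain average
print(fused.shape)
```

In such a scheme the encoder stays frozen and only the per-layer weights (plus the downstream classifier) are trained, which lets the detector draw on both lower-level acoustic and higher-level layers.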
Datasets
ASVspoof 2019 LA eval, CodecFake, DiffSSD, WaveFake, In-the-Wild (ITW)
Model(s)
XLS-R 300M, AASIST
Author countries
China