AFSS: Artifact-Focused Self-Synthesis for Mitigating Bias in Audio Deepfake Detection

Authors: Hai-Son Nguyen-Le, Hung-Cuong Nguyen-Thanh, Nhien-An Le-Khac, Dinh-Thuc Nguyen, Hong-Hanh Nguyen-Le

Published: 2026-03-27 13:36:11+00:00

Comment: Accepted at International Joint Conference on Neural Networks 2026

AI Summary

This paper introduces Artifact-Focused Self-Synthesis (AFSS), a novel method to mitigate bias in audio deepfake detection and improve generalization across unseen datasets. AFSS generates pseudo-fake samples from real audio using self-conversion and self-reconstruction, enforcing same-speaker constraints to compel the detector to focus solely on generation artifacts. Additionally, a learnable reweighting loss dynamically emphasizes synthetic samples during training, allowing the method to achieve state-of-the-art performance without relying on pre-collected fake datasets.

Abstract

The rapid advancement of generative models has enabled highly realistic audio deepfakes, yet current detectors suffer from a critical bias problem, leading to poor generalization across unseen datasets. This paper proposes Artifact-Focused Self-Synthesis (AFSS), a method designed to mitigate this bias by generating pseudo-fake samples from real audio via two mechanisms: self-conversion and self-reconstruction. The core insight of AFSS lies in enforcing same-speaker constraints, ensuring that real and pseudo-fake samples share identical speaker identity and semantic content. This forces the detector to focus exclusively on generation artifacts rather than irrelevant confounding factors. Furthermore, we introduce a learnable reweighting loss to dynamically emphasize synthetic samples during training. Extensive experiments across 7 datasets demonstrate that AFSS achieves state-of-the-art performance with an average EER of 5.45%, including a significant reduction to 1.23% on WaveFake and 2.70% on In-the-Wild, all while eliminating the dependency on pre-collected fake datasets. Our code is publicly available at https://github.com/NguyenLeHaiSonGit/AFSS.


Key findings
AFSS achieves state-of-the-art cross-domain generalization with an average EER of 5.45% and an average AUC of 98.15% across seven evaluation datasets. It demonstrates significant performance improvements on challenging real-world datasets, achieving 1.23% EER on WaveFake and 2.70% EER on In-the-Wild. Crucially, the method eliminates the dependency on pre-collected fake datasets by generating all training samples from real audio alone.
Approach
AFSS mitigates bias by generating pseudo-fake samples with authentic forgery artifacts while controlling confounding factors. It uses self-conversion (same-speaker voice conversion) and self-reconstruction (processing audio through neural vocoders) to create synthetic data. These methods ensure real and pseudo-fake samples share identical speaker identity and semantic content, forcing the detector to learn universal generation artifacts.
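The data-construction idea above can be illustrated with a minimal sketch. This is not the authors' implementation: `self_reconstruct`, `build_training_pairs`, and the toy "vocoder" below are hypothetical stand-ins; in the paper, pseudo-fakes come from kNN-VC (self-conversion) and neural vocoders such as HiFiGAN (self-reconstruction).

```python
import numpy as np

def self_reconstruct(waveform, vocoder):
    # Analysis-synthesis round trip: the vocoder re-synthesizes the real
    # waveform, preserving speaker identity and content while imprinting
    # its generation artifacts. Here "vocoder" is any callable stand-in.
    return vocoder(waveform)

def build_training_pairs(real_waveforms, vocoders):
    # Each real clip yields one real sample (label 0) plus one pseudo-fake
    # per vocoder (label 1). Because every pair shares the same speaker and
    # semantics, the only discriminative signal left is the artifact itself.
    pairs = []
    for wav in real_waveforms:
        pairs.append((wav, 0))
        for voc in vocoders:
            pairs.append((self_reconstruct(wav, voc), 1))
    return pairs

# Toy demonstration with a noisy identity map standing in for a vocoder.
rng = np.random.default_rng(0)
real = [rng.standard_normal(16000) for _ in range(2)]
vocoders = [lambda w: w + 0.01 * rng.standard_normal(w.shape)]
pairs = build_training_pairs(real, vocoders)
```

Note that no pre-collected fake dataset appears anywhere in this pipeline: all positive (fake) examples are derived from the real audio itself.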
Datasets
ASVspoof 2019 LA (training and development), ASVspoof 2019 LA-Eval, ASVspoof 2021 LA (evaluation and hidden), ASVspoof 2021 DF (evaluation and hidden), WaveFake, In-the-Wild
Model(s)
XLS-R (front-end feature extractor), CNN-based classifier (back-end: ReLU, Dropout, Mean Pooling, Dense layer). For pseudo-fake generation: kNN-VC (self-conversion), HiFiGAN, WaveGlow, Hn-NSF, NSF-HiFiGAN (self-reconstruction).
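The learnable reweighting loss mentioned in the summary and abstract can be sketched as a weighted binary cross-entropy in which a trainable scalar controls the weight of pseudo-fake samples. The exact parameterization below (a softplus-transformed scalar `alpha`) is an assumption for illustration, not the paper's formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reweighted_bce(logits, labels, alpha):
    # alpha is a learnable scalar; softplus keeps the fake-sample weight
    # positive. Pseudo-fakes (label 1) are emphasized dynamically as alpha
    # grows, while real samples (label 0) keep unit weight.
    w_fake = np.log1p(np.exp(alpha))  # softplus(alpha)
    p = sigmoid(logits)
    per_sample = -(labels * np.log(p + 1e-9)
                   + (1 - labels) * np.log(1 - p + 1e-9))
    weights = np.where(labels == 1, w_fake, 1.0)
    return float(np.mean(weights * per_sample))

# Two samples: one pseudo-fake (label 1), one real (label 0).
loss_base = reweighted_bce(np.array([2.0, -2.0]), np.array([1, 0]), 0.0)
loss_up = reweighted_bce(np.array([2.0, -2.0]), np.array([1, 0]), 2.0)
```

In training, gradients would flow into `alpha` alongside the detector's parameters, letting the model itself decide how strongly to emphasize the synthetic samples.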
Author countries
Vietnam, Ireland