Toward Noise-Aware Audio Deepfake Detection: Survey, SNR-Benchmarks, and Practical Recipes

Authors: Udayon Sen, Alka Luqman, Anupam Chattopadhyay

Published: 2025-12-15 02:22:37+00:00

Comment: 6 pages

AI Summary

This paper addresses the performance degradation of state-of-the-art audio deepfake detection models in noisy, realistic capture conditions. It introduces a reproducible framework that mixes MS-SNSD noises with ASVspoof 2021 DF utterances to evaluate robustness at controlled signal-to-noise ratios (SNRs). The study surveys and benchmarks pretrained encoders and reports that finetuning reduces EER by 10-15 percentage points at SNRs between 10 and 0 dB.

Abstract

Deepfake audio detection has progressed rapidly with strong pretrained encoders (e.g., WavLM, Wav2Vec2, MMS). However, performance in realistic capture conditions, such as background noise (domestic/office/transport), room reverberation, and consumer channels, often lags clean-lab results. We survey and evaluate robustness for state-of-the-art audio deepfake detection models and present a reproducible framework that mixes MS-SNSD noises with ASVspoof 2021 DF utterances to evaluate under controlled signal-to-noise ratios (SNRs). SNR is a measured proxy for noise severity that is widely used in speech processing; it lets us sweep from near-clean (35 dB) to very noisy (-5 dB) to quantify how gracefully performance degrades. We study multi-condition training and fixed-SNR testing for pretrained encoders (WavLM, Wav2Vec2, MMS), reporting accuracy, ROC-AUC, and EER on binary and four-class (authenticity × corruption) tasks. In our experiments, finetuning reduces EER by 10-15 percentage points at SNRs between 10 and 0 dB across backbones.
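
The mixing step described in the abstract can be illustrated with a minimal sketch. It assumes SNR is defined as the ratio of mean speech power to mean noise power over the whole utterance, which may differ from the paper's exact recipe (MS-SNSD tooling, for example, can normalize levels differently). The function name `mix_at_snr` and its parameters are hypothetical and not taken from the paper.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db, rng=None):
    """Add noise to speech so that the mixture has the requested SNR in dB.

    Assumption: SNR = 10 * log10(mean speech power / mean noise power),
    computed over the full utterance.
    """
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the noise clip if it is shorter than the utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)

    # Take a random crop of the noise with the same length as the speech.
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")

    # Scale the noise so that speech_power / scaled_noise_power hits the target.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Calling `mix_at_snr(speech, noise, snr_db=0)` would yield a mixture in which speech and noise have equal power, while `snr_db=35` leaves the utterance close to clean.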


Key findings

Finetuning pretrained encoders with multi-condition training significantly improves noise robustness, reducing Equal Error Rate (EER) by 10-15 percentage points at SNRs between 10 and 0 dB across backbones. Explicit four-class supervision (authenticity × corruption) helps disentangle authenticity cues from noise cues, while frozen SSL encoders are not inherently noise-robust and degrade sharply at lower SNRs.
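
For readers unfamiliar with the metric, EER is the operating point at which the false-acceptance rate equals the false-rejection rate. The sketch below shows one common way to estimate it from detector scores; it is not the paper's evaluation code. It assumes higher scores mean "more likely spoof" and that label 1 marks spoofed audio, and the name `compute_eer` is hypothetical.

```python
import numpy as np


def compute_eer(labels, scores):
    """Estimate the Equal Error Rate from binary labels and detector scores.

    Assumptions: label 1 = spoof (positive class), label 0 = bona fide,
    and a higher score means the detector considers the clip more spoof-like.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)

    # Candidate thresholds: every observed score, plus +inf ("flag nothing").
    thresholds = np.concatenate([np.unique(scores), [np.inf]])

    # False-acceptance rate: bona fide clips flagged as spoof.
    far = np.array([np.mean(scores[~labels] >= t) for t in thresholds])
    # False-rejection rate: spoofed clips that slip below the threshold.
    frr = np.array([np.mean(scores[labels] < t) for t in thresholds])

    # Pick the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```

Because EER is a rate, an improvement stated in percentage points is an absolute difference: under this reading, a drop from a hypothetical 30% EER to 18% would be a 12-point reduction.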

Approach

The authors develop a reproducible framework that mixes MS-SNSD noises with ASVspoof 2021 DF utterances to create a dataset with controlled SNRs ranging from 35 dB down to -5 dB. They then evaluate pretrained self-supervised speech encoders (WavLM, Wav2Vec2, MMS) on binary (real vs. spoof) and four-class (authenticity × corruption) tasks, comparing frozen and finetuned configurations.
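
One way to picture the four-class task is as the cross product of two binary attributes. The sketch below assumes the corruption axis is simply clean versus noisy and that the classes are ordered as shown; the summary does not spell out the paper's exact class definitions or ordering, so `CLASSES`, `four_class_label`, and `spoof_probability` are illustrative names only.

```python
import numpy as np

# Assumed class layout: index = 2 * is_spoof + is_noisy.
CLASSES = [
    ("bonafide", "clean"),
    ("bonafide", "noisy"),
    ("spoof", "clean"),
    ("spoof", "noisy"),
]


def four_class_label(is_spoof, is_noisy):
    """Map the two binary attributes to a single four-class index."""
    return 2 * int(is_spoof) + int(is_noisy)


def spoof_probability(probs):
    """Collapse four-class probabilities to a binary spoof score.

    Sums the two spoof classes, i.e. marginalizes over the corruption axis.
    Works on a single vector or on a batch of shape (n, 4).
    """
    probs = np.asarray(probs, dtype=np.float64)
    return probs[..., 2] + probs[..., 3]
```

Under this layout a four-class model can still be scored on the binary task: the marginal spoof probability serves as the detector score for metrics such as ROC-AUC and EER.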

Datasets

ASVspoof 2021 DF, MS-SNSD (Microsoft Scalable Noisy Speech Dataset)

Model(s)

WavLM-base+, Wav2Vec2-base, MMS-300M

Author countries

Singapore