Toward Noise-Aware Audio Deepfake Detection: Survey, SNR-Benchmarks, and Practical Recipes

Authors: Udayon Sen, Alka Luqman, Anupam Chattopadhyay

Published: 2025-12-15 02:22:37+00:00

AI Summary

This paper surveys and evaluates the robustness of state-of-the-art audio deepfake detection models against background noise by introducing a reproducible benchmark framework across controlled Signal-to-Noise Ratios (SNRs). It mixes ASVspoof 2021 DF utterances with MS-SNSD noises to quantify performance degradation from near-clean (35 dB) to very noisy (-5 dB). Finetuning the encoders substantially improved noise robustness over frozen baselines, particularly at low SNRs.

Abstract

Deepfake audio detection has progressed rapidly with strong pre-trained encoders (e.g., WavLM, Wav2Vec2, MMS). However, performance in realistic capture conditions - background noise (domestic/office/transport), room reverberation, and consumer channels - often lags clean-lab results. We survey and evaluate robustness for state-of-the-art audio deepfake detection models and present a reproducible framework that mixes MS-SNSD noises with ASVspoof 2021 DF utterances to evaluate under controlled signal-to-noise ratios (SNRs). SNR is a measured proxy for noise severity used widely in speech; it lets us sweep from near-clean (35 dB) to very noisy (-5 dB) to quantify graceful degradation. We study multi-condition training and fixed-SNR testing for pretrained encoders (WavLM, Wav2Vec2, MMS), reporting accuracy, ROC-AUC, and EER on binary and four-class (authenticity x corruption) tasks. In our experiments, finetuning reduces EER by 10-15 percentage points at 10-0 dB SNR across backbones.
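
To make the controlled-SNR idea concrete, here is a minimal sketch of mixing a noise clip into an utterance at a target SNR. It is not the authors' pipeline: the function names (`mix_at_snr`, `measured_snr_db`), the noise tiling strategy, the toy sine/Gaussian signals, and the intermediate SNR values in the loop are all assumptions for illustration. Only the 35 dB and -5 dB endpoints come from the abstract.

```python
import math
import random


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    speech = list(speech)
    noise = list(noise)
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Assumption: tile a short noise clip, then crop it to the utterance length.
    repeats = math.ceil(len(speech) / len(noise))
    noise = (noise * repeats)[:len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must both have non-zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture when the clean speech is known."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Toy stand-ins for an utterance and a noise clip (not real audio).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]

# Hypothetical sweep; the intermediate values are illustrative.
for snr_db in (35, 20, 10, 0, -5):
    mixture = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, mixture), 3))
```

Scaling the noise rather than the speech keeps the utterance level fixed across the sweep, so any change in detector output can be attributed to the noise level alone.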

Key findings

Finetuning the pretrained encoders substantially improved robustness, reducing the Equal Error Rate (EER) by 10–15 percentage points at 10–0 dB SNR across all backbones compared to frozen baselines. Frozen models were found to be brittle under noise, often confusing noisy real speech with noisy spoof speech, a problem mitigated by four-class supervision (authenticity x corruption). WavLM generally attained the highest mixed-test ROC-AUCs and proved the most robust among the frozen baselines.
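
For readers unfamiliar with the metric, the sketch below shows one simple way to estimate an Equal Error Rate from detector scores. It assumes higher scores mean "more likely spoof" and uses a plain threshold sweep; the function name `equal_error_rate` and the example scores are hypothetical, and the paper's own scoring code may differ (for instance by interpolating between thresholds).

```python
def equal_error_rate(spoof_scores, bonafide_scores):
    """Approximate EER for scores where higher means "more likely spoof".

    Returns (eer, threshold). Uses a quadratic-time sweep over every observed
    score, which is fine for a sketch but slow for large evaluation sets.
    """
    if not spoof_scores or not bonafide_scores:
        raise ValueError("need at least one score per class")
    candidates = sorted(set(spoof_scores) | set(bonafide_scores))
    candidates.append(float("inf"))  # threshold that rejects nothing as spoof
    best = None
    for threshold in candidates:
        # False alarm: bona fide speech flagged as spoof.
        false_alarm_rate = sum(s >= threshold for s in bonafide_scores) / len(bonafide_scores)
        # Miss: spoofed speech accepted as bona fide.
        miss_rate = sum(s < threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(false_alarm_rate - miss_rate)
        if best is None or gap < best[0]:
            best = (gap, (false_alarm_rate + miss_rate) / 2, threshold)
    return best[1], best[2]


# Made-up scores for illustration only.
eer, threshold = equal_error_rate(
    spoof_scores=[0.9, 0.8, 0.7, 0.3],
    bonafide_scores=[0.6, 0.2, 0.1, 0.05],
)
print(eer, threshold)
```

The function returns the EER as a fraction; multiply by 100 to compare with figures quoted in percentage points.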

Approach

The study uses self-supervised speech encoders (WavLM, Wav2Vec2, MMS) and evaluates them using binary (real vs. spoof) and four-class (authenticity x corruption) classification tasks. Noise robustness is assessed using fixed-SNR test sets, generated by mixing ASVspoof 2021 DF data with MS-SNSD noise, while models are trained using multi-condition, on-the-fly augmentation. Performance is compared between frozen (head-only) and end-to-end finetuned configurations.
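
One way to read the four-class (authenticity x corruption) task is as a 2 x 2 label grid that can be collapsed back to a binary decision. The sketch below assumes "corruption" means clean versus noisy and picks an arbitrary class ordering; the names `CLASS_NAMES`, `four_class_label`, `to_binary_label`, and `spoof_probability` are hypothetical and not taken from the paper.

```python
# Assumed class ordering; the paper's own index layout may differ.
CLASS_NAMES = ("real_clean", "real_noisy", "spoof_clean", "spoof_noisy")


def four_class_label(is_spoof, is_noisy):
    """Encode (authenticity, corruption) as a single class index in 0..3."""
    return 2 * int(is_spoof) + int(is_noisy)


def to_binary_label(four_class):
    """Collapse a four-class index to binary: 0 = real, 1 = spoof."""
    if four_class not in range(4):
        raise ValueError("four_class must be in 0..3")
    return four_class // 2


def spoof_probability(class_probs):
    """Sum the two spoof-class probabilities to get a binary spoof score."""
    if len(class_probs) != 4:
        raise ValueError("expected four class probabilities")
    return class_probs[2] + class_probs[3]


label = four_class_label(is_spoof=False, is_noisy=True)
print(CLASS_NAMES[label], to_binary_label(label))
```

Keeping noisy real speech and noisy spoofed speech in separate classes gives the classifier an explicit target for the distinction that, per the key findings, frozen models tend to blur.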

Datasets

ASVspoof 2021 DF, MS-SNSD

Model(s)

WavLM-base+, Wav2Vec2-base, MMS-300M

Author countries

Singapore
