Environmental Sound Deepfake Detection Using Deep-Learning Framework

Authors: Lam Pham, Khoi Vu, Dat Tran, Phat Lam, Vu Nguyen, David Fischinger, Alexander Schindler, Martin Boyer, Son Le

Published: 2026-04-21 16:41:55+00:00

AI Summary

This paper proposes a deep-learning framework for environmental sound deepfake detection (ESDD), focusing on identifying fake sound scenes and fake sound events. The authors conduct extensive experiments to evaluate the impact of different spectrograms, network architectures, and pre-trained models. Their best model, finetuned from the pre-trained BEATs model with a three-stage training strategy, reaches 0.98 Accuracy and 0.99 AUC on the EnvSDD Test subset and 0.88 Accuracy and 0.92 AUC on ESDD-Challenge-TestSet.

Abstract

In this paper, we propose a deep-learning framework for environmental sound deepfake detection (ESDD) -- the task of identifying whether the sound scene and sound event in an input audio recording are fake or not. To this end, we conducted extensive experiments to explore how individual spectrograms, a wide range of network architectures and pre-trained models, and ensembles of spectrograms or network architectures affect ESDD performance. The experimental results on the benchmark datasets of EnvSDD and ESDD-Challenge-TestSet indicate that detecting deepfake sound scenes and detecting deepfake sound events should be treated as separate tasks. We also show that finetuning a pre-trained model is more effective than training a model from scratch for the ESDD task. Finally, our best model, finetuned from the pre-trained WavLM model with the proposed three-stage training strategy, achieves an Accuracy of 0.98, an F1 score of 0.95, and an AUC of 0.99 on the EnvSDD Test subset, and an Accuracy of 0.88, an F1 score of 0.77, and an AUC of 0.92 on the ESDD-Challenge-TestSet dataset.


Key findings
The study reveals that detecting deepfake sound scenes and detecting deepfake sound events should be considered distinct tasks, although a model trained on sound events can generalize well to sound scenes. Finetuning pre-trained models proves more effective for ESDD than training from scratch. The proposed finetuned BEATs model achieved an Accuracy of 0.98, an F1 score of 0.95, and an AUC of 0.99 on the EnvSDD Test subset, and an Accuracy of 0.88, an F1 score of 0.77, and an AUC of 0.92 on ESDD-Challenge-TestSet.
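For reference, the sketch below shows how these three numbers are typically computed for a binary detector. It assumes label 1 marks a fake clip, per-clip fake probabilities from the model, and a 0.5 decision threshold; none of these conventions are stated in the summary itself.

```python
# Minimal metric sketch, assuming binary labels (1 = fake) and per-clip
# fake probabilities; the 0.5 threshold is an illustrative assumption.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def esdd_metrics(y_true, p_fake, threshold=0.5):
    """Accuracy and F1 at a fixed threshold, plus threshold-free AUC."""
    y_pred = (np.asarray(p_fake) >= threshold).astype(int)
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, p_fake),
    }
```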
Approach
The approach involves transforming audio into spectrograms (CQT, MEL, Gammatone), applying Mixup data augmentation, and feeding the result into a deep neural network backbone (e.g., EfficientNetB1) followed by an MLP for binary classification. A key contribution is a three-stage training strategy that combines A-Softmax, Contrastive, and Central losses in the first stage, followed by Cross-Entropy with Mixup and then Cross-Entropy without Mixup in the later stages; this strategy is especially effective when finetuning a pre-trained model such as BEATs. Rough sketches of the front end and of the staged losses follow below.
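The first sketch illustrates the front end. It assumes librosa for the MEL and CQT branches (a Gammatone filterbank would need a separate library) and the standard Mixup formulation; the sample rate, bin counts, and Beta parameter are illustrative assumptions, not the paper's settings.

```python
# Front-end sketch: log-scaled spectrograms plus Mixup augmentation.
# All hyperparameters here are assumptions, not the paper's values.
import numpy as np
import librosa

def to_spectrograms(wav, sr=16000):
    """Log-scaled MEL and CQT representations of one audio clip."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=128)
    cqt = np.abs(librosa.cqt(y=wav, sr=sr, n_bins=84))
    return librosa.power_to_db(mel), librosa.amplitude_to_db(cqt)

def mixup(x1, y1, x2, y2, alpha=0.4):
    """Mixup: convex combination of two spectrogram/label pairs."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```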
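The second sketch conveys the staging of the losses in PyTorch. The Contrastive and Central terms are simplified stand-ins, and the margin-based A-Softmax term is approximated by plain cross-entropy, so this shows only the schedule, not the paper's exact loss formulations.

```python
# Schematic three-stage loss schedule. CenterLoss and contrastive_loss are
# simplified stand-ins; A-Softmax is approximated by plain cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Pull each embedding toward a learned center of its class."""
    def __init__(self, num_classes=2, dim=128):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, emb, labels):
        return ((emb - self.centers[labels]) ** 2).sum(dim=1).mean()

def contrastive_loss(emb, labels, margin=1.0):
    """Pairwise contrastive loss over all embedding pairs in the batch."""
    d = torch.cdist(emb, emb)
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    return (same * d.pow(2) + (1.0 - same) * F.relu(margin - d).pow(2)).mean()

def stage_loss(stage, emb, logits, labels, center_loss):
    """Stage 1: combined metric losses; stages 2 and 3: cross-entropy
    (Mixup is applied to the inputs in stage 2 and disabled in stage 3)."""
    ce = F.cross_entropy(logits, labels)  # stands in for A-Softmax here
    if stage == 1:
        return ce + contrastive_loss(emb, labels) + center_loss(emb, labels)
    return ce
```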
Datasets
EnvSDD, ESDD-Challenge-TestSet
Model(s)
BEATs (pre-trained; used in the methods and results), WavLM (named in the abstract), ResNet50, InceptionV3, EfficientNetB1, DenseNet161
Author countries
Austria, Vietnam